Empty results from MongoDB Atlas (no document returned) #131

Closed
why-not-try-calmer opened this issue Jul 17, 2022 · 19 comments

Comments

@why-not-try-calmer
Contributor

why-not-try-calmer commented Jul 17, 2022

I really hope this is unrelated to #130 (edit: apparently it is unrelated, since Stackage does not yet provide the version implementing that PR), but after switching to Mongo Atlas for a side project I started running into serious problems reading data from a Cursor (writing and everything else works fine). Consider the last few lines in the snippet below:

{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE OverloadedStrings #-}

import Control.Exception (SomeException (..), try)
import Control.Monad (when)
import Data.Functor ((<&>))
import Data.Text (Text)
import Database.MongoDB

-- Connection details (elided in this report)
userName, userPass, dbName, collectName :: Text
userName = "..."
userPass = "..."
dbName = "..."
collectName = "..."

hostName :: String
hostName = "..."

-- Try the primary first, then fall back to any secondary.
primaryOrSecondary :: ReplicaSet -> IO (Maybe Pipe)
primaryOrSecondary rep =
    try (primary rep) >>= \case
        Left (SomeException _) ->
            try (secondaryOk rep) >>= \case
                Left (SomeException _) -> pure Nothing
                Right pipe -> pure $ Just pipe
        Right pipe -> pure $ Just pipe

-- Open a TLS connection via the SRV record and authenticate against "admin".
setup :: IO (Either String Pipe)
setup = do
    repl <- openReplicaSetSRV' hostName
    mb_pipe <- primaryOrSecondary repl
    case mb_pipe of
        Just pipe -> do
            authed <-
                access pipe master "admin" $
                    auth userName userPass
            if authed
                then pure $ Right pipe
                else pure $ Left "Authentication failed"
        Nothing -> pure $ Left "Unable to create pipe!"

test :: IO ()
test =
    setup >>= \case
        Left err -> print err
        Right pipe -> do
            -- 'access' takes the database name; 'select' takes the collection name
            (dist, found) <- access pipe master dbName $
                (,) <$> count_distinct <*> count_found
            -- whenever the collection is non-empty, the 'when'-clause below will trigger
            -- that's quite pathological :(
            when (dist > 0 && dist /= found) $ putStrLn "Failed"
  where
    count_distinct = distinct "_id" (select [] collectName) <&> length
    count_found = find (select [] collectName) >>= rest <&> length

Any idea what could have gone wrong?

@why-not-try-calmer
Contributor Author

why-not-try-calmer commented Jul 19, 2022

Okay so I did some research and here's my recap:

  • something changed on the Mongo Atlas side that they do not regard as a problem (from my exchanges with their support it is apparently not a bug or regression, although the people I talked to could not say precisely which change caused this issue)
  • the problem: writes to Atlas databases work fine, partial reads (= matches) work fine (see distinct above), but full reads are completely broken. This means that none of the functions provided by this library that return values of the Document type work. (This goes beyond my earlier assumption, which was that Cursor was broken. Either it is not broken or it is not the only thing broken.) This makes the library unusable on any Mongo Atlas project for now.
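
A minimal sketch of how one might check a collection for this symptom, assuming the `mongoDB` package and an already authenticated `Pipe`. The names `ReadProbe`, `probeReads` and `looksAffected` are hypothetical helpers introduced here for illustration, not part of the library. It compares a partial read (`distinct` on `_id`) with the two document-returning reads (`findOne` and `find` drained with `rest`):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Maybe (isJust)
import Database.MongoDB

-- | What each kind of read reported for one collection (hypothetical type).
data ReadProbe = ReadProbe
    { distinctIds :: Int  -- number of distinct _id values (partial read)
    , gotOne :: Bool      -- did findOne return a document?
    , drained :: Int      -- documents obtained by draining a cursor
    } deriving (Show, Eq)

-- | Run the three reads against the given database and collection.
probeReads :: Pipe -> Database -> Collection -> IO ReadProbe
probeReads pipe db coll =
    access pipe master db $ do
        ids <- distinct "_id" (select [] coll)
        one <- findOne (select [] coll)
        docs <- find (select [] coll) >>= rest
        pure (ReadProbe (length ids) (isJust one) (length docs))

-- | True when the collection is non-empty according to 'distinct'
-- but the document-returning reads disagree with it.
looksAffected :: ReadProbe -> Bool
looksAffected p =
    distinctIds p > 0 && (not (gotOne p) || drained p /= distinctIds p)
```

Since `_id` is unique per document, the `distinct` count should equal the number of documents, so any mismatch points at the read path rather than at the data.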

If, like me, you were caught off guard by this issue and it is breaking production code, a simple workaround is to read through the MongoDB Atlas Data API instead. It's a band-aid at most, but it keeps the boat afloat.

@why-not-try-calmer why-not-try-calmer changed the title Cursor empty / closed early resulting in empty results Empty results from Mongo DB Atlas (no documents returned) Jul 19, 2022
@why-not-try-calmer why-not-try-calmer changed the title Empty results from Mongo DB Atlas (no documents returned) Empty results from MongoDB Atlas (no documents returned) Jul 19, 2022
@why-not-try-calmer why-not-try-calmer changed the title Empty results from MongoDB Atlas (no documents returned) Empty results from MongoDB Atlas (no document returned) Jul 19, 2022
@why-not-try-calmer
Contributor Author

@KovaxG Can you confirm it's the same issue you're seeing?

@KovaxG

KovaxG commented Jul 19, 2022

From my perspective, a week ago I could insert and query data normally. A few days ago I noticed that my queries return empty lists all the time. Inserts work normally, I have confirmed this by looking up the collections in atlas.

@why-not-try-calmer
Contributor Author

I've set up a small testing repository: https://github.com/why-not-try-calmer/test-mongo

Alex Bevilacqua from MongoDB has agreed to help troubleshoot this issue. Will keep you posted.

@alexbevi

alexbevi commented Jul 29, 2022

To keep everyone updated it appears this behavior only affects shared/free tier Atlas clusters (M0/M2/M5). The reproduction @why-not-try-calmer shared does not fail on an M10 or above, so we can at least narrow the scope of this issue to any plumbing that sits between your Haskell applications and the cluster (such as a proxy). Note this would be Atlas infrastructure; nothing you've outright configured yourself ;)

Short term, if you upgrade to an M10+ this issue should no longer impact your applications; however this may not be ideal in all scenarios as there is a cost associated with the upgrade.

I'll keep this issue updated as more information surfaces, however I wanted to share an update in the meantime.

@why-not-try-calmer
Contributor Author

> To keep everyone updated it appears this behavior only affects shared/free tier Atlas clusters (M0/M2/M5). The reproduction @why-not-try-calmer shared does not fail on an M10 or above, so we can at least narrow the scope of this issue to any plumbing that sits between your Haskell applications and the cluster (such as a proxy). Note this would be Atlas infrastructure; nothing you've outright configured yourself ;)
>
> Short term, if you upgrade to an M10+ this issue should no longer impact your applications; however this may not be ideal in all scenarios as there is a cost associated with the upgrade.
>
> I'll keep this issue updated as more information surfaces, however I wanted to share an update in the meantime.

Thank you for the update! I am happy about the narrowing-down. I think the main worry here is that a library is supposed to treat all consumers equally, and this can no longer be guaranteed if the library fails for consumers using <M10 Atlas clusters. :-)

Please let us know if there's anything more we can do to help troubleshoot the issue.

@why-not-try-calmer
Contributor Author

why-not-try-calmer commented Aug 11, 2022

@VictorDenisov Just pinging you to check whether you're following the situation here. Also: if a fix can be produced in the coming weeks, how long would it take to reach Hackage once the PR is accepted? (I believe my previous PR, merged months ago, has not reached Hackage yet.)

@VictorDenisov
Member

It's only a matter of how soon you need it in Hackage.

@why-not-try-calmer
Contributor Author

why-not-try-calmer commented Aug 11, 2022

> It's only a matter of how soon you need it in Hackage.

Okay, thank you very much for your reply! We're trying to narrow down the culprit, and this is likely to trigger some small updates to the library from me along the way, even if we don't manage to put our finger on the exact cause.

I'd like to finish with this before PR-ing transactions, otherwise the testing will be god-awful.

@alexbevi

Just to provide an update to anyone watching this ticket. We believe we've found the issue and fixing it will not require any changes to the Haskell driver. Once this has been actioned I'll share more information.

@why-not-try-calmer
Contributor Author

why-not-try-calmer commented Aug 19, 2022

> Just to provide an update to anyone watching this ticket. We believe we've found the issue and fixing it will not require any changes to the Haskell driver. Once this has been actioned I'll share more information.

That's amazing! Please share as many crispy details as you're allowed to, it's been a true Netflix show!

@alexbevi

alexbevi commented Aug 19, 2022

> That's amazing! Please share as many crispy details as you're allowed to, it's been a true Netflix show!

TL;DR is it looks like a bug was introduced as a result of a proxy update to make currentOp filtration better. This would only affect drivers that don't specify client details during the initial handshake - such as the Haskell driver.

Once this is all sorted out I hope to write something up about it so I'll make sure I share that here as well :)

@why-not-try-calmer
Contributor Author

why-not-try-calmer commented Aug 19, 2022

>> That's amazing! Please share as many crispy details as you're allowed to, it's been a true Netflix show!
>
> TL;DR is it looks like a bug was introduced as a result of a proxy update to make currentOp filtration better. This would only affect drivers that don't specify client details during the initial handshake - such as the Haskell driver.
>
> Once this is all sorted out I hope to write something up about it so I'll make sure I share that here as well :)

Cool! I will still make a PR to improve our handshake and get closer to compliance with the protocol/interface that you guys expect.
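
For readers wondering what "client details during the initial handshake" could look like, here is a rough sketch using the `bson` package. It only builds the command document. The field layout (`client.driver.name`, `client.driver.version`, `client.os.type`) follows my understanding of the MongoDB handshake specification. The function names `clientMetadata` and `handshakeCommand`, and any values passed to them, are hypothetical. How the driver would actually send this as the first message on a new connection is not shown and is not specified in this thread.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Bson (Document, (=:))
import Data.Text (Text)

-- | Client metadata sub-document (hypothetical helper).
-- Arguments: driver name, driver version, OS type.
clientMetadata :: Text -> Text -> Text -> Document
clientMetadata driverName driverVersion osType =
    [ "driver" =: ["name" =: driverName, "version" =: driverVersion]
    , "os" =: ["type" =: osType]
    ]

-- | Legacy-style handshake command carrying the metadata under "client".
handshakeCommand :: Document -> Document
handshakeCommand meta =
    [ "isMaster" =: (1 :: Int)
    , "client" =: meta
    ]
```

For example, `handshakeCommand (clientMetadata "mongoDB" "x.y.z" "Linux")` yields a two-field document whose `client` field nests the driver and OS details; the version string here is a placeholder, not a real release.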

@codygman

> Cool! I will still make a PR to improve our handshake and get closer to compliance with the protocol/interface that you guys expect.

I would like to rule out my team being affected by this issue for some slightly different causes, is there some workaround while you work on a PR I could try?

@alexbevi

alexbevi commented Aug 22, 2022

@codygman if you are connecting to an M0/M2/M5 cluster the issue may affect you. Upgrading to an M10 would address the issue until further fixes are implemented.

If you're already on an M10+ cluster tier (in Atlas) the issue wouldn't affect you. The same would go for self-managed clusters (not using Atlas)

@why-not-try-calmer
Contributor Author

>> Cool! I will still make a PR to improve our handshake and get closer to compliance with the protocol/interface that you guys expect.
>
> I would like to rule out my team being affected by this issue for some slightly different causes, is there some workaround while you work on a PR I could try?

I worked around this using the Mongo Data API.

@alexbevi

alexbevi commented Sep 2, 2022

Second to last update - the MongoDB team has identified the issue, has a patch and is currently testing it. Once this has been released any Haskell applications that were affected by this issue should start working again. No client-side updates will be required.

Once I know this has been released to production I'll post here so we can close this out.

@alexbevi

Final update - this issue has now been resolved for any MongoDB Atlas M0/M2/M5 deployment. No changes are required in your apps and they should now return results as expected. This should only be considered a temporary fix as legacy opcodes have been removed in MongoDB 6.0.

To ensure the Haskell driver continues to work with 6.0 and beyond issue #123 needs to be implemented.

For anyone curious as to what the actual issue was and how it was addressed I've written up the diagnostic journey that this bug took us on at https://www.alexbevi.com/blog/2022/09/21/bug-hunting-with-the-mongodb-haskell-community/.

@why-not-try-calmer
Contributor Author

Wholehearted thanks to @alexbevi and MongoDB for not dropping the ball and for the fix! I will see to it that #123 comes to fruition.
