Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Egress server: Heroku reports H15 errors / 503 responses #181

Open
noahlevenson opened this issue Aug 22, 2023 · 2 comments
Open

Egress server: Heroku reports H15 errors / 503 responses #181

noahlevenson opened this issue Aug 22, 2023 · 2 comments

Comments

@noahlevenson
Copy link
Contributor

noahlevenson commented Aug 22, 2023

@oxtoacart pointed out that the egress server metrics indicate a bunch of 5XX responses for requests. This correlates with the H15 errors reported in the events.

If you tail the logs for the server, you can observe these things happening from time to time. It looks like this:

2023-08-22T23:43:31.603801+00:00 heroku[router]: at=error code=H15 desc="Idle connection" method=GET path="/ws" host=bf-egress.herokuapp.com request_id=403c23
3f-37c6-4371-ae53-253143b2e767 fwd="68.81.173.37" dyno=web.1 connect=0ms service=100175ms status=503 bytes= protocol=https                                    
2023-08-22T23:43:31.604881+00:00 app[web.1]: 2023/08/22 23:43:31 WebSocket connection 952f97a1-d237-4c45-b4fc-b9fdb3a5876e QUIC listener error (failed to get 
reader: failed to read frame header: EOF), closing!                                                                                                           
2023-08-22T23:43:31.604896+00:00 app[web.1]: 2023/08/22 23:43:31 Closed a WebSocket connection! (44 total)  

After reading about H15 errors, you might think that Heroku is just deciding that some of our WebSocket connections are idle and closing them down. But we implemented a keepalive to prevent this from happening, and at the time, it seems to have worked.

I thought: Is the keepalive now failing? It was set for 30s, so I reduced it to 15s and redeployed. But this didn't solve the problem!

@noahlevenson noahlevenson self-assigned this Aug 22, 2023
@noahlevenson
Copy link
Contributor Author

Given the content and order of the log messages when the error is reported, there are two possible interpretations of causality here:

  1. Heroku decides a WebSocket is idle and closes the connection, and then the server detects that the connection no longer exists and cleans it up.

  2. The server detects a dead connection (due to a communications partner who left abruptly) and closes the WebSocket, and then Heroku incorrectly interprets this as an idle condition, reporting an error for no reason.

This SO thread suggests it might be no. 2:

https://stackoverflow.com/questions/32728030/heroku-h15-error-on-web-socket-close

@noahlevenson
Copy link
Contributor Author

Another clue: If you pay attention to the timestamps in the log messages, it seems that the messages actually print out of order, which obfuscates causality. In other words, if you match up the log message from the egress server indicating WebSocket closure with the log message from Heroku reporting the error, it's actually the egress server message with the earlier timestamp -- even though the Heroku message reliably appears first.

The timestamp reversal does not occur when looking at messages resulting from a client-initiated healthy closure.
That is, when a client toggles the widget off, Heroku emits a log message reporting the outcome of the HTTP GET, and then the server emits a log message reporting the death of the WebSocket.

I've only looked at 10 or so message pairings -- but if these invariants hold, it suggests the following idea:

When a client disconnects in a healthy way, Heroku is the first to know about it (which makes sense), and so Heroku reports the outcome of the request first.

When a client disconnects in an unhealthy way, leaving the connection in a half open state, the server is the first to know about it, and so the server formally closes the request. Only then does Heroku see that the request has terminated, and so it reports the outcome -- and for some reason, it reports it as an H15 error.

This suggests that, in the case of the reported error, it's not Heroku closing our WebSockets and our server reacting to it -- rather, it's our server behaving correctly, and Heroku just interpreting things wrong.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant