
Postgres read/write primary node seeming timeouts led to runtime traffic failures on existing runtime proxies #13597

jeremyjpj0916 opened this issue Aug 30, 2024 · 2 comments


jeremyjpj0916 commented Aug 30, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Kong version ($ kong version)

3.7.1

Current Behavior

On 8/8/2024, something in our network or in the Postgres cluster itself caused the Kong nodes in that network zone to fail some calls with HTTP 500 errors and to stop reverse proxying on the service path (/api/cel/myservice/member/v1). The pod logs showed this:

2024/08/08 07:23:35 [error] 1389147#0: *48961090 [lua] upstreams.lua:206: do_upstream_event(): upstream not found for b6e87b44-f251-4ede-b277-030c58f48c1b, context: ngx.timer
2024/08/08 07:23:36 [error] 1389149#0: *48960645 [lua] upstreams.lua:206: do_upstream_event(): upstream not found for b6e87b44-f251-4ede-b277-030c58f48c1b, context: ngx.timer
2024/08/08 07:23:36 [error] 1389150#0: *48963061 [lua] upstreams.lua:206: do_upstream_event(): upstream not found for b6e87b44-f251-4ede-b277-030c58f48c1b, context: ngx.timer
2024/08/08 07:23:37 [crit] 1389150#0: *48963293 [lua] upstreams.lua:98: could not obtain list of upstreams: [postgres] timeout, client: 10.xx.xxx.xx, server: kong, request: "GET /api/cel/myservice/member/v1 HTTP/1.1", host: "gateway.company.com", request_id: "1902136b57a2f27861e1b9499d35f95e"
2024/08/08 07:23:40 [error] 1389148#0: *48962101 [lua] upstreams.lua:206: do_upstream_event(): upstream not found for b6e87b44-f251-4ede-b277-030c58f48c1b, context: ngx.timer
2024/08/08 07:23:45 [crit] 1389147#0: *48950395 [lua] upstreams.lua:98: could not obtain list of upstreams: [postgres] timeout, client: 3.xx.xxx.xx, server: kong, request: "POST /api/pdr/hello/service/org/v1.0 HTTP/1.1", host: "gateway.company.com", request_id: "de7bf728ea504978e266251907094342"

Oddly, all of the transaction logs for the HTTP 500s (where no proxy routing happened) showed Kong's internal time at about 5000 ms, yet the PG timeout field is set to 8 s, so I wonder where the 5 s figure comes into play. Is there some default Kong PG connection timeout under the hood that isn't configurable? Or is the DB update frequency, which is set to 5 s, somehow playing a role? Our settings:

        - name: KONG_PG_TIMEOUT
          value: "8000"
        - name: KONG_DB_UPDATE_FREQUENCY
          value: "5"
        - name: KONG_DB_UPDATE_PROPAGATION
          value: "0"
        - name: KONG_DB_CACHE_TTL
          value: "0"

It seems all worker processes in the Kong node were involved during the window of impact too. Note that we set the cache TTL to never expire, so services/routes etc. should all have been cached. The proxy throwing the 500s had been running a healthy 200 transaction flow right up until the impact.

The proxy was a service+route with the ACL plugin and a variant of the Kong OAuth2 client-credentials flow plugin for auth, and the consumer was using an existing valid token from prior calls.

We run a sh script in the background on the Kong node that keeps track of Kong's ability to write to the DB, since Kong doesn't otherwise expose anything to track that (hence the UUID you see in the logs, b6e87b44-f251-4ede-b277-030c58f48c1b; that's the UUID of the single healthcheck_db_put upstream in this environment):

validate_database_health()
{
  URL="localhost:8001/upstreams/healthcheck_db_put"
  # store the whole response with the status at the end
  HTTP_RESPONSE=$(curl -X PUT -i -v --silent --write-out "HTTPSTATUS:%{http_code}" "$URL")
...
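(The rest of the function is trimmed above; as a rough sketch of the typical pattern, not our exact script, the status code gets split back out of the response and checked roughly like this:)

  # everything after the HTTPSTATUS: marker is the numeric status code
  HTTP_STATUS=$(echo "$HTTP_RESPONSE" | tr -d '\n' | sed -e 's/.*HTTPSTATUS://')
  # the PUT only succeeds if the write to the primary works, so treat any 2xx
  # as a healthy db write and anything else as a failure
  case "$HTTP_STATUS" in
    2*) echo "db write healthcheck OK ($HTTP_STATUS)" ;;
    *)  echo "db write healthcheck FAILED ($HTTP_STATUS)" ;;
  esac
}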

So while we don't use upstream resources in these Kong nodes (services are all configured with an existing hostname), we do PUT one dummy record to the upstreams endpoint so there is always a configured upstream resource that shows writes to the primary node are functional.

Expected Behavior

My expected behavior is that even when the primary Postgres write node goes down, and an existing OAuth2 client-credentials-authorized consumer keeps sending their token to call a proxy, the cached resources will keep working and allow traffic through, even if the primary r/w node is in a brief window of network connectivity issues.

It seems, though, that sometimes a given Kong node can get into a state where that doesn't happen: it has either lost its cached records or starts doing something unexpected that fails the proxy call at the 5 s mark with a response size of 616 bytes, which I'm guessing is the usual Kong default "unexpected error occurred" JSON body thrown back at the client.

What makes this more interesting is that during this time only 1 of the 6 Kong nodes that talk to this 3-node Postgres cluster (1 primary, 2 secondaries), all with the same configuration, started showing this 500 status behavior. That's why I'm wondering whether, in traditional mode, Kong has some resiliency issues these days running against a Postgres cluster where the cache isn't holding up properly... or, better yet, whether, if it does need to read from a PG node, it could at least be willing to try the secondary read-only nodes to get the data it wants before failing the proxy transaction.
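(On the read-replica idea specifically, a possible angle I haven't tried in this environment, so treat it as a sketch: Kong's pg_ro_* settings are meant to point read queries at a read-only replica while writes stay on pg_host, e.g. via env vars like:)

# hypothetical replica host/port; reads would go to the replica, writes still hit the primary
export KONG_PG_RO_HOST=pg-replica.internal.example
export KONG_PG_RO_PORT=5432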

[Image: per-node transaction counts during the incident window]

The image above shows that the node with the ~3k transactions was the one throwing the 500s over the same interval as the other nodes (the others were not handling much of the traffic; as you can see, almost everything was routing through the one node that was producing the 500s).

Furthermore, I am willing to speculate that the PG cluster itself was healthy during this window, but something between two network zones caused connectivity issues that prevented Kong from reaching the PG primary node during the 8-9 minute span in which the errors were seen.

Could the PUT calls to the upstreams resource, used as a kind of write health check, sometimes be causing other Kong resources to be evicted from cache or rebuilt in some odd race condition that doesn't always trigger? I really don't know how you could even reproduce or debug this situation, but it's best to make sure Kong is as resilient as possible to DB connectivity issues in traditional mode as well. I'm also still curious what your thoughts are on the 5-second mark being the sweet spot on the failed transactions when my PG timeout setting is 8 seconds (maybe that only applies to reads/writes and 5 s is some internal connection default?).

Steps To Reproduce

  1. Not really reproducible, as it was a complicated scenario with unpredictable results (only 1 pod out of 6 faced this problem during the network outage).

Anything else?

No response

@Water-Melon commented:
Hi @bungle, could you take a look at this issue? Thanks.

nowNick commented Sep 9, 2024

@bungle, did you have a chance to look at this issue?
