
DNS Resolution failed - socket busy #12568

Closed
henrythethird opened this issue Feb 15, 2024 · 21 comments
Labels: dns, pending author feedback (waiting for the issue author to get back to a maintainer with findings, more details, etc.)

Comments


henrythethird commented Feb 15, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Kong version ($ kong version)

Kong 3.5 and 3.6

Current Behavior

Requests fail randomly with the following log:

[error] 1282#0: *5585 [lua] init.lua:371: execute(): DNS resolution failed: failed to send request to UDP server 10.0.0.10:53: socket busy. Tried: ["(short)xxx:(na) - cache-miss","xxx.local:33 - cache-miss/querying"], client: xxx, server: kong, request: "GET /path HTTP/1.1", host: "xxx", request_id: "5a27c00c11b735727f5c452bf425f26d"

The client gets the following message back:

{ message: 'failed the initial dns/balancer resolve for \'xxx\' with: failed to send request to UDP server 10.0.0.10:53: socket busy',
  request_id: '2ae16d137decfc5bd7a5051975d9b922' }

Expected Behavior

Requests go through, as they did with Kong 3.4.2.

Steps To Reproduce

I don't know, since it happens only sporadically
Edit: It happens quite frequently when the configuration option dns_no_sync is turned off (so dns sync is on -> default behavior as of Kong 3.5.0) and quite sporadically with it turned on.

We have the following setup:

  • Declarative config file
  • Kubernetes. The redacted xxx hostnames come from the K8s internal DNS
  • Mostly default configuration (with custom plugin)

Anything else?

It works perfectly with Kong 3.4.2. We tried upgrading to 3.5 back in November, which gave us the same flaky behavior. We found some very old issues about this, which seem to have been resolved already.

@henrythethird

Could it be related to this issue (currently open): #12290?
It's not the same error in the logs, but the response is similar and it's DNS-related as well.


henrythethird commented Feb 19, 2024

We found the change that worsens our issue: #11871

We updated our kong.conf with dns_no_sync = on (the name confused me a bit: at first I set it to off, thinking that would turn off DNS sync, then I noticed the "no" in there).

This mitigates our problem, but the underlying issue is still there if someone cares to investigate.
The problem persists, but happens quite rarely.
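
For reference, a minimal sketch of what the relevant line in kong.conf looks like (the double negative means "on" disables the per-worker DNS query synchronisation that became the default in 3.5.0):

# disable DNS query synchronisation (synchronisation is on by default since 3.5.0)
dns_no_sync = on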

henrythethird changed the title from "DNS Resolution failed - socket busy" to "DNS Resolution failed - socket busy with dns_no_sync off" Feb 19, 2024
henrythethird changed the title from "DNS Resolution failed - socket busy with dns_no_sync off" to "DNS Resolution failed - socket busy (related to **dns_no_sync off**)" Feb 19, 2024
henrythethird changed the title from "DNS Resolution failed - socket busy (related to **dns_no_sync off**)" back to "DNS Resolution failed - socket busy" Feb 19, 2024
@henrythethird

Possibly related to #3223?

@henrythethird

Any input on this?

@chronolaw

@chobits, could you give us some comments on this?

chronolaw added the dns label Mar 5, 2024

chobits commented Mar 5, 2024

will look into it later


chobits commented Mar 5, 2024

socket busy

This might be because sock:setpeername() is trying to resolve the name; during this process, if someone tries to execute sock:send(), it will report a "socket busy" error.
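
A rough sketch of that scenario (illustrative only, not the DNS client's actual code; the hostname is made up and this assumes an OpenResty request context with a resolver configured):

    -- one UDP cosocket shared by two light threads
    local sock = ngx.socket.udp()

    local t1 = ngx.thread.spawn(function()
        -- setpeername() with a hostname resolves the name inside the call,
        -- so this thread may hold the socket for a while
        local ok, err = sock:setpeername("upstream.example.internal", 53)
        if not ok then
            ngx.log(ngx.ERR, "setpeername failed: ", err)
        end
    end)

    local t2 = ngx.thread.spawn(function()
        -- if this runs while the first thread is still inside setpeername(),
        -- the cosocket rejects the concurrent operation ("socket busy")
        local ok, err = sock:send("payload")
        if not ok then
            ngx.log(ngx.ERR, "send failed: ", err)
        end
    end)

    ngx.thread.wait(t1)
    ngx.thread.wait(t2)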

We updated our kong.conf with dns_no_sync = on, The problem persists, but happens quite rarely.

So it is as expected when you enable sync: a large number of network queries for the same domain at the same time in one worker are merged into one coroutine. In effect, fewer DNS queries trigger fewer DNS errors.

If you could provide the debug log around the error message, it would help us debug this.

chronolaw added the "pending author feedback" label Mar 5, 2024

henrythethird commented Mar 7, 2024

Hi @chobits, thanks for chipping in

Here are the logs. I included some startup logs as well so you can see the configuration. It happened right off the bat as well
Edit: I tried with Kong 3.6.1 (the latest available version on Dockerhub)
kong-logs.csv

The problem is quite severe for us (at least with DNS sync enabled) and happens even if there are very few requests being made. The example above comes from a few system tests running through; I guess there are about 50 calls within 2 minutes. I don't imagine it is a heavy-load problem, rather a configuration problem on our end or something else.

Some more information that could possibly be relevant:

  • We have a custom auth plugin installed with a few lua deps
  • We use the official ubuntu image and update the dependencies in our custom dockerfile (because of our vulnerability management policies)
  • We are inside a Google Kubernetes Cluster and behind a Load Balancer that provides the SSL-Support
  • We deploy via a custom helm chart, since we just need a minimal setup


chobits commented Mar 18, 2024

Sorry for the late reply. I have reviewed your kong-logs.csv, but I didn't find any logs useful for the investigation, because most of the useful debug log statements are commented out in the DNS client library.

I also suspect there might be some incorrect DNS resolution operation executed by sock:connect() in lua-nginx-module. I still need to spend some time on that before I can tell you how to debug it.

BTW, is the xxx hostname in your logs a redacted placeholder for the real domain name?


chobits commented Mar 18, 2024

Hi, I have thought of a way to mitigate your issue: you could raise dns_stale_ttl from 4 s to a longer value. In the latest Kong version we have set it to 3600 s; you could set it to that value as well, which gives your DNS records a longer lifespan.
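
For example, in kong.conf (a minimal sketch; 3600 matches the default shipped in recent versions):

dns_stale_ttl = 3600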


henrythethird commented Mar 18, 2024

Thanks for the reply.

Yes, the xxx entries are real Kubernetes domain names (internal DNS).

It seems like staleTtl is already at 3600 as far as I can tell; please correct me if I'm wrong here. Here is a snippet from the logs that describes the DNS settings more verbosely (staleTtl being the second entry from the bottom):

"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:647: init(): [dns-client] emptyTtl = 30 s","2024-03-07T08:24:08.956815518Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:645: init(): [dns-client] badTtl = 1 s","2024-03-07T08:24:08.956632467Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:633: init(): [dns-client] search = king.svc.cluster.local, svc.cluster.local, cluster.local, google.internal","2024-03-07T08:24:08.956478982Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:631: init(): [dns-client] ndots = 5","2024-03-07T08:24:08.956330928Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:627: init(): [dns-client] timeout = 2000 ms","2024-03-07T08:24:08.956134323Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:618: init(): [dns-client] no_random = true","2024-03-07T08:24:08.955963978Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:611: init(): [dns-client] attempts = 5","2024-03-07T08:24:08.955823817Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:606: init(): [dns-client] nameserver 10.0.0.10","2024-03-07T08:24:08.955685968Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:565: init(): [dns-client] validTtl = nil","2024-03-07T08:24:08.955670959Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:556: init(): [dns-client] adding AAAA-record from 'hosts' file: localhost = [::1]","2024-03-07T08:24:08.955662768Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:541: init(): [dns-client] adding A-record from 'hosts' file: localhost = 127.0.0.1","2024-03-07T08:24:08.955627713Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:556: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-allrouters = [fe00::2]","2024-03-07T08:24:08.954856727Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:556: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-localnet = [fe00::0]","2024-03-07T08:24:08.954672873Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:556: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-allnodes = [fe00::1]","2024-03-07T08:24:08.954364051Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:556: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-loopback = [::1]","2024-03-07T08:24:08.954047780Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:556: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-localhost = [::1]","2024-03-07T08:24:08.953833227Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:541: init(): [dns-client] adding A-record from 'hosts' file: xxx = xxx","2024-03-07T08:24:08.953549257Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:556: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-mcastprefix = [fe00::0]","2024-03-07T08:24:08.953099393Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:501: init(): [dns-client] query order = LAST, SRV, A, CNAME","2024-03-07T08:24:08.952530616Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:482: init(): [dns-client] noSynchronisation = false","2024-03-07T08:24:08.950206705Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:478: init(): [dns-client] staleTtl = 3600","2024-03-07T08:24:08.950063140Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:473: init(): [dns-client] (re)configuring dns client","2024-03-07T08:24:08.949858585Z"

We don't override any DNS-related options in the config file. Here is our current kong.conf (injected into the kong:3.4.2 image):

plugins = auth-internal
declarative_config = /etc/kong/kong.yml
nginx_http_lua_shared_dict=auth_internal 10m
# Change for kong>=3.5 cf https://github.com/Kong/kong/issues/12568
# dns_no_sync = on

Thanks @chobits for your support 🙏


chobits commented Mar 25, 2024

hi @henrythethird

Could you try this docker image: kong/kong:3a034e9417743fcf946deb6f98d2993049b1e589-ubuntu, which is built from #12305 on top of 3.6.x? It has passed all our test cases.

In this PR we refactored our DNS client. If you could try this PR's docker image, it would help us verify whether this issue is still present in the new DNS client. If both versions of Kong have the problem, I guess the DNS client socket might be getting messed up by some other subsystem in Kong.


henrythethird commented Mar 26, 2024

Thanks for the update @chobits. We get the following error with kong/kong:3a034e9417743fcf946deb6f98d2993049b1e589-ubuntu:

2024/03/26 08:27:37 [error] 1331#0: *90612 lua entry thread aborted: runtime error: /usr/local/share/lua/5.1/kong/resty/dns_client/init.lua:669: attempt to call method 'lower' (a nil value)
stack traceback:
coroutine 0:
	/usr/local/share/lua/5.1/kong/resty/dns_client/init.lua: in function 'execute_original_func'
	/usr/local/share/lua/5.1/kong/dynamic_hook/init.lua:144: in function 'raw_func'
	/usr/local/share/lua/5.1/kong/tracing/instrumentation.lua:288: in function 'toip'
	/usr/local/share/lua/5.1/kong/runloop/balancer/init.lua:364: in function 'execute'
	/usr/local/share/lua/5.1/kong/runloop/handler.lua:832: in function 'balancer_execute'
	/usr/local/share/lua/5.1/kong/runloop/handler.lua:1352: in function 'after'
	/usr/local/share/lua/5.1/kong/init.lua:1186: in function 'access'
	access_by_lua(nginx-kong.conf:112):2: in main chunk, client: xxx, server: kong, request: "POST /path/to/resource HTTP/1.1", host: "xxx", request_id: "2a1bbe203f4c2a309845b0777f2b7862"

Additionally, in the beginning we get the following (still the same):

[error] 1330#0: *90586 [lua] init.lua:371: execute(): DNS resolution failed: DNS server error: failed to send request to UDP server 10.0.0.10:53: socket busy, Query Time: 1711441647.111 0.000. Tried: [["cluster.address.local:SRV","DNS server error: failed to send request to UDP server 10.0.0.10:53: socket busy, Query Time: 1711441647.111 0.000"]], client: xxx, server: kong, request: "GET /path/to/resource", host: "xxx", request_id: "0b7dc268d13c6648f089681aa55df436"


chobits commented Mar 26, 2024

Thanks for the update @chobits. We get the following error with kong/kong:3a034e9417743fcf946deb6f98d2993049b1e589-ubuntu:

2024/03/26 08:27:37 [error] 1331#0: *90612 lua entry thread aborted: runtime error: /usr/local/share/lua/5.1/kong/resty/dns_client/init.lua:669: attempt to call method 'lower' (a nil value)
stack traceback:
coroutine 0:
	/usr/local/share/lua/5.1/kong/resty/dns_client/init.lua: in function 'execute_original_func'
	/usr/local/share/lua/5.1/kong/dynamic_hook/init.lua:144: in function 'raw_func'
	/usr/local/share/lua/5.1/kong/tracing/instrumentation.lua:288: in function 'toip'
	/usr/local/share/lua/5.1/kong/runloop/balancer/init.lua:364: in function 'execute'
	/usr/local/share/lua/5.1/kong/runloop/handler.lua:832: in function 'balancer_execute'
	/usr/local/share/lua/5.1/kong/runloop/handler.lua:1352: in function 'after'
	/usr/local/share/lua/5.1/kong/init.lua:1186: in function 'access'
	access_by_lua(nginx-kong.conf:112):2: in main chunk, client: xxx, server: kong, request: "POST /path/to/resource HTTP/1.1", host: "xxx", request_id: "2a1bbe203f4c2a309845b0777f2b7862"

Does this error occur consistently for this request: "POST /path/to/resource HTTP/1.1", host: "xxx", request_id: "2a1bbe203f4c2a309845b0777f2b7862"?

Does your plugin call methods like kong.service.set_target() or kong.service.set_upstream()?

From Kong's logic, it seems that balancer_data.host should be a string; if we call balancer_data.host:lower() on a string, it succeeds without reporting a runtime error.
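
For illustration only (not Kong's actual code), this is the kind of value that produces that exact error:

    -- calling :lower() on a value that is not a string (here a table, so the
    -- 'lower' method resolves to nil) raises the error from the traceback above
    local host = { name = "some.upstream.local" }  -- hypothetical non-string value
    local ok, err = pcall(function()
        return host:lower()
    end)
    -- ok == false; err ends with "attempt to call method 'lower' (a nil value)"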

Additionally, in the beginning we get the following (still the same):

[error] 1330#0: *90586 [lua] init.lua:371: execute(): DNS resolution failed: DNS server error: failed to send request to UDP server 10.0.0.10:53: socket busy, Query Time: 1711441647.111 0.000. Tried: [["cluster.address.local:SRV","DNS server error: failed to send request to UDP server 10.0.0.10:53: socket busy, Query Time: 1711441647.111 0.000"]], client: xxx, server: kong, request: "GET /path/to/resource", host: "xxx", request_id: "0b7dc268d13c6648f089681aa55df436"

So it might be that the DNS socket is getting messed up. However, your image is not built with nginx debug logging (--with-debug), so even if you enable the debug log level we cannot get debug logs from nginx's / lua-nginx-module's logic, only from Kong's Lua land. Let me think about how to trace it.


henrythethird commented Apr 2, 2024

Does this error occur consistently for this request: "POST /path/to/resource HTTP/1.1", host: "xxx", request_id: "2a1bbe203f4c2a309845b0777f2b7862"?

No, only sporadically. It happens even less with dns_no_sync turned on.

Does your plugin call methods like kong.service.set_target() or kong.service.set_upstream()?

No, we don't call any of these methods. We have a static YAML file that sets those parameters.


henrythethird commented Apr 16, 2024

I just tried again with the latest 3.6.1 Dockerfile and now it seems to work 😕. Strange, but I'll take it. Thanks for your support @chobits. It must be something with the dependencies, I'd imagine.

@henrythethird

False alarm. It still happens, just less often.

henrythethird reopened this Apr 16, 2024
@henrythethird

On a hunch I replaced the http dependency of our plugin with resty.http. So far we haven't had any more issues. It seems like the lua-http dependency was somehow hogging the DNS socket. Not sure why that only happened with newer versions of Kong, probably some kind of race condition. Anyway, cheers for the support.


mayank-allen commented Sep 24, 2024

On a hunch I replaced the http dependency of our plugin with resty.http. So far we haven't had any more issues. It seems like the lua-http dependency was somehow hogging the DNS socket. Not sure why that only happened with newer versions of Kong, probably some kind of race condition. Anyway, cheers for the support.

Hi @henrythethird, could you please explain what exact change you made, i.e. how you replaced the dependency?
Eagerly waiting for your reply.


Tieske commented Sep 25, 2024

The http library (lua-http) is not suited for use with OpenResty; combining them results in blocking I/O.


henrythethird commented Oct 2, 2024

@mayank-allen
This is very specific to our project. I can give you a snippet of what I did though:

Removed

    -- old client, based on the lua-http library (blocks inside OpenResty)
    local request = http.new_from_uri(uri)

    request.follow_redirects = false
    request.headers:upsert(header_key, key)
    request.headers:upsert(":method", "POST")

    local headers, stream = request:go(10)

Instead added

    -- new client, based on lua-resty-http (cosocket based, non-blocking)
    local http = require("resty.http")

    local http_client = http.new()

    local response, err = http_client:request_uri(url, {
        method = "POST",
        path = path,
        headers = {
            [header_key] = key,
        }
    })
    -- response is nil and err holds the error message if the request failed

This change is also reflected in the .rockspec file:

-  "http ~> 0.4-0",
+  "lua-resty-http >= 0.11",

@Tieske it might be worth adding that to the documentation (if it's not already in there). We struggled with this quite a bit.
