
runsc: PacketMMAP dispatch mode is buggy. #210

Closed
hbhasker opened this issue Apr 23, 2019 · 11 comments
Labels
area: networking Issue related to networking priority: p3 Low priority stale-issue This issue has not been updated in 120 days. type: bug Something isn't working

Comments

@hbhasker
Contributor

TPACKET_V1/V2 PACKET_RX|TX_RING buffers are affected by a kernel bug where multiple kernel threads can incorrectly assume control of a ring buffer entry, leaving the ring buffer out of sync with user-space. In our dispatcher this manifests as getting stuck in a poll loop: poll returns readable, but the frame at our current ring offset does not have the user bit (TP_STATUS_USER) set.

This completely stalls the network. It seems easier to trigger when there are many connections to an external destination streaming large amounts of data in.

The bug is described in this thread:

https://www.mail-archive.com/[email protected]/msg237222.html

This is not fixed in the kernel. It can be sidestepped by increasing the number of frames, which reduces the likelihood of hitting the bug. But from my experiments, using smaller frames is not worth it: it requires GRO to be disabled, and in a VM GRO/TSO is almost impossible to disable completely. See
http://patchwork.ozlabs.org/patch/1016373/

For now we should revert runsc to the RecvMMsg dispatch mode while we change the ring code to use TPACKET_V3 instead of v1.
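
For reference, here is a minimal sketch of the v1/v2 RX loop where the stall shows up (names and layout are assumptions for illustration, not gVisor's actual dispatcher code):

package dispatch

import (
	"sync/atomic"
	"unsafe"

	"golang.org/x/sys/unix"
)

// rxLoop walks a mapped PACKET_RX_RING one frame at a time. With the kernel
// bug, poll can keep reporting the socket readable while the frame at the
// current offset never gains TP_STATUS_USER, so the loop spins forever.
func rxLoop(fd int, ring []byte, frameSize int) error {
	off := 0
	for {
		pfd := []unix.PollFd{{Fd: int32(fd), Events: unix.POLLIN}}
		if _, err := unix.Poll(pfd, -1); err != nil {
			return err
		}
		// tp_status is the first word of the frame header; it must be read
		// atomically because the kernel writes it concurrently.
		status := atomic.LoadUint32((*uint32)(unsafe.Pointer(&ring[off])))
		if status&unix.TP_STATUS_USER == 0 {
			continue // readable per poll, but no frame for us: the stall
		}
		// ... process ring[off : off+frameSize], then return the frame:
		atomic.StoreUint32((*uint32)(unsafe.Pointer(&ring[off])), unix.TP_STATUS_KERNEL)
		off = (off + frameSize) % len(ring)
	}
}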

@hbhasker hbhasker self-assigned this Apr 23, 2019
@hbhasker hbhasker added area: networking Issue related to networking type: bug Something isn't working priority: p1 High priority labels Apr 23, 2019
shentubot pushed a commit that referenced this issue Apr 24, 2019
This CL fixes the following bugs:

- Uses atomic operations to set/read status instead of binary.LittleEndian.PutUint32
  etc., which are not atomic.

- Increments ringOffsets for frames that are truncated (i.e., status is
  tpStatusCopy).

- Does not ignore frames with the tpStatusLost bit set, as they are valid frames;
  the bit only indicates that some frames were lost before this one, and loss
  metrics can be retrieved with a getsockopt call.

- Adds checks to make sure blockSize is a multiple of the page size. This is
  required as the kernel allocates whole pages per block and rejects sizes that
  are not page aligned with EINVAL.

Updates #210

PiperOrigin-RevId: 244959464
Change-Id: I5d61337b7e4c0f8a3063dcfc07791d4c4521ba1f
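
To illustrate the first and last fixes, here is a minimal sketch of the atomic status access and the blockSize check (helper names are assumptions, not the CL's actual code):

package dispatch

import (
	"fmt"
	"os"
	"sync/atomic"
	"unsafe"
)

// frameStatus reads a frame's tp_status word atomically; plain reads via
// binary.LittleEndian can race with the kernel's concurrent updates.
func frameStatus(ring []byte, off int) uint32 {
	return atomic.LoadUint32((*uint32)(unsafe.Pointer(&ring[off])))
}

// checkBlockSize mirrors the kernel's constraint: blocks are allocated in
// whole pages, so a non-page-aligned size is rejected with EINVAL.
func checkBlockSize(blockSize int) error {
	if blockSize%os.Getpagesize() != 0 {
		return fmt.Errorf("blockSize %d is not a multiple of the page size %d",
			blockSize, os.Getpagesize())
	}
	return nil
}
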
shentubot pushed a commit that referenced this issue Apr 24, 2019
PacketMMap mode has issues due to a kernel bug. This change
reverts us to using recvmmsg instead of a shared ring buffer to
dispatch inbound packets. This will reduce performance but should
be more stable under heavy load until PacketMMap is updated to
use TPACKET_V3.

See #210 for details.

Perf difference between recvmmsg and packetmmap:

RecvMMsg :
iperf3 -c 172.17.0.2
Connecting to host 172.17.0.2, port 5201
[  4] local 172.17.0.1 port 43478 connected to 172.17.0.2 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   778 MBytes  6.53 Gbits/sec  4349    188 KBytes
[  4]   1.00-2.00   sec   786 MBytes  6.59 Gbits/sec  4395    212 KBytes
[  4]   2.00-3.00   sec   756 MBytes  6.34 Gbits/sec  3655    161 KBytes
[  4]   3.00-4.00   sec   782 MBytes  6.56 Gbits/sec  4419    175 KBytes
[  4]   4.00-5.00   sec   755 MBytes  6.34 Gbits/sec  4317    187 KBytes
[  4]   5.00-6.00   sec   774 MBytes  6.49 Gbits/sec  4002    173 KBytes
[  4]   6.00-7.00   sec   737 MBytes  6.18 Gbits/sec  3904    191 KBytes
[  4]   7.00-8.00   sec   530 MBytes  4.44 Gbits/sec  3318    189 KBytes
[  4]   8.00-9.00   sec   487 MBytes  4.09 Gbits/sec  2627    188 KBytes
[  4]   9.00-10.00  sec   770 MBytes  6.46 Gbits/sec  4221    170 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  6.99 GBytes  6.00 Gbits/sec  39207             sender
[  4]   0.00-10.00  sec  6.99 GBytes  6.00 Gbits/sec                  receiver

iperf Done.

PacketMMap:

bhaskerh@gvisor-bench:~/tensorflow$ iperf3 -c 172.17.0.2
Connecting to host 172.17.0.2, port 5201
[  4] local 172.17.0.1 port 43496 connected to 172.17.0.2 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   657 MBytes  5.51 Gbits/sec    0   1.01 MBytes
[  4]   1.00-2.00   sec  1021 MBytes  8.56 Gbits/sec    0   1.01 MBytes
[  4]   2.00-3.00   sec  1.21 GBytes  10.4 Gbits/sec   45   1.01 MBytes
[  4]   3.00-4.00   sec  1018 MBytes  8.54 Gbits/sec   15   1.01 MBytes
[  4]   4.00-5.00   sec  1.28 GBytes  11.0 Gbits/sec   45   1.01 MBytes
[  4]   5.00-6.00   sec  1.38 GBytes  11.9 Gbits/sec    0   1.01 MBytes
[  4]   6.00-7.00   sec  1.34 GBytes  11.5 Gbits/sec   45    856 KBytes
[  4]   7.00-8.00   sec  1.23 GBytes  10.5 Gbits/sec    0    901 KBytes
[  4]   8.00-9.00   sec  1010 MBytes  8.48 Gbits/sec    0    923 KBytes
[  4]   9.00-10.00  sec  1.39 GBytes  11.9 Gbits/sec    0    960 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  11.4 GBytes  9.83 Gbits/sec  150             sender
[  4]   0.00-10.00  sec  11.4 GBytes  9.83 Gbits/sec                  receiver

Updates #210

PiperOrigin-RevId: 244968438
Change-Id: Id461b5cbff2dea6fa55cfc108ea246d8f83da20b
@hbhasker
Contributor Author

Just an update.

I implemented a TPACKET_V3-based dispatcher. While it does not suffer from the bug in TPACKET_V1/V2 and achieves good throughput on a loaded network, it does suffer from latency issues when the network is not busy.

This is because each block is filled one at a time by the kernel, and skb->data_ready() is invoked only when the block is full or the block timeout expires. The lowest resolution for the block timeout is 1 ms, which is far too coarse for high-speed networking and introduces a large amount of delay.

A simple ping of the host from gVisor can see latencies go from 0.4 ms to 7 ms. TPACKET_V3 is not usable unless a finer-resolution timer is added.

The only path forward for PACKET_RX_RING will be to wait for a fix to the kernel bug.

I believe in the short term we should stick with recvmmsg. I am going to downgrade the priority of this bug as we don't really have a path forward at the moment to use PACKET_RX_RING.
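
For context, a minimal sketch of the TPACKET_V3 ring setup (sizes are illustrative assumptions); Retire_blk_tov is expressed in milliseconds, so 1 ms is the finest block timeout available, which is where the idle-network latency comes from:

package setup

import "golang.org/x/sys/unix"

// setupV3Ring configures a TPACKET_V3 RX ring on an AF_PACKET socket fd.
// All sizes below are assumptions for illustration.
func setupV3Ring(fd int) error {
	if err := unix.SetsockoptInt(fd, unix.SOL_PACKET, unix.PACKET_VERSION, unix.TPACKET_V3); err != nil {
		return err
	}
	req := unix.TpacketReq3{
		Block_size:     64 * 1024, // must be a multiple of the page size
		Block_nr:       32,
		Frame_size:     2048,
		Frame_nr:       (64 * 1024 / 2048) * 32, // frames per block * blocks
		Retire_blk_tov: 1,                       // block timeout in ms; 1 is the minimum
	}
	return unix.SetsockoptTpacketReq3(fd, unix.SOL_PACKET, unix.PACKET_RX_RING, &req)
}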

@hbhasker hbhasker added priority: p3 Low priority and removed priority: p1 High priority labels Apr 25, 2019
@hbhasker hbhasker changed the title from "runsc: PacketMMAP dispatch mode is buggy switch to TPacketV3" to "runsc: PacketMMAP dispatch mode is buggy." Apr 25, 2019
tonistiigi pushed a commit to tonistiigi/gvisor that referenced this issue May 3, 2019
Updates google#210

PiperOrigin-RevId: 244959464
Change-Id: I5d61337b7e4c0f8a3063dcfc07791d4c4521ba1f
Upstream-commit: 56cadca
tonistiigi pushed a commit to tonistiigi/gvisor that referenced this issue May 3, 2019
Updates google#210

PiperOrigin-RevId: 244968438
Change-Id: Id461b5cbff2dea6fa55cfc108ea246d8f83da20b
Upstream-commit: 99b877f
@hbhasker hbhasker removed their assignment Apr 25, 2023
@majek
Contributor

majek commented Apr 26, 2023

@hbhasker I keep coming back to this.

Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
causes the ring to get corrupted by allowing multiple kernel threads
to claim ownership of the same ring entry

Do you happen to know when exactly this problem might happen with PACKET_MMAP v1/v2? Is it on RX or TX?

@hbhasker
Contributor Author

The bug itself is fixed in the latest kernels. I never implemented the TX part, so this happens on RX. That said, @kevinGC has already implemented AF_XDP-based RX/TX, and I would just use that instead of this. See https://github.com/google/gvisor/blob/master/pkg/tcpip/link/xdp/endpoint.go

@majek
Contributor

majek commented Apr 26, 2023

Interesting. I'm currently investigating vhost-net, and this might be an option as well. It allows for very fast tap device read/write; however, it undeniably exposes a larger host attack surface.

@hbhasker
Contributor Author

Vhost-net would be an interesting alternative to AF_XDP, since AF_XDP, if I remember correctly, can't support frames larger than 4K, so you lose host GRO/GSO support. Though @kevinGC has added GRO support to Netstack, which, combined with AF_XDP, might solve a large part of that.

Another alternative would be to use the sharedmem link endpoint. It is similar to vhost-net and allows setting up shared memory queues with another process running outside the sandbox, which can then handle packets and send them to the host or redirect them.

You would need to add support in Docker/containers for a CNI plugin that creates a NIC backed by an FD connected to the external process.

Currently the server endpoint waits on an eventfd to be notified when packets are available. But if a single process were to serve many sandboxes, you could burn a single core to spin across all sandboxes and process packets, reducing delays due to wakeup latencies, etc.
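
For what it's worth, a minimal sketch of the eventfd wait described above (names are hypothetical), which the single spinning core would replace:

package sharedmem

import "golang.org/x/sys/unix"

// waitAndDrain blocks on the eventfd until the peer signals that packets
// are queued in the shared-memory ring, then drains them. drain is a
// hypothetical callback that processes all queued packets.
func waitAndDrain(efd int, drain func()) error {
	var buf [8]byte
	for {
		if _, err := unix.Read(efd, buf[:]); err != nil {
			if err == unix.EINTR {
				continue
			}
			return err
		}
		drain()
	}
}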

@kevinGC
Collaborator

kevinGC commented Apr 27, 2023

Yes, AF_XDP is limited to 4K frames. Something to watch out for: supported frame size depends on the driver.

Note that our AF_XDP support is not "battle-hardened" like PACKET_MMAP is. If you do use it we'd definitely appreciate feedback and bug reports. Ideally it becomes default someday.

GRO can be enabled via a flag, e.g. --gvisor-gro=200000ns. It too should be on by default in time.

@github-actions

A friendly reminder that this issue had no activity for 120 days.

@github-actions github-actions bot added the stale-issue This issue has not been updated in 120 days. label Sep 15, 2023
@github-actions

This issue has been closed due to lack of activity.

@manninglucas
Contributor

manninglucas commented Apr 24, 2024

I tried my own hand at getting more speed with TPACKET_V3 and ran into a similar latency/throughput tradeoff. Even with the kernel fix @hbhasker mentioned, packet_mmap v1/v2 is still slower than recvmmsg: my testing shows it pulls 1-3 packets before blocking, while recvmmsg always gets 8.
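
For comparison, a minimal sketch of the recvmmsg batch receive (the mmsghdr layout mirrors the kernel's struct mmsghdr on 64-bit Linux; names are assumptions, not gVisor's actual rawfile code):

package rawsock

import (
	"unsafe"

	"golang.org/x/sys/unix"
)

// mmsghdr pairs a msghdr with the number of bytes received for that message.
type mmsghdr struct {
	Hdr unix.Msghdr
	Len uint32
	_   [4]byte // padding on 64-bit
}

// recvBatch issues a single recvmmsg(2), filling up to len(msgs) packets,
// e.g. a batch of 8 as mentioned above.
func recvBatch(fd int, msgs []mmsghdr) (int, error) {
	n, _, errno := unix.Syscall6(unix.SYS_RECVMMSG, uintptr(fd),
		uintptr(unsafe.Pointer(&msgs[0])), uintptr(len(msgs)), 0, 0, 0)
	if errno != 0 {
		return 0, errno
	}
	return int(n), nil
}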

@manninglucas manninglucas reopened this Apr 24, 2024
@github-actions github-actions bot removed auto-closed stale-issue This issue has not been updated in 120 days. labels Apr 25, 2024
@github-actions

A friendly reminder that this issue had no activity for 120 days.

@github-actions github-actions bot added the stale-issue This issue has not been updated in 120 days. label Aug 23, 2024
@kevinGC
Collaborator

kevinGC commented Sep 27, 2024

Closing this and opening a feature request at #10966. Dispatch is no longer buggy, but could use improvements.

@kevinGC kevinGC closed this as completed Sep 27, 2024