
runsc: PacketMMAP dispatch mode is buggy. #210

Closed
hbhasker opened this issue Apr 23, 2019 · 11 comments
Labels
area: networking Issue related to networking priority: p3 Low priority stale-issue This issue has not been updated in 120 days. type: bug Something isn't working

Comments

@hbhasker
Contributor

TPACKET_V1/V2 PACKET_RX|TX_RING buffers are affected by a kernel bug where multiple kernel threads can incorrectly assume control of a ring buffer entry, leaving the ring buffer out of sync with user-space. In our dispatcher this manifests as getting stuck in a poll loop: poll returns readable, but the frame at our current ring offset does not have the user bit (TP_STATUS_USER) set.

This completely stalls the network. It seems easier to trigger when there are many connections to an external destination streaming large amounts of data in.

The bug is described in this thread:

https://www.mail-archive.com/[email protected]/msg237222.html

This is not fixed in the kernel. It can be sidestepped by increasing the number of frames, which reduces the likelihood of hitting the bug. But from my experiments, using smaller frames is not worth it: it requires GRO to be disabled, and in a VM GRO/TSO is almost impossible to disable completely. See
http://patchwork.ozlabs.org/patch/1016373/

For now we should revert runsc to the RecvMMsg dispatch mode while we change the ring code to use TPACKET_V3 instead of v1.
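
For reference, here is a minimal sketch of the v1/v2 RX loop where the stall shows up (names and layout are assumptions for illustration, not gVisor's actual dispatcher code):

package dispatch

import (
	"sync/atomic"
	"unsafe"

	"golang.org/x/sys/unix"
)

// rxLoop walks a mapped PACKET_RX_RING one frame at a time. With the kernel
// bug, poll can keep reporting the socket readable while the frame at the
// current offset never gains TP_STATUS_USER, so the loop spins forever.
func rxLoop(fd int, ring []byte, frameSize int) error {
	off := 0
	for {
		pfd := []unix.PollFd{{Fd: int32(fd), Events: unix.POLLIN}}
		if _, err := unix.Poll(pfd, -1); err != nil {
			return err
		}
		// tp_status is the first word of the frame header; it must be read
		// atomically because the kernel writes it concurrently.
		status := atomic.LoadUint32((*uint32)(unsafe.Pointer(&ring[off])))
		if status&unix.TP_STATUS_USER == 0 {
			continue // readable per poll, but no frame for us: the stall
		}
		// ... process ring[off : off+frameSize], then return the frame:
		atomic.StoreUint32((*uint32)(unsafe.Pointer(&ring[off])), unix.TP_STATUS_KERNEL)
		off = (off + frameSize) % len(ring)
	}
}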

@hbhasker hbhasker self-assigned this Apr 23, 2019
@hbhasker hbhasker added area: networking Issue related to networking type: bug Something isn't working priority: p1 High priority labels Apr 23, 2019
shentubot pushed a commit that referenced this issue Apr 24, 2019
This CL fixes the following bugs:

- Uses atomic operations to set/read status instead of binary.LittleEndian.PutUint32
  etc., which are not atomic.

- Increments ringOffsets for frames that are truncated (i.e., status is
  tpStatusCopy).

- Does not ignore frames with the tpStatusLost bit set, as they are valid frames;
  the bit only indicates that some frames were lost before this one, and loss
  metrics can be retrieved with a getsockopt call.

- Adds checks to make sure blockSize is a multiple of the page size. This is
  required as the kernel allocates whole pages per block and rejects sizes that
  are not page aligned with EINVAL.

Updates #210

PiperOrigin-RevId: 244959464
Change-Id: I5d61337b7e4c0f8a3063dcfc07791d4c4521ba1f
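
To illustrate the first and last fixes, here is a minimal sketch of the atomic status access and the blockSize check (helper names are assumptions, not the CL's actual code):

package dispatch

import (
	"fmt"
	"os"
	"sync/atomic"
	"unsafe"
)

// frameStatus reads a frame's tp_status word atomically; plain reads via
// binary.LittleEndian can race with the kernel's concurrent updates.
func frameStatus(ring []byte, off int) uint32 {
	return atomic.LoadUint32((*uint32)(unsafe.Pointer(&ring[off])))
}

// checkBlockSize mirrors the kernel's constraint: blocks are allocated in
// whole pages, so a non-page-aligned size is rejected with EINVAL.
func checkBlockSize(blockSize int) error {
	if blockSize%os.Getpagesize() != 0 {
		return fmt.Errorf("blockSize %d is not a multiple of the page size %d",
			blockSize, os.Getpagesize())
	}
	return nil
}
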
shentubot pushed a commit that referenced this issue Apr 24, 2019
PacketMMap mode has issues due to a kernel bug. This change
reverts us to using recvmmsg instead of a shared ring buffer to
dispatch inbound packets. This will reduce performance but should
be more stable under heavy load until PacketMMap is updated to
use TPACKET_V3.

See #210 for details.

Perf difference between recvmmsg and packetmmap:

RecvMMsg :
iperf3 -c 172.17.0.2
Connecting to host 172.17.0.2, port 5201
[  4] local 172.17.0.1 port 43478 connected to 172.17.0.2 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   778 MBytes  6.53 Gbits/sec  4349    188 KBytes
[  4]   1.00-2.00   sec   786 MBytes  6.59 Gbits/sec  4395    212 KBytes
[  4]   2.00-3.00   sec   756 MBytes  6.34 Gbits/sec  3655    161 KBytes
[  4]   3.00-4.00   sec   782 MBytes  6.56 Gbits/sec  4419    175 KBytes
[  4]   4.00-5.00   sec   755 MBytes  6.34 Gbits/sec  4317    187 KBytes
[  4]   5.00-6.00   sec   774 MBytes  6.49 Gbits/sec  4002    173 KBytes
[  4]   6.00-7.00   sec   737 MBytes  6.18 Gbits/sec  3904    191 KBytes
[  4]   7.00-8.00   sec   530 MBytes  4.44 Gbits/sec  3318    189 KBytes
[  4]   8.00-9.00   sec   487 MBytes  4.09 Gbits/sec  2627    188 KBytes
[  4]   9.00-10.00  sec   770 MBytes  6.46 Gbits/sec  4221    170 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  6.99 GBytes  6.00 Gbits/sec  39207             sender
[  4]   0.00-10.00  sec  6.99 GBytes  6.00 Gbits/sec                  receiver

iperf Done.

PacketMMap:

bhaskerh@gvisor-bench:~/tensorflow$ iperf3 -c 172.17.0.2
Connecting to host 172.17.0.2, port 5201
[  4] local 172.17.0.1 port 43496 connected to 172.17.0.2 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   657 MBytes  5.51 Gbits/sec    0   1.01 MBytes
[  4]   1.00-2.00   sec  1021 MBytes  8.56 Gbits/sec    0   1.01 MBytes
[  4]   2.00-3.00   sec  1.21 GBytes  10.4 Gbits/sec   45   1.01 MBytes
[  4]   3.00-4.00   sec  1018 MBytes  8.54 Gbits/sec   15   1.01 MBytes
[  4]   4.00-5.00   sec  1.28 GBytes  11.0 Gbits/sec   45   1.01 MBytes
[  4]   5.00-6.00   sec  1.38 GBytes  11.9 Gbits/sec    0   1.01 MBytes
[  4]   6.00-7.00   sec  1.34 GBytes  11.5 Gbits/sec   45    856 KBytes
[  4]   7.00-8.00   sec  1.23 GBytes  10.5 Gbits/sec    0    901 KBytes
[  4]   8.00-9.00   sec  1010 MBytes  8.48 Gbits/sec    0    923 KBytes
[  4]   9.00-10.00  sec  1.39 GBytes  11.9 Gbits/sec    0    960 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  11.4 GBytes  9.83 Gbits/sec  150             sender
[  4]   0.00-10.00  sec  11.4 GBytes  9.83 Gbits/sec                  receiver

Updates #210

PiperOrigin-RevId: 244968438
Change-Id: Id461b5cbff2dea6fa55cfc108ea246d8f83da20b
@hbhasker
Contributor Author

Just an update.

I implemented a TPACKET_V3-based dispatcher. While it does not suffer from the bug in TPACKET_V1/V2 and achieves good throughput on a loaded network, it does suffer from latency issues when the network is not busy.

This is because each block is filled one at a time by the kernel, and skb->data_ready() is invoked only when the block is full or the block timeout expires. The lowest resolution for the block timeout is 1 ms, which is far too coarse for high-speed networking and introduces a large amount of delay.

A simple ping of the host from gVisor can see latencies go from 0.4 ms to 7 ms. TPACKET_V3 is not usable unless a finer-resolution timer is added.

The only path forward for PACKET_RX_RING will be to wait for a fix to the kernel bug.

I believe in the short term we should stick with recvmmsg. I am going to downgrade the priority of this bug as we don't really have a path forward at the moment to use PACKET_RX_RING.
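
For context, a minimal sketch of the TPACKET_V3 ring setup (sizes are illustrative assumptions); Retire_blk_tov is expressed in milliseconds, so 1 ms is the finest block timeout available, which is where the idle-network latency comes from:

package setup

import "golang.org/x/sys/unix"

// setupV3Ring configures a TPACKET_V3 RX ring on an AF_PACKET socket fd.
// All sizes below are assumptions for illustration.
func setupV3Ring(fd int) error {
	if err := unix.SetsockoptInt(fd, unix.SOL_PACKET, unix.PACKET_VERSION, unix.TPACKET_V3); err != nil {
		return err
	}
	req := unix.TpacketReq3{
		Block_size:     64 * 1024, // must be a multiple of the page size
		Block_nr:       32,
		Frame_size:     2048,
		Frame_nr:       (64 * 1024 / 2048) * 32, // frames per block * blocks
		Retire_blk_tov: 1,                       // block timeout in ms; 1 is the minimum
	}
	return unix.SetsockoptTpacketReq3(fd, unix.SOL_PACKET, unix.PACKET_RX_RING, &req)
}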

@hbhasker hbhasker added priority: p3 Low priority and removed priority: p1 High priority labels Apr 25, 2019
@hbhasker hbhasker changed the title from "runsc: PacketMMAP dispatch mode is buggy switch to TPacketV3" to "runsc: PacketMMAP dispatch mode is buggy." Apr 25, 2019
tonistiigi pushed a commit to tonistiigi/gvisor that referenced this issue May 3, 2019
Updates google#210

PiperOrigin-RevId: 244959464
Change-Id: I5d61337b7e4c0f8a3063dcfc07791d4c4521ba1f
Upstream-commit: 56cadca
tonistiigi pushed a commit to tonistiigi/gvisor that referenced this issue May 3, 2019
Updates google#210

PiperOrigin-RevId: 244968438
Change-Id: Id461b5cbff2dea6fa55cfc108ea246d8f83da20b
Upstream-commit: 99b877f
@hbhasker hbhasker removed their assignment Apr 25, 2023
@majek
Contributor

majek commented Apr 26, 2023

@hbhasker I keep coming back to this.

Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
causes the ring to get corrupted by allowing multiple kernel threads
to claim ownership of the same ring entry

Do you happen to know when exactly this problem might happen with PACKET_MMAP v1/v2? Is it on RX or TX?

@hbhasker
Contributor Author

The bug itself is fixed in the latest kernels. I never implemented the TX part, so this happens on RX. That said, @kevinGC has already implemented AF_XDP-based RX/TX, and I would just use that instead of this. See https://github.com/google/gvisor/blob/master/pkg/tcpip/link/xdp/endpoint.go

@majek
Contributor

majek commented Apr 26, 2023

Interesting. I'm currently investigating vhost-net, and this might be an option as well. It allows for very fast tap device read/write; however, it undeniably exposes a larger host attack surface.

@hbhasker
Contributor Author

Vhost-net would be an interesting alternative to AF_XDP, since AF_XDP, if I remember correctly, can't support frames larger than 4K, so you lose host GRO/GSO support. Though @kevinGC has added GRO support to Netstack, which, combined with AF_XDP, might solve a large part of that.

Another alternative would be to use the sharedmem link endpoint. It is similar to vhost-net and allows setting up shared memory queues with another process running outside the sandbox, which can then handle packets and send them to the host or redirect them.

You would need to add support in Docker/containers for a CNI plugin that creates a NIC backed by an FD connected to the external process.

Currently the server endpoint waits on an eventfd to be notified when packets are available. But if a single process were to serve many sandboxes, you could burn a single core to spin across all sandboxes and process packets, reducing delays due to wakeup latencies, etc.
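
For what it's worth, a minimal sketch of the eventfd wait described above (names are hypothetical), which the single spinning core would replace:

package sharedmem

import "golang.org/x/sys/unix"

// waitAndDrain blocks on the eventfd until the peer signals that packets
// are queued in the shared-memory ring, then drains them. drain is a
// hypothetical callback that processes all queued packets.
func waitAndDrain(efd int, drain func()) error {
	var buf [8]byte
	for {
		if _, err := unix.Read(efd, buf[:]); err != nil {
			if err == unix.EINTR {
				continue
			}
			return err
		}
		drain()
	}
}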

@kevinGC
Collaborator

kevinGC commented Apr 27, 2023

Yes, AF_XDP is limited to 4K frames. Something to watch out for: supported frame size depends on the driver.

Note that our AF_XDP support is not "battle-hardened" like PACKET_MMAP is. If you do use it we'd definitely appreciate feedback and bug reports. Ideally it becomes default someday.

GRO can be enabled via a flag, e.g. --gvisor-gro=200000ns. It too should be on by default in time.

@github-actions

A friendly reminder that this issue had no activity for 120 days.

@github-actions github-actions bot added the stale-issue This issue has not been updated in 120 days. label Sep 15, 2023
@github-actions

This issue has been closed due to lack of activity.

@manninglucas
Contributor

manninglucas commented Apr 24, 2024

I tried my own hand at getting more speed with TPACKET_V3 and ran into a similar latency/throughput tradeoff. Even with the kernel fix @hbhasker mentioned, packet_mmap v1/v2 is still slower than recvmmsg: my testing shows it pulls 1-3 packets before blocking, while recvmmsg always gets 8.
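
For comparison, a minimal sketch of the recvmmsg batch receive (the mmsghdr layout mirrors the kernel's struct mmsghdr on 64-bit Linux; names are assumptions, not gVisor's actual rawfile code):

package rawsock

import (
	"unsafe"

	"golang.org/x/sys/unix"
)

// mmsghdr pairs a msghdr with the number of bytes received for that message.
type mmsghdr struct {
	Hdr unix.Msghdr
	Len uint32
	_   [4]byte // padding on 64-bit
}

// recvBatch issues a single recvmmsg(2), filling up to len(msgs) packets,
// e.g. a batch of 8 as mentioned above.
func recvBatch(fd int, msgs []mmsghdr) (int, error) {
	n, _, errno := unix.Syscall6(unix.SYS_RECVMMSG, uintptr(fd),
		uintptr(unsafe.Pointer(&msgs[0])), uintptr(len(msgs)), 0, 0, 0)
	if errno != 0 {
		return 0, errno
	}
	return int(n), nil
}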

@manninglucas manninglucas reopened this Apr 24, 2024
@github-actions github-actions bot removed auto-closed stale-issue This issue has not been updated in 120 days. labels Apr 25, 2024
@github-actions

A friendly reminder that this issue had no activity for 120 days.

@github-actions github-actions bot added the stale-issue This issue has not been updated in 120 days. label Aug 23, 2024
@kevinGC
Collaborator

kevinGC commented Sep 27, 2024

Closing this and opening a feature request at #10966. Dispatch is no longer buggy, but could use improvements.

@kevinGC kevinGC closed this as completed Sep 27, 2024