Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing layer blob: Broken pipe #657

Open
cgwalters opened this issue Aug 30, 2024 · 9 comments
Open

Parsing layer blob: Broken pipe #657

cgwalters opened this issue Aug 30, 2024 · 9 comments

Comments

@cgwalters
Copy link
Member

stderr: ERROR: Switching: Pulling: Importing: Parsing layer blob sha256:4367367aae6325ce7351edb720e7e6929a7f369205b38fa88d140b7e3d0a274f: Broken pipe (os error 32)"

This one is like my enemy! I have a tracker over here for it coreos/rpm-ostree#4567 too


Discoveries so far:

More generally it's definitely a race condition; I can sometimes reproduce this by doing
ostree refs --delete ostree/container and then re-running the rebase.

Also of note: kola defaults to a uniprocessor VM, which I think is more likely to expose this race.

I'm quite certain it has something to do with the scheduling of us closing the pipe vs calling FinishPipe.

@ggiguash
Copy link

ggiguash commented Sep 4, 2024

@cgwalters , the problem came up again in MicroShift CI (see this log as an example).
Is there anything we can add to our CI on-error handler to produce debugging information to help troubleshoot this problem?

@eslutsky
Copy link

@cgwalters , our CI is in great pain because of this issue, is there a known workaround ?

@cgwalters
Copy link
Member Author

@cgwalters , our CI is in great pain because of this issue, is there a known workaround ?

Not currently other than retrying, which we should probably add some defaults to do that...

@ggiguash
Copy link

ggiguash commented Sep 20, 2024

@cgwalters , this problem is causing a lot of grief in MicroShift CI. It's all over either booting VMs, or even running tests that pull a new layer.

Is there a chance we add the retry w/a sooner than later in CentOS 9 and also backport it to RHEL 9.4? Once implementer, we should immediatelly provide feedback whether it made a difference.

@ggiguash
Copy link

I can "easily" reproduce this problem on a local VM by running the following command to alternate between the images (new -> original -> new -> etc.) without rebooting.

Is there any additional debug information I can provide to give a clue on why this is happening?

# while time bootc switch 192.168.100.1:5000/cos9-bootc-source-optionals --quiet ; do time bootc switch 192.168.100.1:5000/cos9-bootc-source --quiet; done
layers already present: 74; layers needed: 1 (45.6 MB)
Queued for next boot: 192.168.100.1:5000/cos9-bootc-source-optionals
  Version: stream9.20240913.0
  Digest: sha256:1c1de27d0e3d16513687381b702f168726352ae623c04e3272659d1adfddc301

real    0m18.082s
user    0m10.986s
sys     0m3.943s
No changes in 192.168.100.1:5000/cos9-bootc-source => sha256:f2e7c8c61b002dff80e33eb44320d5c9c1d272957227008715e51310be94cb2f
Pruned images: 1 (layers: 7, objsize: 330.1 MB)
Queued for next boot: 192.168.100.1:5000/cos9-bootc-source
  Version: stream9.20240913.0
  Digest: sha256:f2e7c8c61b002dff80e33eb44320d5c9c1d272957227008715e51310be94cb2f

real    0m8.375s
user    0m3.089s
sys     0m3.610s

...
...

layers already present: 68; layers needed: 7 (807.4 MB)
ERROR Switching: Pulling: Importing: Parsing layer blob sha256:b867f3d4a54bac425f7782d4953923c99729b03b19668cf3cbe6176ba4b557e3: Failed to invoke skopeo proxy method FinishPipe: remote error: write |1: broken pipe

real    0m20.450s
user    0m5.339s
sys     0m3.524s

On the registry server side, I'm seeing the following messages in the log when the problem occurs. Looks like the issue is on the client.

time="2024-09-25T05:21:48.560496809Z" level=info msg="response completed" go.version=go1.20.8 http.request.host="192.168.100.1:5000" http.request.id=ccfd5189-e2f8-41da-ad28-e99c8046c909 http.request.method=GET http.request.remoteaddr="192.168.100.151:50422" http.request.uri="/v2/cos9-bootc-source-optionals/blobs/sha256:b867f3d4a54bac425f7782d4953923c99729b03b19668cf3cbe6176ba4b557e3" http.request.useragent="skopeo/1.16.1" http.response.contenttype="application/octet-stream" http.response.duration=945.016873ms http.response.status=200 http.response.written=45551104 
192.168.100.151 - - [25/Sep/2024:05:21:47 +0000] "GET /v2/cos9-bootc-source-optionals/blobs/sha256:b867f3d4a54bac425f7782d4953923c99729b03b19668cf3cbe6176ba4b557e3 HTTP/1.1" 200 45551104 "" "skopeo/1.16.1"

@jeckersb
Copy link
Contributor

I can "easily" reproduce this problem on a local VM by running the following command to alternate between the images (new -> original -> new -> etc.) without rebooting.

I've been trying to reproduce this today without much success. I managed to do it one time by switching back and forth, and then I started poking at things with perf, but haven't been able to reproduce it anymore. I think I must have just gotten really lucky with the first success. I wouldn't expect perf to alter the timings enough to push this into Heisenbug territory since it's mostly "out-of-band", but who knows. I'll let it continue to run in a loop for a while and see if I can't get it to trigger again.

@cgwalters
Copy link
Member Author

As far as tweaking timing one thing to try is using e.g. strace fault injection to add random delays (especially to read+write), something like:

strace -f -e inject=read:when=1+3:delay_enter=100ms bootc ...

(Fine tune the injection step and delay as desired)

@jeckersb
Copy link
Contributor

I let it run for a bit more Wednesday, no luck.

Yesterday I took the opportunity to rebuild my dev setup with a fresh podman-machine, and latest podman-bootc from git. Then I let it run all day trying to reproduce. More precisely, what I did was:

  • Built latest bootc image from main following the instructions
  • Used podman-bootc to run that
  • I mirrored fedora-bootc:40 and fedora-bootc:41 into my local registry (I went directly at quay.io for while but started getting random timeouts)
  • Switched back and forth between 40 and 41 using: while perf record -e raw_syscalls:sys_exit --filter 'ret == -32' -a -g bootc switch --quiet 192.168.122.1:5000/fedora-bootc:40 && perf record -e raw_syscalls:sys_exit --filter 'ret == -32' -a -g bootc switch --quiet 192.168.122.1:5000/fedora-bootc:41; do :; done

Couldn't get it to reproduce, ran it for about 6 hours like that. Then I ran it again, except removing all of the perf bits just to see if that made a difference, but it did not. Still unable to reproduce it.

At this point it's driving me absolutely mad that I was able to reproduce it the one time within like 5 minutes of trying, and then never again 😠

@ggiguash
Copy link

ggiguash commented Sep 29, 2024

@jeckersb, I used the following commands to reproduce the error and I'm attaching the perf.data file.
What else can I do to help?

bootc usr-overlay
dnf install -y perf

SERV='192.168.100.1:5000'
PERF='perf record -e raw_syscalls:sys_exit --filter "ret==-32" -a -g'
CMD1='bootc switch --quiet $SERV/cos9-bootc-source-optionals'
CMD2='bootc switch --quiet $SERV/cos9-bootc-source'

printf "[[registry]]\nlocation = \"${SERV}\"\ninsecure = true\n" > /etc/containers/registries.conf.d/999-insecure-registry.conf

while true; do
    eval "$PERF $CMD1" || break
    eval "$PERF $CMD2" || break
done

The error was the "usual" one.

ERROR Switching: Pulling: Importing: Parsing layer blob sha256:d2dbfce5c205c1e1408e7efc2588dc4f9e06cbb71df6203cbbbf4149d0de80cb: Failed to invoke skopeo proxy method FinishPipe: remote error: write |1: broken pipe

perf.data.zip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants