Parsing layer blob: Broken pipe #657

cgwalters · 2024-08-30T12:41:16Z

stderr: ERROR: Switching: Pulling: Importing: Parsing layer blob sha256:4367367aae6325ce7351edb720e7e6929a7f369205b38fa88d140b7e3d0a274f: Broken pipe (os error 32)"

This one is like my enemy! I have a tracker over here for it coreos/rpm-ostree#4567 too

Discoveries so far:

tar: Hold open input stream as long as possible #527 does not help (but doesn't hurt either)
This test ends up pulling a 1.5G layer
Retrying does work - so we're apparently writing the layer correctly

More generally it's definitely a race condition; I can sometimes reproduce this by doing
ostree refs --delete ostree/container and then re-running the rebase.

Also of note: kola defaults to a uniprocessor VM, which I think is more likely to expose this race.

I'm quite certain it has something to do with the scheduling of us closing the pipe vs calling FinishPipe.

The text was updated successfully, but these errors were encountered:

ggiguash · 2024-09-04T09:31:43Z

@cgwalters , the problem came up again in MicroShift CI (see this log as an example).
Is there anything we can add to our CI on-error handler to produce debugging information to help troubleshoot this problem?

eslutsky · 2024-09-17T12:56:07Z

@cgwalters , our CI is in great pain because of this issue, is there a known workaround ?

cgwalters · 2024-09-19T19:09:24Z

@cgwalters , our CI is in great pain because of this issue, is there a known workaround ?

Not currently other than retrying, which we should probably add some defaults to do that...

ggiguash · 2024-09-20T06:29:53Z

@cgwalters , this problem is causing a lot of grief in MicroShift CI. It's all over either booting VMs, or even running tests that pull a new layer.

Is there a chance we add the retry w/a sooner than later in CentOS 9 and also backport it to RHEL 9.4? Once implementer, we should immediatelly provide feedback whether it made a difference.

ggiguash · 2024-09-25T05:30:26Z

I can "easily" reproduce this problem on a local VM by running the following command to alternate between the images (new -> original -> new -> etc.) without rebooting.

Is there any additional debug information I can provide to give a clue on why this is happening?

# while time bootc switch 192.168.100.1:5000/cos9-bootc-source-optionals --quiet ; do time bootc switch 192.168.100.1:5000/cos9-bootc-source --quiet; done
layers already present: 74; layers needed: 1 (45.6 MB)
Queued for next boot: 192.168.100.1:5000/cos9-bootc-source-optionals
  Version: stream9.20240913.0
  Digest: sha256:1c1de27d0e3d16513687381b702f168726352ae623c04e3272659d1adfddc301

real    0m18.082s
user    0m10.986s
sys     0m3.943s
No changes in 192.168.100.1:5000/cos9-bootc-source => sha256:f2e7c8c61b002dff80e33eb44320d5c9c1d272957227008715e51310be94cb2f
Pruned images: 1 (layers: 7, objsize: 330.1 MB)
Queued for next boot: 192.168.100.1:5000/cos9-bootc-source
  Version: stream9.20240913.0
  Digest: sha256:f2e7c8c61b002dff80e33eb44320d5c9c1d272957227008715e51310be94cb2f

real    0m8.375s
user    0m3.089s
sys     0m3.610s

...
...

layers already present: 68; layers needed: 7 (807.4 MB)
ERROR Switching: Pulling: Importing: Parsing layer blob sha256:b867f3d4a54bac425f7782d4953923c99729b03b19668cf3cbe6176ba4b557e3: Failed to invoke skopeo proxy method FinishPipe: remote error: write |1: broken pipe

real    0m20.450s
user    0m5.339s
sys     0m3.524s

On the registry server side, I'm seeing the following messages in the log when the problem occurs. Looks like the issue is on the client.

time="2024-09-25T05:21:48.560496809Z" level=info msg="response completed" go.version=go1.20.8 http.request.host="192.168.100.1:5000" http.request.id=ccfd5189-e2f8-41da-ad28-e99c8046c909 http.request.method=GET http.request.remoteaddr="192.168.100.151:50422" http.request.uri="/v2/cos9-bootc-source-optionals/blobs/sha256:b867f3d4a54bac425f7782d4953923c99729b03b19668cf3cbe6176ba4b557e3" http.request.useragent="skopeo/1.16.1" http.response.contenttype="application/octet-stream" http.response.duration=945.016873ms http.response.status=200 http.response.written=45551104 
192.168.100.151 - - [25/Sep/2024:05:21:47 +0000] "GET /v2/cos9-bootc-source-optionals/blobs/sha256:b867f3d4a54bac425f7782d4953923c99729b03b19668cf3cbe6176ba4b557e3 HTTP/1.1" 200 45551104 "" "skopeo/1.16.1"

jeckersb · 2024-09-25T18:42:01Z

I can "easily" reproduce this problem on a local VM by running the following command to alternate between the images (new -> original -> new -> etc.) without rebooting.

I've been trying to reproduce this today without much success. I managed to do it one time by switching back and forth, and then I started poking at things with perf, but haven't been able to reproduce it anymore. I think I must have just gotten really lucky with the first success. I wouldn't expect perf to alter the timings enough to push this into Heisenbug territory since it's mostly "out-of-band", but who knows. I'll let it continue to run in a loop for a while and see if I can't get it to trigger again.

cgwalters · 2024-09-27T09:08:54Z

As far as tweaking timing one thing to try is using e.g. strace fault injection to add random delays (especially to read+write), something like:

strace -f -e inject=read:when=1+3:delay_enter=100ms bootc ...

(Fine tune the injection step and delay as desired)

jeckersb · 2024-09-27T14:15:27Z

I let it run for a bit more Wednesday, no luck.

Yesterday I took the opportunity to rebuild my dev setup with a fresh podman-machine, and latest podman-bootc from git. Then I let it run all day trying to reproduce. More precisely, what I did was:

Built latest bootc image from main following the instructions
Used podman-bootc to run that
I mirrored fedora-bootc:40 and fedora-bootc:41 into my local registry (I went directly at quay.io for while but started getting random timeouts)
Switched back and forth between 40 and 41 using: while perf record -e raw_syscalls:sys_exit --filter 'ret == -32' -a -g bootc switch --quiet 192.168.122.1:5000/fedora-bootc:40 && perf record -e raw_syscalls:sys_exit --filter 'ret == -32' -a -g bootc switch --quiet 192.168.122.1:5000/fedora-bootc:41; do :; done

Couldn't get it to reproduce, ran it for about 6 hours like that. Then I ran it again, except removing all of the perf bits just to see if that made a difference, but it did not. Still unable to reproduce it.

At this point it's driving me absolutely mad that I was able to reproduce it the one time within like 5 minutes of trying, and then never again 😠

ggiguash · 2024-09-29T07:40:19Z

@jeckersb, I used the following commands to reproduce the error and I'm attaching the perf.data file.
What else can I do to help?

bootc usr-overlay
dnf install -y perf

SERV='192.168.100.1:5000'
PERF='perf record -e raw_syscalls:sys_exit --filter "ret==-32" -a -g'
CMD1='bootc switch --quiet $SERV/cos9-bootc-source-optionals'
CMD2='bootc switch --quiet $SERV/cos9-bootc-source'

printf "[[registry]]\nlocation = \"${SERV}\"\ninsecure = true\n" > /etc/containers/registries.conf.d/999-insecure-registry.conf

while true; do
    eval "$PERF $CMD1" || break
    eval "$PERF $CMD2" || break
done

The error was the "usual" one.

ERROR Switching: Pulling: Importing: Parsing layer blob sha256:d2dbfce5c205c1e1408e7efc2588dc4f9e06cbb71df6203cbbbf4149d0de80cb: Failed to invoke skopeo proxy method FinishPipe: remote error: write |1: broken pipe

perf.data.zip

cgwalters mentioned this issue Aug 30, 2024

intermittent failure in ext.rpm-ostree.destructive.container-image coreos/rpm-ostree#4567

Closed

ggiguash mentioned this issue Sep 22, 2024

USHIFT-4518: Add retries on rebasing bootc or ostree systems openshift/microshift#3952

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parsing layer blob: Broken pipe #657

Parsing layer blob: Broken pipe #657

cgwalters commented Aug 30, 2024

ggiguash commented Sep 4, 2024

eslutsky commented Sep 17, 2024

cgwalters commented Sep 19, 2024

ggiguash commented Sep 20, 2024 •

edited

Loading

ggiguash commented Sep 25, 2024

jeckersb commented Sep 25, 2024

cgwalters commented Sep 27, 2024

jeckersb commented Sep 27, 2024

ggiguash commented Sep 29, 2024 •

edited

Loading

Parsing layer blob: Broken pipe #657

Parsing layer blob: Broken pipe #657

Comments

cgwalters commented Aug 30, 2024

ggiguash commented Sep 4, 2024

eslutsky commented Sep 17, 2024

cgwalters commented Sep 19, 2024

ggiguash commented Sep 20, 2024 • edited Loading

ggiguash commented Sep 25, 2024

jeckersb commented Sep 25, 2024

cgwalters commented Sep 27, 2024

jeckersb commented Sep 27, 2024

ggiguash commented Sep 29, 2024 • edited Loading

ggiguash commented Sep 20, 2024 •

edited

Loading

ggiguash commented Sep 29, 2024 •

edited

Loading