ringbuf: fix corrupt samples on arm64 #1375

paulcacheux · 2024-03-13T12:12:15Z

I have been investigating some very rare event corruption in the output sample data coming from the eBPF ringbuffer. Looking at libbpf, and aya (rust equivalent) it seems we should be reading the len field in the header atomically. Matching this kernel operation

I have been able to confirm that this fixes my issue.

My current understanding is that for some reason, probably a mix of optimizer optimization and CPU trickery, we end up reading the correct len in the header but the busy bit is missing, meaning that we end up returning a sample with not yet commited data. I will expand on that once I improve my understanding of the issue.

ti-mo

Thanks, good catch. Looking at the code on the kernel side, it looks like we indeed need to be careful to at least read things in the right order.

In the interest of catching any future regressions, could you share a bit more about where you've seen this occur? Has your fix been rolled out through the datadog agent, or was this seen on internal systems only? Any chance of finding a reproducer?

internal/unix/types_other.go

paulcacheux · 2024-03-13T15:50:04Z

Basically the pattern we have is that we send events that look like

| cpu | timestamp | event type | rest of the data, can be of different length depending on the even type |

Recently we added some metrics on the event type and saw that in some rare cases the event type was completely unexpected, which led us to start the investigation leading to this PR.
The len on the other hand is not unexpected, it's always the length of our more common event types, not some garbage value or something like that.

I have not been able to create a reproducer yet, the only way I have currently to reproduce this is to deploy to a lot of machines and wait multiple hours.

lmb

Thanks, very interesting! Do you understand how user space can end up with the busy bit missing? From my perspective:

BPF reserves space in the ringbuf. This sets the BPF_RINGBUF_BUSY_BIT. This set is not atomic!
The kernel does an an atomic store to prod_pos.
User space does an atomic load of prod_pos.
User space copies len (non atomically). BPF_RINGBUF_BUSY_BIT is not present.

The only way I can see this happening is if:

The assignment of len = size | BUSY_BIT is decomposed into len = size; len |= BUSY in assembly.
The store to prod_pos is reordered to happen after len = size and before len |= BUSY.

The problem with this theory is that I don't understand why reading len atomically would fix this problem, since there is no atomic write to len on the kernel side? (At least until we hit the xchg in bpf_ringbuf_commit)

ringbuf/reader.go

ringbuf/ring.go

lmb · 2024-03-13T16:55:58Z

Follow up question: does happen on arm64 and amd64?

lmb · 2024-03-13T17:46:45Z

Here is len = size | BUSY_BIT on x86-64: https://gist.github.com/lmb/df04ba01510f344f4e18bbe90aca7fe4#file-disas-asm-L86-L93. Doesn't seem like the store could be reordered so that the busy bit isn't present.

paulcacheux · 2024-03-13T17:51:14Z

I just checked and I only see this corruption on arm64 hosts, should have checked this before this is good intel

Signed-off-by: Paul Cacheux <paul.cacheux@datadoghq.com>

Reading the len field of the event header needs to be atomic to match the release memory barrier from https://github.com/torvalds/linux/blob/eb26cbb1a754ccde5d4d74527dad5ba051808fad/kernel/bpf/ringbuf.c#L487 We also don't care about the PgOff field which is only used by ringbuf internals. [ Additional context by Lorenz ] Based on a suggestion by Daniel Borkmann I wrote a litmus test against the Linux Kernel Memory Model: C ringbuf {} P0(int *hdr_len, int *prod_pos, int *data) // Producer { // __bpf_ringbuf_reserve *hdr_len = 1; // BUSY smp_store_release(prod_pos, 1); // store some data. *data = 1; // bpf_ringbuf_commit int r0 = xchg(hdr_len, 2); // COMMIT } P1(int *hdr_len, int *prod_pos, int *data) // Consumer { int p = smp_load_acquire(prod_pos); if (p > 0) { int l = *hdr_len; if (l == 2) { int d = *data; } } } exists (prod_pos=1 /\ 1:l=2 /\ 1:d=0) Note that hdr_len is read via a plain read. Executing the litmus test shows that we can indeed observe a COMMIT (l=2) before data is visible: $ herd7 -conf linux-kernel.cfg litmus-tests/ringbuf.litmus Test ringbuf Allowed States 4 1:d=0; 1:l=0; [prod_pos]=1; 1:d=0; 1:l=1; [prod_pos]=1; 1:d=0; 1:l=2; [prod_pos]=1; 1:d=1; 1:l=2; [prod_pos]=1; Ok Witnesses Positive: 1 Negative: 3 Flag data-race Condition exists ([prod_pos]=1 /\ 1:l=2 /\ 1:d=0) Observation ringbuf Sometimes 1 3 Time ringbuf 0.01 Hash=7aed947102aa203443a5b9b5884f956f This is fixed by using an smp_load_acquire or equivalent to read hdr_len. Signed-off-by: Paul Cacheux <paul.cacheux@datadoghq.com> Co-developed-by: Lorenz Bauer <lmb@isovalent.com>

lmb · 2024-03-14T15:15:10Z

Fixed up the second commit to handle io.EOF correct and added some context to the commit message. I ended up "validating" this fix by writing a litmus test against the Linux Kernel Memory Model.

lmb

The reason this happens isn't because something weird happens in __bpf_ringbuf_reserve. The problem is that user space (aka we) don't enfore an ordering between hdr->len being visible and data being written into the ring, at least as far as I can understand it. This is a problem on arm64 since it has weaker coherency (?) guarantees by default.

paulcacheux · 2024-03-14T15:19:02Z

I have been trying to build a small reproducer all day without much success.. thanks a lot !

paulcacheux force-pushed the ringbuf-atomic-header branch from 20f2f28 to 5e944c6 Compare March 13, 2024 12:41

paulcacheux marked this pull request as ready for review March 13, 2024 12:49

paulcacheux requested a review from a team as a code owner March 13, 2024 12:49

paulcacheux force-pushed the ringbuf-atomic-header branch from 5e944c6 to f8ed272 Compare March 13, 2024 12:52

ti-mo reviewed Mar 13, 2024

View reviewed changes

internal/unix/types_other.go Outdated Show resolved Hide resolved

paulcacheux force-pushed the ringbuf-atomic-header branch from f8ed272 to 7496426 Compare March 13, 2024 15:51

lmb requested changes Mar 13, 2024

View reviewed changes

ringbuf/reader.go Show resolved Hide resolved

ringbuf/ring.go Show resolved Hide resolved

paulcacheux force-pushed the ringbuf-atomic-header branch from 7496426 to b013649 Compare March 13, 2024 17:50

ringbuf: replace ringbufHeaderSize with unix.BPF_RINGBUF_HDR_SZ

68149b4

Signed-off-by: Paul Cacheux <paul.cacheux@datadoghq.com>

lmb force-pushed the ringbuf-atomic-header branch from b013649 to 82ab9a9 Compare March 14, 2024 15:08

lmb force-pushed the ringbuf-atomic-header branch from 82ab9a9 to 02732d9 Compare March 14, 2024 15:13

lmb approved these changes Mar 14, 2024

View reviewed changes

lmb changed the title ~~ringbuf: fix rare event corruption by reading event header atomically~~ ringbuf: fix corrupt samples on arm64 Mar 14, 2024

lmb merged commit c37bef4 into cilium:main Mar 14, 2024
15 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ringbuf: fix corrupt samples on arm64 #1375

ringbuf: fix corrupt samples on arm64 #1375

paulcacheux commented Mar 13, 2024 •

edited

Loading

ti-mo left a comment

paulcacheux commented Mar 13, 2024

lmb left a comment

lmb commented Mar 13, 2024

lmb commented Mar 13, 2024

paulcacheux commented Mar 13, 2024 •

edited

Loading

lmb commented Mar 14, 2024

lmb left a comment

paulcacheux commented Mar 14, 2024

ringbuf: fix corrupt samples on arm64 #1375

ringbuf: fix corrupt samples on arm64 #1375

Conversation

paulcacheux commented Mar 13, 2024 • edited Loading

ti-mo left a comment

Choose a reason for hiding this comment

paulcacheux commented Mar 13, 2024

lmb left a comment

Choose a reason for hiding this comment

lmb commented Mar 13, 2024

lmb commented Mar 13, 2024

paulcacheux commented Mar 13, 2024 • edited Loading

lmb commented Mar 14, 2024

lmb left a comment

Choose a reason for hiding this comment

paulcacheux commented Mar 14, 2024

paulcacheux commented Mar 13, 2024 •

edited

Loading

paulcacheux commented Mar 13, 2024 •

edited

Loading