Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ringbuf: fix corrupt samples on arm64 #1375

Merged
merged 2 commits into from
Mar 14, 2024

Conversation

paulcacheux
Copy link
Contributor

@paulcacheux paulcacheux commented Mar 13, 2024

I have been investigating some very rare event corruption in the output sample data coming from the eBPF ringbuffer. Looking at libbpf, and aya (rust equivalent) it seems we should be reading the len field in the header atomically. Matching this kernel operation

I have been able to confirm that this fixes my issue.

My current understanding is that for some reason, probably a mix of optimizer optimization and CPU trickery, we end up reading the correct len in the header but the busy bit is missing, meaning that we end up returning a sample with not yet commited data. I will expand on that once I improve my understanding of the issue.

@paulcacheux paulcacheux marked this pull request as ready for review March 13, 2024 12:49
@paulcacheux paulcacheux requested a review from a team as a code owner March 13, 2024 12:49
Copy link
Collaborator

@ti-mo ti-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, good catch. Looking at the code on the kernel side, it looks like we indeed need to be careful to at least read things in the right order.

In the interest of catching any future regressions, could you share a bit more about where you've seen this occur? Has your fix been rolled out through the datadog agent, or was this seen on internal systems only? Any chance of finding a reproducer?

internal/unix/types_other.go Outdated Show resolved Hide resolved
@paulcacheux
Copy link
Contributor Author

Basically the pattern we have is that we send events that look like

| cpu | timestamp | event type | rest of the data, can be of different length depending on the even type |  

Recently we added some metrics on the event type and saw that in some rare cases the event type was completely unexpected, which led us to start the investigation leading to this PR.
The len on the other hand is not unexpected, it's always the length of our more common event types, not some garbage value or something like that.

I have not been able to create a reproducer yet, the only way I have currently to reproduce this is to deploy to a lot of machines and wait multiple hours.

Copy link
Collaborator

@lmb lmb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, very interesting! Do you understand how user space can end up with the busy bit missing? From my perspective:

  1. BPF reserves space in the ringbuf. This sets the BPF_RINGBUF_BUSY_BIT. This set is not atomic!
  2. The kernel does an an atomic store to prod_pos.
  3. User space does an atomic load of prod_pos.
  4. User space copies len (non atomically). BPF_RINGBUF_BUSY_BIT is not present.

The only way I can see this happening is if:

  • The assignment of len = size | BUSY_BIT is decomposed into len = size; len |= BUSY in assembly.
  • The store to prod_pos is reordered to happen after len = size and before len |= BUSY.

The problem with this theory is that I don't understand why reading len atomically would fix this problem, since there is no atomic write to len on the kernel side? (At least until we hit the xchg in bpf_ringbuf_commit)

ringbuf/reader.go Show resolved Hide resolved
ringbuf/ring.go Show resolved Hide resolved
@lmb
Copy link
Collaborator

lmb commented Mar 13, 2024

Follow up question: does happen on arm64 and amd64?

@lmb
Copy link
Collaborator

lmb commented Mar 13, 2024

Here is len = size | BUSY_BIT on x86-64: https://gist.github.com/lmb/df04ba01510f344f4e18bbe90aca7fe4#file-disas-asm-L86-L93. Doesn't seem like the store could be reordered so that the busy bit isn't present.

@paulcacheux
Copy link
Contributor Author

paulcacheux commented Mar 13, 2024

I just checked and I only see this corruption on arm64 hosts, should have checked this before this is good intel

Signed-off-by: Paul Cacheux <paul.cacheux@datadoghq.com>
Reading the len field of the event header needs to be atomic to match
the release memory barrier from
https://github.com/torvalds/linux/blob/eb26cbb1a754ccde5d4d74527dad5ba051808fad/kernel/bpf/ringbuf.c#L487

We also don't care about the PgOff field which is only used by ringbuf
internals.

[ Additional context by Lorenz ]

Based on a suggestion by Daniel Borkmann I wrote a litmus test against
the Linux Kernel Memory Model:

    C ringbuf

    {}

    P0(int *hdr_len, int *prod_pos, int *data) // Producer
    {
        // __bpf_ringbuf_reserve
        *hdr_len = 1; // BUSY
        smp_store_release(prod_pos, 1);

        // store some data.
        *data = 1;

        // bpf_ringbuf_commit
        int r0 = xchg(hdr_len, 2); // COMMIT
    }

    P1(int *hdr_len, int *prod_pos, int *data) // Consumer
    {
        int p = smp_load_acquire(prod_pos);
        if (p > 0) {
            int l = *hdr_len;
            if (l == 2) {
                int d = *data;
            }
        }
    }

    exists (prod_pos=1 /\ 1:l=2 /\ 1:d=0)

Note that hdr_len is read via a plain read. Executing the litmus test
shows that we can indeed observe a COMMIT (l=2) before data is visible:

    $ herd7 -conf linux-kernel.cfg litmus-tests/ringbuf.litmus
    Test ringbuf Allowed
    States 4
    1:d=0; 1:l=0; [prod_pos]=1;
    1:d=0; 1:l=1; [prod_pos]=1;
    1:d=0; 1:l=2; [prod_pos]=1;
    1:d=1; 1:l=2; [prod_pos]=1;
    Ok
    Witnesses
    Positive: 1 Negative: 3
    Flag data-race
    Condition exists ([prod_pos]=1 /\ 1:l=2 /\ 1:d=0)
    Observation ringbuf Sometimes 1 3
    Time ringbuf 0.01
    Hash=7aed947102aa203443a5b9b5884f956f

This is fixed by using an smp_load_acquire or equivalent to read hdr_len.

Signed-off-by: Paul Cacheux <paul.cacheux@datadoghq.com>
Co-developed-by: Lorenz Bauer <lmb@isovalent.com>
@lmb
Copy link
Collaborator

lmb commented Mar 14, 2024

Fixed up the second commit to handle io.EOF correct and added some context to the commit message. I ended up "validating" this fix by writing a litmus test against the Linux Kernel Memory Model.

Copy link
Collaborator

@lmb lmb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason this happens isn't because something weird happens in __bpf_ringbuf_reserve. The problem is that user space (aka we) don't enfore an ordering between hdr->len being visible and data being written into the ring, at least as far as I can understand it. This is a problem on arm64 since it has weaker coherency (?) guarantees by default.

@lmb lmb changed the title ringbuf: fix rare event corruption by reading event header atomically ringbuf: fix corrupt samples on arm64 Mar 14, 2024
@lmb lmb merged commit c37bef4 into cilium:main Mar 14, 2024
15 checks passed
@paulcacheux
Copy link
Contributor Author

I have been trying to build a small reproducer all day without much success.. thanks a lot !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants