Message-ID: <CAEf4BzbYtaPf0jjoiv16iKWRKkv9ZTH_hBiZMUF+PkjVGOC53A@mail.gmail.com>
Date: Mon, 6 Oct 2025 15:10:54 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Xu Kuohai <xukuohai@...weicloud.com>
Cc: bpf@...r.kernel.org, linux-kselftest@...r.kernel.org,
linux-kernel@...r.kernel.org, Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, Andrii Nakryiko <andrii@...nel.org>,
Martin KaFai Lau <martin.lau@...ux.dev>, Eduard Zingerman <eddyz87@...il.com>, Yonghong Song <yhs@...com>,
Song Liu <song@...nel.org>
Subject: Re: [PATCH bpf-next v2 1/3] bpf: Add overwrite mode for bpf ring buffer
On Mon, Sep 29, 2025 at 6:41 AM Xu Kuohai <xukuohai@...weicloud.com> wrote:
>
> On 9/20/2025 6:10 AM, Andrii Nakryiko wrote:
> > On Fri, Sep 5, 2025 at 8:13 AM Xu Kuohai <xukuohai@...weicloud.com> wrote:
> >>
> >> From: Xu Kuohai <xukuohai@...wei.com>
> >>
> >> When the bpf ring buffer is full, new events can not be recorded util
> >
> > typo: until
> >
>
> ACK
>
> >> the consumer consumes some events to free space. This may cause critical
> >> events to be discarded, such as in fault diagnostics, where recent events
> >> are more critical than older ones.
> >>
> >> So add ovewrite mode for bpf ring buffer. In this mode, the new event
> >
> > overwrite, BPF
> >
>
> ACK
>
> >> overwrites the oldest event when the buffer is full.
> >>
> >> The scheme is as follows:
> >>
> >> 1. producer_pos tracks the next position to write new data. When there
> >> is enough free space, producer simply moves producer_pos forward to
> >> make space for the new event.
> >>
> >> 2. To avoid waiting for consumer to free space when the buffer is full,
> >> a new variable overwrite_pos is introduced for producer. overwrite_pos
> >> tracks the next event to be overwritten (the oldest event committed) in
> >> the buffer. producer moves it forward to discard the oldest events when
> >> the buffer is full.
> >>
> >> 3. pending_pos tracks the oldest event under committing. producer ensures
> >
> > "under committing" is confusing. Oldest event to be committed?
> >
>
> Yes, 'the oldest event to be committed'. Thanks!
>
> >> producer_pos never passes pending_pos when making space for new events.
> >> So multiple producers never write to the same position at the same time.
> >>
> >> 4. producer wakes up consumer every half a round ahead to give it a chance
> >> to retrieve data. However, for an overwrite-mode ring buffer, users
> >> typically only care about the ring buffer snapshot before a fault occurs.
> >> In this case, the producer should commit data with BPF_RB_NO_WAKEUP flag
> >> to avoid unnecessary wakeups.
> >>
> >> To make it clear, here are some example diagrams.
> >>
> >> 1. Let's say we have a ring buffer with size 4096.
> >>
> >> At first, {producer,overwrite,pending,consumer}_pos are all set to 0
> >>
> >> 0 512 1024 1536 2048 2560 3072 3584 4096
> >> +-----------------------------------------------------------------------+
> >> | |
> >> | |
> >> | |
> >> +-----------------------------------------------------------------------+
> >> ^
> >> |
> >> |
> >> producer_pos = 0
> >> overwrite_pos = 0
> >> pending_pos = 0
> >> consumer_pos = 0
> >>
> >> 2. Reserve event A, size 512.
> >>
> >> There is enough free space, so A is allocated at offset 0 and producer_pos
> >> is moved to 512, the end of A. Since A is not submitted, the BUSY bit is
> >> set.
> >>
> >> 0 512 1024 1536 2048 2560 3072 3584 4096
> >> +-----------------------------------------------------------------------+
> >> | | |
> >> | A | |
> >> | [BUSY] | |
> >> +-----------------------------------------------------------------------+
> >> ^ ^
> >> | |
> >> | |
> >> | producer_pos = 512
> >> |
> >> overwrite_pos = 0
> >> pending_pos = 0
> >> consumer_pos = 0
> >>
> >> 3. Reserve event B, size 1024.
> >>
> >> B is allocated at offset 512 with BUSY bit set, and producer_pos is moved
> >> to the end of B.
> >>
> >> 0 512 1024 1536 2048 2560 3072 3584 4096
> >> +-----------------------------------------------------------------------+
> >> | | | |
> >> | A | B | |
> >> | [BUSY] | [BUSY] | |
> >> +-----------------------------------------------------------------------+
> >> ^ ^
> >> | |
> >> | |
> >> | producer_pos = 1536
> >> |
> >> overwrite_pos = 0
> >> pending_pos = 0
> >> consumer_pos = 0
> >>
> >> 4. Reserve event C, size 2048.
> >>
> >> C is allocated at offset 1536 and producer_pos becomes 3584.
> >>
> >> 0 512 1024 1536 2048 2560 3072 3584 4096
> >> +-----------------------------------------------------------------------+
> >> | | | | |
> >> | A | B | C | |
> >> | [BUSY] | [BUSY] | [BUSY] | |
> >> +-----------------------------------------------------------------------+
> >> ^ ^
> >> | |
> >> | |
> >> | producer_pos = 3584
> >> |
> >> overwrite_pos = 0
> >> pending_pos = 0
> >> consumer_pos = 0
> >>
> >> 5. Submit event A.
> >>
> >> The BUSY bit of A is cleared. B becomes the oldest event under writing, so
> >
> > Now it's "under writing" :) To be committed? Or "pending committing"
> > or just "pending", I guess. But not under anything, it just confuses
> > readers. IMO.
> >
>
> Once again, 'oldest event to be committed'.
>
> I should check it with an AI agent first.
>
> >> pending_pos is moved to 512, the start of B.
> >>
> >> 0 512 1024 1536 2048 2560 3072 3584 4096
> >> +-----------------------------------------------------------------------+
> >> | | | | |
> >> | A | B | C | |
> >> | | [BUSY] | [BUSY] | |
> >> +-----------------------------------------------------------------------+
> >> ^ ^ ^
> >> | | |
> >> | | |
> >> | pending_pos = 512 producer_pos = 3584
> >> |
> >> overwrite_pos = 0
> >> consumer_pos = 0
> >>
> >> 6. Submit event B.
> >>
> >> The BUSY bit of B is cleared, and pending_pos is moved to the start of C,
> >> which is the oldest event under writing now.
> >
> > ditto
> >
>
> Again and again :(
>
> >>
> >> 0 512 1024 1536 2048 2560 3072 3584 4096
> >> +-----------------------------------------------------------------------+
> >> | | | | |
> >> | A | B | C | |
> >> | | | [BUSY] | |
> >> +-----------------------------------------------------------------------+
> >> ^ ^ ^
> >> | | |
> >> | | |
> >> | pending_pos = 1536 producer_pos = 3584
> >> |
> >> overwrite_pos = 0
> >> consumer_pos = 0
> >>
> >> 7. Reserve event D, size 1536 (3 * 512).
> >>
> >> There are 2048 bytes not under writing between producer_pos and pending_pos,
> >> so D is allocated at offset 3584, and producer_pos is moved from 3584 to
> >> 5120.
> >>
> >> Since event D will overwrite all bytes of event A and the begining 512 bytes
> >
> > typo: beginning, but really "first 512 bytes" would be clearer
> >
>
> OK, I’ll switch to 'first 512 bytes' for clarity.
>
> >> of event B, overwrite_pos is moved to the start of event C, the oldest event
> >> that is not overwritten.
> >>
> >> 0 512 1024 1536 2048 2560 3072 3584 4096
> >> +-----------------------------------------------------------------------+
> >> | | | | |
> >> | D End | | C | D Begin|
> >> | [BUSY] | | [BUSY] | [BUSY] |
> >> +-----------------------------------------------------------------------+
> >> ^ ^ ^
> >> | | |
> >> | | pending_pos = 1536
> >> | | overwrite_pos = 1536
> >> | |
> >> | producer_pos=5120
> >> |
> >> consumer_pos = 0
> >>
> >> 8. Reserve event E, size 1024.
> >>
> >> Though there are 512 bytes not under writing between producer_pos and
> >> pending_pos, E cannot be reserved, as it would overwrite the first 512
> >> bytes of event C, which is still under writing.
> >>
> >> 9. Submit events C and D.
> >>
> >> pending_pos is moved to the end of D.
> >>
> >> 0 512 1024 1536 2048 2560 3072 3584 4096
> >> +-----------------------------------------------------------------------+
> >> | | | | |
> >> | D End | | C | D Begin|
> >> | | | | |
> >> +-----------------------------------------------------------------------+
> >> ^ ^ ^
> >> | | |
> >> | | overwrite_pos = 1536
> >> | |
> >> | producer_pos=5120
> >> | pending_pos=5120
> >> |
> >> consumer_pos = 0
> >>
> >> The performance data for overwrite mode will be provided in a follow-up
> >> patch that adds overwrite mode benchs.
> >>
> >> A sample of performance data for non-overwrite mode on x86_64 and arm64
> >> CPUs, before and after this patch, is shown below. As we can see, no obvious
> >> performance regression occurs.
> >>
[...]
> >> Signed-off-by: Xu Kuohai <xukuohai@...wei.com>
> >> ---
> >> include/uapi/linux/bpf.h | 4 +
> >> kernel/bpf/ringbuf.c | 159 +++++++++++++++++++++++++++------
> >> tools/include/uapi/linux/bpf.h | 4 +
> >> 3 files changed, 141 insertions(+), 26 deletions(-)
> >>
[...]
> >> +
> >> + over_pos = READ_ONCE(rb->overwrite_pos);
> >> + return min(prod_pos - max(cons_pos, over_pos), rb->mask + 1);
> >
> > I'm trying to understand why you need to min with `rb->mask + 1`, can
> > you please elaborate?
>
>
> We need the min because rb->producer_pos and rb->overwrite_pos are read
> at different times. During this gap, a fast producer may wrap once or
> more, making over_pos larger than prod_pos.
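> For example (made-up numbers, 4096-byte buffer): we read
> prod_pos = 4096, then the producer wraps again and moves
> overwrite_pos to 8192 before we read it. prod_pos - over_pos then
> underflows to a huge unsigned value, and the min() caps the result
> at rb->mask + 1 = 4096.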
>
what if you read overwrite_pos before reading producer_pos? Then it
can't be larger than producer_pos and available data would be
producer_pos - max(consumer_pos, overwrite_pos)? would that work?
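
Something like this (totally untested sketch of what I mean, based on
the patched ringbuf_avail_data_sz()):

static unsigned long ringbuf_avail_data_sz(struct bpf_ringbuf *rb)
{
	unsigned long cons_pos, over_pos, prod_pos;

	cons_pos = smp_load_acquire(&rb->consumer_pos);
	/* reading overwrite_pos before producer_pos guarantees that
	 * over_pos can never be ahead of the prod_pos value we see
	 */
	over_pos = smp_load_acquire(&rb->overwrite_pos);
	prod_pos = smp_load_acquire(&rb->producer_pos);

	return prod_pos - max(cons_pos, over_pos);
}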
> > And also, at least for consistency, use smp_load_acquire() for overwrite_pos?
> >
>
> Using READ_ONCE here is to stay symmetric with __bpf_ringbuf_reserve(),
> where overwrite_pos is written with WRITE_ONCE() first, followed by
> smp_store_release() on producer_pos. So here we do
> smp_load_acquire(producer_pos) first, then READ_ONCE(overwrite_pos),
> to ensure a consistent view of the ring buffer.
>
> For consistency when reading consumer_pos and producer_pos, I’m fine with
> switching READ_ONCE to smp_load_acquire for overwrite_pos.
>
I'm not sure it matters much, but this function is called outside of
rb->spinlock, while __bpf_ringbuf_reserve() does hold a lock while
doing that WRITE_ONCE(). So it might not make any difference, but I
have a mild preference for smp_load_acquire() here.
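
For reference, the ordering pair being discussed is roughly this
(illustrative only, local variable names made up):

/* producer side, in __bpf_ringbuf_reserve(), under rb->spinlock */
WRITE_ONCE(rb->overwrite_pos, new_over_pos);
smp_store_release(&rb->producer_pos, new_prod_pos);

/* consumer side, no lock held */
prod_pos = smp_load_acquire(&rb->producer_pos);
over_pos = READ_ONCE(rb->overwrite_pos); /* or smp_load_acquire() */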
> >> }
> >>
> >> static u32 ringbuf_total_data_sz(const struct bpf_ringbuf *rb)
> >> @@ -402,11 +419,43 @@ bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr)
> >> return (void*)((addr & PAGE_MASK) - off);
> >> }
> >>
> >> +
[...]
> >> + /* In overwrite mode, move overwrite_pos to the next record to be
> >> + * overwritten if the ring buffer is full
> >> + */
> >
> > hm... here I think the important point is that we search for the next
> > record boundary up to which we need to overwrite data so that the
> > newly reserved record fits. "next record to be overwritten" isn't that
> > important (we might never need to overwrite it). What matters is
> > a) staying on record boundaries and b) consuming enough records to
> > fit the new one.
> >
> > Can you please update the comment to mention the above points?
> >
>
> Sure, I'll update the comment to:
>
> In overwrite mode, advance overwrite_pos when the ring buffer is full.
> The key points are to stay on record boundaries and consume enough
> records to fit the new one.
>
ok
[...]
>
> >> + unsigned long rec_pos,
> >> + unsigned long cons_pos,
> >> + u32 len, u64 flags)
> >> +{
> >> + unsigned long rec_end;
> >> +
> >> + if (flags & BPF_RB_FORCE_WAKEUP)
> >> + return true;
> >> +
> >> + if (flags & BPF_RB_NO_WAKEUP)
> >> + return false;
> >> +
> >> + /* for non-overwrite mode, if consumer caught up and is waiting for
> >> + * our record, notify about new data availability
> >> + */
> >> + if (likely(!rb->overwrite_mode))
> >> + return cons_pos == rec_pos;
> >> +
> >> + /* for overwrite mode, to give the consumer a chance to catch up
> >> + * before being overwritten, wake up consumer every half a round
> >> + * ahead.
> >> + */
> >> + rec_end = rec_pos + ringbuf_round_up_hdr_len(len);
> >> +
> >> + cons_pos &= (rb->mask >> 1);
> >> + rec_pos &= (rb->mask >> 1);
> >> + rec_end &= (rb->mask >> 1);
> >> +
> >> + if (cons_pos == rec_pos)
> >> + return true;
> >> +
> >> + if (rec_pos < cons_pos && cons_pos < rec_end)
> >> + return true;
> >> +
> >> + if (rec_end < rec_pos && (cons_pos > rec_pos || cons_pos < rec_end))
> >> + return true;
> >> +
> >
> > hm... ok, let's discuss this. Why do we need to do some half-round
> > heuristic for overwrite mode? If a consumer is falling behind, it
> > should be actively trying to catch up and doesn't need a notification
> > (that's the non-overwrite mode logic already).
> >
> > So there is more to this than a brief comment you left, can you please
> > elaborate?
> >
>
> The half-round wakeup was originally intended to work with libbpf in the
> v1 version. In that version, libbpf used a retry loop to safely copy data
> from the ring buffer that hadn’t been overwritten. By waking the consumer
> once every half round, there was always a period where the consumer and
> producer did not overlap, which helped reduce the number of retries.
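> As a concrete example (made-up numbers): with a 4096-byte buffer,
> mask >> 1 is 2047. If cons_pos = 512 and a record is committed at
> rec_pos = 2560, then (2560 & 2047) == (512 & 2047), i.e. the producer
> is exactly half a round ahead of the consumer, so a wakeup is sent.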
I can't say I completely grok the logic here, but do you think we
should still keep this half-round wakeup? It looks like an arbitrary
heuristic, so I'd rather not have it.
>
> > pw-bot: cr
> >
> >> + return false;
> >> +}
> >> +
> >> +static __always_inline
> >
> > we didn't have always_inline before, any strong reason to add it now?
> >
>
> I just wanted to avoid introducing any performance regression. Before this
> patch, bpf_ringbuf_commit() was automatically inlined by the compiler, but
> after the patch it wasn’t, so I added always_inline explicitly to keep it
> inlined.
how big of a difference was it in benchmarks? Using __always_inline
without a good reason is generally frowned upon.
[...]