Message-ID: <CAEf4BzbYtaPf0jjoiv16iKWRKkv9ZTH_hBiZMUF+PkjVGOC53A@mail.gmail.com>
Date: Mon, 6 Oct 2025 15:10:54 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Xu Kuohai <xukuohai@...weicloud.com>
Cc: bpf@...r.kernel.org, linux-kselftest@...r.kernel.org,
linux-kernel@...r.kernel.org, Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, Andrii Nakryiko <andrii@...nel.org>,
Martin KaFai Lau <martin.lau@...ux.dev>, Eduard Zingerman <eddyz87@...il.com>, Yonghong Song <yhs@...com>,
Song Liu <song@...nel.org>
Subject: Re: [PATCH bpf-next v2 1/3] bpf: Add overwrite mode for bpf ring buffer
On Mon, Sep 29, 2025 at 6:41 AM Xu Kuohai <xukuohai@...weicloud.com> wrote:
>
> On 9/20/2025 6:10 AM, Andrii Nakryiko wrote:
> > On Fri, Sep 5, 2025 at 8:13 AM Xu Kuohai <xukuohai@...weicloud.com> wrote:
> >>
> >> From: Xu Kuohai <xukuohai@...wei.com>
> >>
> >> When the bpf ring buffer is full, new events can not be recorded util
> >
> > typo: until
> >
>
> ACK
>
> >> the consumer consumes some events to free space. This may cause critical
> >> events to be discarded, such as in fault diagnostics, where recent events
> >> are more critical than older ones.
> >>
> >> So add ovewrite mode for bpf ring buffer. In this mode, the new event
> >
> > overwrite, BPF
> >
>
> ACK
>
> >> overwrites the oldest event when the buffer is full.
> >>
> >> The scheme is as follows:
> >>
> >> 1. producer_pos tracks the next position to write new data. When there
> >> is enough free space, producer simply moves producer_pos forward to
> >> make space for the new event.
> >>
> >> 2. To avoid waiting for consumer to free space when the buffer is full,
> >> a new variable overwrite_pos is introduced for producer. overwrite_pos
> >> tracks the next event to be overwritten (the oldest event committed) in
> >> the buffer. producer moves it forward to discard the oldest events when
> >> the buffer is full.
> >>
> >> 3. pending_pos tracks the oldest event under committing. producer ensures
> >
> > "under committing" is confusing. Oldest event to be committed?
> >
>
> Yes, 'the oldest event to be committed'. Thanks!
>
> >> producer_pos never passes pending_pos when making space for new events.
> >> So multiple producers never write to the same position at the same time.
> >>
> >> 4. producer wakes up consumer every half a round ahead to give it a chance
> >> to retrieve data. However, for an overwrite-mode ring buffer, users
> >> typically only care about the ring buffer snapshot before a fault occurs.
> >> In this case, the producer should commit data with BPF_RB_NO_WAKEUP flag
> >> to avoid unnecessary wakeups.
> >>
> >> To make it clear, here are some example diagrams.
> >>
> >> 1. Let's say we have a ring buffer with size 4096.
> >>
> >> At first, {producer,overwrite,pending,consumer}_pos are all set to 0
> >>
> >> 0 512 1024 1536 2048 2560 3072 3584 4096
> >> +-----------------------------------------------------------------------+
> >> | |
> >> | |
> >> | |
> >> +-----------------------------------------------------------------------+
> >> ^
> >> |
> >> |
> >> producer_pos = 0
> >> overwrite_pos = 0
> >> pending_pos = 0
> >> consumer_pos = 0
> >>
> >> 2. Reserve event A, size 512.
> >>
> >> There is enough free space, so A is allocated at offset 0 and producer_pos
> >> is moved to 512, the end of A. Since A is not submitted, the BUSY bit is
> >> set.
> >>
> >> 0 512 1024 1536 2048 2560 3072 3584 4096
> >> +-----------------------------------------------------------------------+
> >> | | |
> >> | A | |
> >> | [BUSY] | |
> >> +-----------------------------------------------------------------------+
> >> ^ ^
> >> | |
> >> | |
> >> | producer_pos = 512
> >> |
> >> overwrite_pos = 0
> >> pending_pos = 0
> >> consumer_pos = 0
> >>
> >> 3. Reserve event B, size 1024.
> >>
> >> B is allocated at offset 512 with BUSY bit set, and producer_pos is moved
> >> to the end of B.
> >>
> >> 0 512 1024 1536 2048 2560 3072 3584 4096
> >> +-----------------------------------------------------------------------+
> >> | | | |
> >> | A | B | |
> >> | [BUSY] | [BUSY] | |
> >> +-----------------------------------------------------------------------+
> >> ^ ^
> >> | |
> >> | |
> >> | producer_pos = 1536
> >> |
> >> overwrite_pos = 0
> >> pending_pos = 0
> >> consumer_pos = 0
> >>
> >> 4. Reserve event C, size 2048.
> >>
> >> C is allocated at offset 1536 and producer_pos becomes 3584.
> >>
> >> 0 512 1024 1536 2048 2560 3072 3584 4096
> >> +-----------------------------------------------------------------------+
> >> | | | | |
> >> | A | B | C | |
> >> | [BUSY] | [BUSY] | [BUSY] | |
> >> +-----------------------------------------------------------------------+
> >> ^ ^
> >> | |
> >> | |
> >> | producer_pos = 3584
> >> |
> >> overwrite_pos = 0
> >> pending_pos = 0
> >> consumer_pos = 0
> >>
> >> 5. Submit event A.
> >>
> >> The BUSY bit of A is cleared. B becomes the oldest event under writing, so
> >
> > Now it's "under writing" :) To be committed? Or "pending committing"
> > or just "pending", I guess. But not under anything, it just confuses
> > readers. IMO.
> >
>
> Once again, 'oldest event to be committed'.
>
> I should check it with an AI agent first.
>
> >> pending_pos is moved to 512, the start of B.
> >>
> >> 0 512 1024 1536 2048 2560 3072 3584 4096
> >> +-----------------------------------------------------------------------+
> >> | | | | |
> >> | A | B | C | |
> >> | | [BUSY] | [BUSY] | |
> >> +-----------------------------------------------------------------------+
> >> ^ ^ ^
> >> | | |
> >> | | |
> >> | pending_pos = 512 producer_pos = 3584
> >> |
> >> overwrite_pos = 0
> >> consumer_pos = 0
> >>
> >> 6. Submit event B.
> >>
> >> The BUSY bit of B is cleared, and pending_pos is moved to the start of C,
> >> which is the oldest event under writing now.
> >
> > ditto
> >
>
> Again and again :(
>
> >>
> >> 0 512 1024 1536 2048 2560 3072 3584 4096
> >> +-----------------------------------------------------------------------+
> >> | | | | |
> >> | A | B | C | |
> >> | | | [BUSY] | |
> >> +-----------------------------------------------------------------------+
> >> ^ ^ ^
> >> | | |
> >> | | |
> >> | pending_pos = 1536 producer_pos = 3584
> >> |
> >> overwrite_pos = 0
> >> consumer_pos = 0
> >>
> >> 7. Reserve event D, size 1536 (3 * 512).
> >>
> >> There are 2048 bytes not under writing between producer_pos and pending_pos,
> >> so D is allocated at offset 3584, and producer_pos is moved from 3584 to
> >> 5120.
> >>
> >> Since event D will overwrite all bytes of event A and the begining 512 bytes
> >
> > typo: beginning, but really "first 512 bytes" would be clearer
> >
>
> OK, I’ll switch to 'first 512 bytes' for clarity.
>
> >> of event B, overwrite_pos is moved to the start of event C, the oldest event
> >> that is not overwritten.
> >>
> >> 0 512 1024 1536 2048 2560 3072 3584 4096
> >> +-----------------------------------------------------------------------+
> >> | | | | |
> >> | D End | | C | D Begin|
> >> | [BUSY] | | [BUSY] | [BUSY] |
> >> +-----------------------------------------------------------------------+
> >> ^ ^ ^
> >> | | |
> >> | | pending_pos = 1536
> >> | | overwrite_pos = 1536
> >> | |
> >> | producer_pos=5120
> >> |
> >> consumer_pos = 0
> >>
> >> 8. Reserve event E, size 1024.
> >>
> >> Though there are 512 bytes not under writing between producer_pos and
> >> pending_pos, E cannot be reserved, as it would overwrite the first 512
> >> bytes of event C, which is still under writing.
> >>
> >> 9. Submit events C and D.
> >>
> >> pending_pos is moved to the end of D.
> >>
> >> 0 512 1024 1536 2048 2560 3072 3584 4096
> >> +-----------------------------------------------------------------------+
> >> | | | | |
> >> | D End | | C | D Begin|
> >> | | | | |
> >> +-----------------------------------------------------------------------+
> >> ^ ^ ^
> >> | | |
> >> | | overwrite_pos = 1536
> >> | |
> >> | producer_pos=5120
> >> | pending_pos=5120
> >> |
> >> consumer_pos = 0
> >>
> >> The performance data for overwrite mode will be provided in a follow-up
> >> patch that adds overwrite mode benchs.
> >>
> >> A sample of performance data for non-overwrite mode on x86_64 and arm64
> >> CPUs, before and after this patch, is shown below. As we can see, no obvious
> >> performance regression occurs.
> >>
[...]
> >> Signed-off-by: Xu Kuohai <xukuohai@...wei.com>
> >> ---
> >> include/uapi/linux/bpf.h | 4 +
> >> kernel/bpf/ringbuf.c | 159 +++++++++++++++++++++++++++------
> >> tools/include/uapi/linux/bpf.h | 4 +
> >> 3 files changed, 141 insertions(+), 26 deletions(-)
> >>
[...]
> >> +
> >> + over_pos = READ_ONCE(rb->overwrite_pos);
> >> + return min(prod_pos - max(cons_pos, over_pos), rb->mask + 1);
> >
> > I'm trying to understand why you need to min with `rb->mask + 1`, can
> > you please elaborate?
>
>
> We need the min because rb->producer_pos and rb->overwrite_pos are read
> at different times. During this gap, a fast producer may wrap once or
> more, making over_pos larger than prod_pos.
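> For example (made-up numbers, 4096-byte buffer): we read
> prod_pos = 4096, then the producer wraps again and moves
> overwrite_pos to 8192 before we read it. prod_pos - over_pos then
> underflows to a huge unsigned value, and the min() caps the result
> at rb->mask + 1 = 4096.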
>
what if you read overwrite_pos before reading producer_pos? Then it
can't be larger than producer_pos and available data would be
producer_pos - max(consumer_pos, overwrite_pos)? would that work?
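
Something like this (totally untested sketch of what I mean, based on
the patched ringbuf_avail_data_sz()):

static unsigned long ringbuf_avail_data_sz(struct bpf_ringbuf *rb)
{
	unsigned long cons_pos, over_pos, prod_pos;

	cons_pos = smp_load_acquire(&rb->consumer_pos);
	/* reading overwrite_pos before producer_pos guarantees that
	 * over_pos can never be ahead of the prod_pos value we see
	 */
	over_pos = smp_load_acquire(&rb->overwrite_pos);
	prod_pos = smp_load_acquire(&rb->producer_pos);

	return prod_pos - max(cons_pos, over_pos);
}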
> > And also, at least for consistency, use smp_load_acquire() for overwrite_pos?
> >
>
> Using READ_ONCE here is to stay symmetric with __bpf_ringbuf_reserve(),
> where overwrite_pos is written with WRITE_ONCE() first, followed by
> smp_store_release() on producer_pos. So here we do
> smp_load_acquire(producer_pos) first, then READ_ONCE(overwrite_pos),
> to ensure a consistent view of the ring buffer.
>
> For consistency when reading consumer_pos and producer_pos, I’m fine with
> switching READ_ONCE to smp_load_acquire for overwrite_pos.
>
I'm not sure it matters much, but this function is called outside of
rb->spinlock, while __bpf_ringbuf_reserve() does hold a lock while
doing that WRITE_ONCE(). So it might not make any difference, but I
have a mild preference for smp_load_acquire() here.
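
For reference, the ordering pair being discussed is roughly this
(illustrative only, local variable names made up):

/* producer side, in __bpf_ringbuf_reserve(), under rb->spinlock */
WRITE_ONCE(rb->overwrite_pos, new_over_pos);
smp_store_release(&rb->producer_pos, new_prod_pos);

/* consumer side, no lock held */
prod_pos = smp_load_acquire(&rb->producer_pos);
over_pos = READ_ONCE(rb->overwrite_pos); /* or smp_load_acquire() */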
> >> }
> >>
> >> static u32 ringbuf_total_data_sz(const struct bpf_ringbuf *rb)
> >> @@ -402,11 +419,43 @@ bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr)
> >> return (void*)((addr & PAGE_MASK) - off);
> >> }
> >>
> >> +
[...]
> >> + /* In overwrite mode, move overwrite_pos to the next record to be
> >> + * overwritten if the ring buffer is full
> >> + */
> >
> > hm... here I think the important point is that we search for the next
> > record boundary up to which we need to overwrite data so that the
> > newly reserved record fits. "next record to be overwritten" isn't that
> > important (we might never need to overwrite it). What matters is
> > a) staying on record boundaries and b) consuming enough records to
> > fit the new one.
> >
> > Can you please update the comment to mention the above points?
> >
>
> Sure, I'll update the comment to:
>
> In overwrite mode, advance overwrite_pos when the ring buffer is full.
> The key points are to stay on record boundaries and consume enough
> records to fit the new one.
>
ok
[...]
>
> >> + unsigned long rec_pos,
> >> + unsigned long cons_pos,
> >> + u32 len, u64 flags)
> >> +{
> >> + unsigned long rec_end;
> >> +
> >> + if (flags & BPF_RB_FORCE_WAKEUP)
> >> + return true;
> >> +
> >> + if (flags & BPF_RB_NO_WAKEUP)
> >> + return false;
> >> +
> >> + /* for non-overwrite mode, if consumer caught up and is waiting for
> >> + * our record, notify about new data availability
> >> + */
> >> + if (likely(!rb->overwrite_mode))
> >> + return cons_pos == rec_pos;
> >> +
> >> + /* for overwrite mode, to give the consumer a chance to catch up
> >> + * before being overwritten, wake up consumer every half a round
> >> + * ahead.
> >> + */
> >> + rec_end = rec_pos + ringbuf_round_up_hdr_len(len);
> >> +
> >> + cons_pos &= (rb->mask >> 1);
> >> + rec_pos &= (rb->mask >> 1);
> >> + rec_end &= (rb->mask >> 1);
> >> +
> >> + if (cons_pos == rec_pos)
> >> + return true;
> >> +
> >> + if (rec_pos < cons_pos && cons_pos < rec_end)
> >> + return true;
> >> +
> >> + if (rec_end < rec_pos && (cons_pos > rec_pos || cons_pos < rec_end))
> >> + return true;
> >> +
> >
> > hm... ok, let's discuss this. Why do we need to do some half-round
> > heuristic for overwrite mode? If a consumer is falling behind, it
> > should be actively trying to catch up and doesn't need a notification
> > (that's the non-overwrite mode logic already).
> >
> > So there is more to this than a brief comment you left, can you please
> > elaborate?
> >
>
> The half-round wakeup was originally intended to work with libbpf in the
> v1 version. In that version, libbpf used a retry loop to safely copy data
> from the ring buffer that hadn’t been overwritten. By waking the consumer
> once every half round, there was always a period where the consumer and
> producer did not overlap, which helped reduce the number of retries.
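> As a concrete example (made-up numbers): with a 4096-byte buffer,
> mask >> 1 is 2047. If cons_pos = 512 and a record is committed at
> rec_pos = 2560, then (2560 & 2047) == (512 & 2047), i.e. the producer
> is exactly half a round ahead of the consumer, so a wakeup is sent.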
I can't say I completely grok the logic here, but do you think we
should still keep this half-round wakeup? It looks like an arbitrary
heuristic, so I'd rather not have it.
>
> > pw-bot: cr
> >
> >> + return false;
> >> +}
> >> +
> >> +static __always_inline
> >
> > we didn't have always_inline before, any strong reason to add it now?
> >
>
> I just wanted to avoid introducing any performance regression. Before this
> patch, bpf_ringbuf_commit() was automatically inlined by the compiler, but
> after the patch it wasn’t, so I added always_inline explicitly to keep it
> inlined.
how big of a difference was it in benchmarks? Using __always_inline
without a good reason is generally frowned upon.
[...]