Message-ID: <CAEf4BzYAUDFH7z9x-+vkzkHD0pSG6M264yQoMCGkJxg3mFvcYA@mail.gmail.com>
Date: Mon, 13 Oct 2025 16:22:52 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Xu Kuohai <xukuohai@...weicloud.com>
Cc: bpf@...r.kernel.org, linux-kselftest@...r.kernel.org,
linux-kernel@...r.kernel.org, Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, Andrii Nakryiko <andrii@...nel.org>,
Martin KaFai Lau <martin.lau@...ux.dev>, Eduard Zingerman <eddyz87@...il.com>, Yonghong Song <yhs@...com>,
Song Liu <song@...nel.org>
Subject: Re: [PATCH bpf-next v2 1/3] bpf: Add overwrite mode for bpf ring buffer
On Fri, Oct 10, 2025 at 12:07 AM Xu Kuohai <xukuohai@...weicloud.com> wrote:
>
> On 10/7/2025 6:10 AM, Andrii Nakryiko wrote:
>
> [...]
>
> >>>> +
> >>>> + over_pos = READ_ONCE(rb->overwrite_pos);
> >>>> + return min(prod_pos - max(cons_pos, over_pos), rb->mask + 1);
> >>>
> >>> I'm trying to understand why you need to min with `rb->mask + 1`, can
> >>> you please elaborate?
> >>
> >>
> >> We need the min because rb->producer_pos and rb->overwrite_pos are read
> >> at different times. During this gap, a fast producer may wrap once or
> >> more, making over_pos larger than prod_pos.
> >>
> >
> > what if you read overwrite_pos before reading producer_pos? Then it
> > can't be larger than producer_pos and available data would be
> > producer_pos - max(consumer_pos, overwrite_pos)? would that work?
> >
>
> No, it won’t work. Between reading overwrite_pos and producer_pos, a producer
> on a different CPU may have already moved producer_pos forward by more than
> one ring buffer size, causing prod_pos - max(cons_pos, over_pos) to exceed
> the ring buffer size.
True, but that was the case with this function before as well.
ringbuf_avail_data_sz() gives an estimate; we just need to make sure
not to return a negative value. We didn't artificially cap the return
value to the ring buffer size before, so why start now? All of this
calculation is done outside of the lock anyway, so it can never be
relied upon for exactness.
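
To make that point concrete, here is a minimal sketch of the variant being
discussed (the struct fields follow the existing kernel/bpf/ringbuf.c; the
helper body itself is illustrative, not the actual patch):

static size_t ringbuf_avail_data_sz(struct bpf_ringbuf *rb)
{
        unsigned long cons_pos, prod_pos, over_pos;

        cons_pos = smp_load_acquire(&rb->consumer_pos);
        /* read overwrite_pos before producer_pos so the value seen here
         * cannot be ahead of the producer_pos value read below
         */
        over_pos = smp_load_acquire(&rb->overwrite_pos);
        prod_pos = smp_load_acquire(&rb->producer_pos);

        /* an estimate only: this runs outside rb->spinlock and the
         * producer may keep moving, so the result can exceed the ring
         * size; the important part is that prod_pos is never behind
         * cons_pos or over_pos, so the subtraction cannot wrap around
         */
        return prod_pos - max(cons_pos, over_pos);
}
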
>
> >
> >>> And also, at least for consistency, use smp_load_acquire() for overwrite_pos?
> >>>
> >>
> >> Using READ_ONCE here is meant to stay symmetric with __bpf_ringbuf_reserve(),
> >> where overwrite_pos is written with WRITE_ONCE first, followed by
> >> smp_store_release(producer_pos). So here we do smp_load_acquire(producer_pos)
> >> first, then READ_ONCE(overwrite_pos), to ensure a consistent view of the
> >> ring buffer.
> >>
> >> For consistency when reading consumer_pos and producer_pos, I’m fine with
> >> switching READ_ONCE to smp_load_acquire for overwrite_pos.
> >>
> >
> > I'm not sure it matters much, but this function is called outside of
> > rb->spinlock, while __bpf_ringbuf_reserve() does hold the lock while
> > doing that WRITE_ONCE(). So it might not make any difference, but I
> > have a mild preference for smp_load_acquire() here.
> >
>
> OK, I'll switch to smp_load_acquire.
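
For anyone following along, the pairing being discussed looks roughly like
this (a sketch with assumed local variable names, not the patch itself; the
producer side runs under rb->spinlock):

/* producer side, in __bpf_ringbuf_reserve(), under rb->spinlock */
WRITE_ONCE(rb->overwrite_pos, new_over_pos);
smp_store_release(&rb->producer_pos, new_prod_pos);

/* consumer side, in ringbuf_avail_data_sz(), lockless */
prod_pos = smp_load_acquire(&rb->producer_pos);
over_pos = smp_load_acquire(&rb->overwrite_pos);  /* was READ_ONCE() in v2 */
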
>
> >>>> }
> >>>>
> >>>> static u32 ringbuf_total_data_sz(const struct bpf_ringbuf *rb)
> >>>> @@ -402,11 +419,43 @@ bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr)
> >>>> return (void*)((addr & PAGE_MASK) - off);
> >>>> }
> >>>>
> >>>> +
> >
> > [...]
> >
> >>>> + /* In overwrite mode, move overwrite_pos to the next record to be
> >>>> + * overwritten if the ring buffer is full
> >>>> + */
> >>>
> >>> hm... here I think the important point is that we search for the next
> >>> record boundary up to which we need to overwrite data so that the
> >>> newly reserved record fits. "next record to be overwritten" isn't that
> >>> important (we might never need to overwrite it). The important aspects
> >>> are a) staying on record boundaries and b) consuming enough records to
> >>> reserve the new one.
> >>>
> >>> Can you please update the comment to mention the above points?
> >>>
> >>
> >> Sure, I'll update the comment to:
> >>
> >> In overwrite mode, advance overwrite_pos when the ring buffer is full.
> >> The key points are to stay on record boundaries and consume enough
> >> records to fit the new one.
> >>
> >
> > ok
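
As a rough sketch, assuming new_prod_pos is the producer position after the
new reservation (the record header handling below is illustrative and not
copied from the patch), the advance amounts to:

struct bpf_ringbuf_hdr *hdr;
unsigned long over_pos = rb->overwrite_pos;
u32 rec_len;

/* walk forward record by record until the new record ending at
 * new_prod_pos fits into the buffer again; every step lands on a
 * record boundary
 */
while (new_prod_pos - over_pos > ringbuf_total_data_sz(rb)) {
        hdr = (struct bpf_ringbuf_hdr *)(rb->data + (over_pos & rb->mask));
        rec_len = READ_ONCE(hdr->len) & ~(BPF_RINGBUF_BUSY_BIT | BPF_RINGBUF_DISCARD_BIT);
        over_pos += ringbuf_round_up_hdr_len(rec_len);
}
WRITE_ONCE(rb->overwrite_pos, over_pos);
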
> >
> > [...]
> >
> >>
> >>>> + unsigned long rec_pos,
> >>>> + unsigned long cons_pos,
> >>>> + u32 len, u64 flags)
> >>>> +{
> >>>> + unsigned long rec_end;
> >>>> +
> >>>> + if (flags & BPF_RB_FORCE_WAKEUP)
> >>>> + return true;
> >>>> +
> >>>> + if (flags & BPF_RB_NO_WAKEUP)
> >>>> + return false;
> >>>> +
> >>>> + /* for non-overwrite mode, if consumer caught up and is waiting for
> >>>> + * our record, notify about new data availability
> >>>> + */
> >>>> + if (likely(!rb->overwrite_mode))
> >>>> + return cons_pos == rec_pos;
> >>>> +
> >>>> + /* for overwrite mode, to give the consumer a chance to catch up
> >>>> + * before being overwritten, wake up consumer every half a round
> >>>> + * ahead.
> >>>> + */
> >>>> + rec_end = rec_pos + ringbuf_round_up_hdr_len(len);
> >>>> +
> >>>> + cons_pos &= (rb->mask >> 1);
> >>>> + rec_pos &= (rb->mask >> 1);
> >>>> + rec_end &= (rb->mask >> 1);
> >>>> +
> >>>> + if (cons_pos == rec_pos)
> >>>> + return true;
> >>>> +
> >>>> + if (rec_pos < cons_pos && cons_pos < rec_end)
> >>>> + return true;
> >>>> +
> >>>> + if (rec_end < rec_pos && (cons_pos > rec_pos || cons_pos < rec_end))
> >>>> + return true;
> >>>> +
> >>>
> >>> hm... ok, let's discuss this. Why do we need some half-round heuristic
> >>> for overwrite mode? If a consumer is falling behind, it should be
> >>> actively trying to catch up and doesn't need a notification (that's the
> >>> non-overwrite mode logic already).
> >>>
> >>> So there is more to this than a brief comment you left, can you please
> >>> elaborate?
> >>>
> >>
> >> The half-round wakeup was originally intended to work with the libbpf side
> >> of v1. In that version, libbpf used a retry loop to safely copy data from
> >> the ring buffer that hadn’t been overwritten yet. Waking the consumer once
> >> every half round ensured there was always a period where the consumer and
> >> producer did not overlap, which helped reduce the number of retries.
> >
> > I can't say I completely grok the logic here, but do you think we
> > should still keep this half-round wakeup? It looks like an arbitrary
> > heuristic, so I'd rather not have it.
> >
>
> Sure, since the related libbpf code is no longer present, I’ll remove this
> logic in the next version.
>
> >>
> >>> pw-bot: cr
> >>>
> >>>> + return false;
> >>>> +}
> >>>> +
> >>>> +static __always_inline
> >>>
> >>> we didn't have always_inline before, any strong reason to add it now?
> >>>
> >>
> >> I just wanted to avoid introducing any performance regression. Before this
> >> patch, bpf_ringbuf_commit() was automatically inlined by the compiler, but
> >> after the patch it wasn’t, so I added always_inline explicitly to keep it
> >> inlined.
> >
> > how big of a difference was it in benchmarks? Using __always_inline
> > without a good reason is generally frowned upon.
> >
>
> The difference is not noticeable on my arm64 test machine, but it is on my
> AMD machine.
>
> Below is the benchmark data on AMD EPYC 9654, with and without always_inline
> attribute.
>
> - With always_inline
>
> Ringbuf, multi-producer contention
> ==================================
> rb-libbpf nr_prod 1 13.070 ± 0.158M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 2 15.440 ± 0.017M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 3 7.860 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 4 6.444 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 8 3.788 ± 0.005M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 12 2.802 ± 0.007M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 16 2.560 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 20 2.227 ± 0.006M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 24 2.141 ± 0.007M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 28 1.960 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 32 1.913 ± 0.004M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 36 1.854 ± 0.004M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 40 1.818 ± 0.004M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 44 1.779 ± 0.004M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 48 1.758 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 52 1.812 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>
> - Without always_inline
>
> Ringbuf, multi-producer contention
> ==================================
> rb-libbpf nr_prod 1 10.550 ± 0.032M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 2 14.661 ± 0.024M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 3 7.616 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 4 6.476 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 8 3.806 ± 0.004M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 12 2.814 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 16 2.608 ± 0.004M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 20 2.337 ± 0.005M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 24 2.270 ± 0.004M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 28 1.977 ± 0.004M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 32 1.921 ± 0.004M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 36 1.862 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 40 1.827 ± 0.004M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 44 1.912 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 48 1.860 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 52 1.824 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>
> When nr_prod=1, the performance regression is significant, dropping from
> 13.070 ± 0.158 M/s with always_inline to 10.550 ± 0.032 M/s without it.
>
> However, since the half-round wakeup logic will be removed in the next
> version, the changes to bpf_ringbuf_commit, including always_inline, will
> also be removed.
Ok, thanks for the data! Please send an updated version of the code,
and let's have another round of review, thanks!
>
> > [...]
>