Message-ID: <445861b0-b79b-425f-b230-707c17093e70@huaweicloud.com>
Date: Tue, 14 Oct 2025 21:35:18 +0800
From: Xu Kuohai <xukuohai@...weicloud.com>
To: Andrii Nakryiko <andrii.nakryiko@...il.com>
Cc: bpf@...r.kernel.org, linux-kselftest@...r.kernel.org,
linux-kernel@...r.kernel.org, Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, Andrii Nakryiko <andrii@...nel.org>,
Martin KaFai Lau <martin.lau@...ux.dev>, Eduard Zingerman
<eddyz87@...il.com>, Yonghong Song <yhs@...com>, Song Liu <song@...nel.org>
Subject: Re: [PATCH bpf-next v2 1/3] bpf: Add overwrite mode for bpf ring
buffer
On 10/14/2025 7:22 AM, Andrii Nakryiko wrote:
> On Fri, Oct 10, 2025 at 12:07 AM Xu Kuohai <xukuohai@...weicloud.com> wrote:
>>
>> On 10/7/2025 6:10 AM, Andrii Nakryiko wrote:
>>
>> [...]
>>
>>>>>> +
>>>>>> + over_pos = READ_ONCE(rb->overwrite_pos);
>>>>>> + return min(prod_pos - max(cons_pos, over_pos), rb->mask + 1);
>>>>>
>>>>> I'm trying to understand why you need to min with `rb->mask + 1`, can
>>>>> you please elaborate?
>>>>
>>>>
>>>> We need the min because rb->producer_pos and rb->overwrite_pos are read
>>>> at different times. During this gap, a fast producer may wrap once or
>>>> more, making over_pos larger than prod_pos.
>>>>
>>>
>>> what if you read overwrite_pos before reading producer_pos? Then it
>>> can't be larger than producer_pos and available data would be
>>> producer_pos - max(consumer_pos, overwrite_pos)? would that work?
>>>
>>
>> No, it won’t work. Between reading overwrite_pos and producer_pos, a
>> producer on a different CPU may have already moved producer_pos forward
>> by more than one ring buffer size, causing prod_pos - max(cons_pos,
>> over_pos) to exceed the ring buffer size.
>
> True, but that was the case with this function before as well.
> ringbuf_avail_data_sz() gives an estimate, we just need to make
> sure not to return a negative value. We didn't artificially cap the
> return to ring buf size before, why start now? All of this
> calculation is done outside of the lock anyway, so it can never be
> relied upon for exactness.
>
Makes sense, I'll switch to producer_pos - max(consumer_pos, overwrite_pos)
and drop the cap at the ring buffer size.
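
Roughly, the function would then look like this (just a sketch of what I
plan for v3, not final code; overwrite_mode and overwrite_pos are the
fields added by this patch):

	static unsigned long ringbuf_avail_data_sz(struct bpf_ringbuf *rb)
	{
		unsigned long cons_pos, prod_pos, over_pos;

		cons_pos = smp_load_acquire(&rb->consumer_pos);
		prod_pos = smp_load_acquire(&rb->producer_pos);
		if (likely(!rb->overwrite_mode))
			return prod_pos - cons_pos;

		/* estimate only: positions are sampled at different times,
		 * so the result may briefly exceed the ring buffer size
		 */
		over_pos = smp_load_acquire(&rb->overwrite_pos);
		return prod_pos - max(cons_pos, over_pos);
	}
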
>>
>>>
>>>>> And also, at least for consistency, use smp_load_acquire() for overwrite_pos?
>>>>>
>>>>
>>>> READ_ONCE is used here to stay symmetric with __bpf_ringbuf_reserve(),
>>>> where overwrite_pos is written with WRITE_ONCE first, followed by
>>>> smp_store_release(producer_pos). So here we do smp_load_acquire(producer_pos)
>>>> first, then READ_ONCE(overwrite_pos), to ensure a consistent view of the
>>>> ring buffer.
>>>>
>>>> For consistency when reading consumer_pos and producer_pos, I’m fine with
>>>> switching READ_ONCE to smp_load_acquire for overwrite_pos.
>>>>
>>>
>>> I'm not sure it matters much, but this function is called outside of
>>> rb->spinlock, while __bpf_ringbuf_reserve() does hold a lock while
>>> doing that WRITE_ONCE(). So it might not make any difference, but I
>>> have mild preference for smp_load_acquire() here.
>>>
>>
>> OK, I'll switch to smp_load_acquire.
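
To spell out the pairing (illustration only; over_pos and new_prod_pos
are placeholder names):

	/* producer, in __bpf_ringbuf_reserve(), under rb->spinlock */
	WRITE_ONCE(rb->overwrite_pos, over_pos);
	/* pairs with smp_load_acquire(&rb->producer_pos) on the reader side */
	smp_store_release(&rb->producer_pos, new_prod_pos);

	/* reader, in ringbuf_avail_data_sz() */
	prod_pos = smp_load_acquire(&rb->producer_pos);
	over_pos = smp_load_acquire(&rb->overwrite_pos);
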
>>
>>>>>> }
>>>>>>
>>>>>> static u32 ringbuf_total_data_sz(const struct bpf_ringbuf *rb)
>>>>>> @@ -402,11 +419,43 @@ bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr)
>>>>>> return (void*)((addr & PAGE_MASK) - off);
>>>>>> }
>>>>>>
>>>>>> +
>>>
>>> [...]
>>>
>>>>>> + /* In overwrite mode, move overwrite_pos to the next record to be
>>>>>> + * overwritten if the ring buffer is full
>>>>>> + */
>>>>>
>>>>> hm... here I think the important point is that we search for the next
>>>>> record boundary until which we need to overwrite data such that it
>>>>> fits newly reserved record. "next record to be overwritten" isn't that
>>>>> important (we might never need to overwrite it). Important are those
>>>>> aspects of a) staying on record boundary and b) consuming enough
>>>>> records to reserve the new one.
>>>>>
>>>>> Can you please update the comment to mention the above points?
>>>>>
>>>>
>>>> Sure, I'll update the comment to:
>>>>
>>>> In overwrite mode, advance overwrite_pos when the ring buffer is full.
>>>> The key points are to stay on record boundaries and consume enough
>>>> records to fit the new one.
>>>>
>>>
>>> ok
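
For reference, the boundary walk the comment describes is roughly this
(sketch based on the patch; the exact loop in v3 may differ, and the
busy/discard bits in hdr->len are ignored here for brevity):

	struct bpf_ringbuf_hdr *hdr;

	/* stay on record boundaries: step over whole records until the
	 * newly reserved record fits within one ring buffer size
	 */
	while (new_prod_pos - over_pos > rb->mask + 1) {
		hdr = (void *)rb->data + (over_pos & rb->mask);
		over_pos += ringbuf_round_up_hdr_len(hdr->len);
	}
	WRITE_ONCE(rb->overwrite_pos, over_pos);
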
>>>
>>> [...]
>>>
>>>>
>>>>>> + unsigned long rec_pos,
>>>>>> + unsigned long cons_pos,
>>>>>> + u32 len, u64 flags)
>>>>>> +{
>>>>>> + unsigned long rec_end;
>>>>>> +
>>>>>> + if (flags & BPF_RB_FORCE_WAKEUP)
>>>>>> + return true;
>>>>>> +
>>>>>> + if (flags & BPF_RB_NO_WAKEUP)
>>>>>> + return false;
>>>>>> +
>>>>>> + /* for non-overwrite mode, if consumer caught up and is waiting for
>>>>>> + * our record, notify about new data availability
>>>>>> + */
>>>>>> + if (likely(!rb->overwrite_mode))
>>>>>> + return cons_pos == rec_pos;
>>>>>> +
>>>>>> + /* for overwrite mode, to give the consumer a chance to catch up
>>>>>> + * before being overwritten, wake up consumer every half a round
>>>>>> + * ahead.
>>>>>> + */
>>>>>> + rec_end = rec_pos + ringbuf_round_up_hdr_len(len);
>>>>>> +
>>>>>> + cons_pos &= (rb->mask >> 1);
>>>>>> + rec_pos &= (rb->mask >> 1);
>>>>>> + rec_end &= (rb->mask >> 1);
>>>>>> +
>>>>>> + if (cons_pos == rec_pos)
>>>>>> + return true;
>>>>>> +
>>>>>> + if (rec_pos < cons_pos && cons_pos < rec_end)
>>>>>> + return true;
>>>>>> +
>>>>>> + if (rec_end < rec_pos && (cons_pos > rec_pos || cons_pos < rec_end))
>>>>>> + return true;
>>>>>> +
>>>>>
>>>>> hm... ok, let's discuss this. Why do we need to do some half-round
>>>>> heuristic for overwrite mode? If a consumer is falling behind it
>>>>> should be actively trying to catch up and they don't need notification
>>>>> (that's the non-overwrite mode logic already).
>>>>>
>>>>> So there is more to this than a brief comment you left, can you please
>>>>> elaborate?
>>>>>
>>>>
>>>> The half-round wakeup was originally intended to work with the libbpf
>>>> changes in v1. In that version, libbpf used a retry loop to safely copy
>>>> data from the ring buffer that hadn’t been overwritten. By waking the
>>>> consumer once every half round, there was always a period where the
>>>> consumer and producer did not overlap, which helped reduce the number
>>>> of retries.
>>>
>>> I can't say I completely grok the logic here, but do you think we
>>> should still keep this half-round wakeup? It looks like an arbitrary
>>> heuristic, so I'd rather not have it.
>>>
>>
>> Sure, since the related libbpf code is no longer present, I’ll remove this
>> logic in the next version.
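
For reference, what remains of the check without the heuristic is just
the pre-existing behavior already shown above (flags handling plus the
catch-up test):

	if (flags & BPF_RB_FORCE_WAKEUP)
		return true;

	if (flags & BPF_RB_NO_WAKEUP)
		return false;

	/* notify only if the consumer has caught up to this record */
	return cons_pos == rec_pos;
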
>>
>>>>
>>>>> pw-bot: cr
>>>>>
>>>>>> + return false;
>>>>>> +}
>>>>>> +
>>>>>> +static __always_inline
>>>>>
>>>>> we didn't have always_inline before, any strong reason to add it now?
>>>>>
>>>>
>>>> I just wanted to avoid introducing any performance regression. Before this
>>>> patch, bpf_ringbuf_commit() was automatically inlined by the compiler, but
>>>> after the patch it wasn’t, so I added always_inline explicitly to keep it
>>>> inlined.
>>>
>>> how big of a difference was it in benchmarks? Using __always_inline
>>> without a good reason is generally frowned upon.
>>>
>>
>> The difference is not noticeable on my arm64 test machine, but it is on my
>> AMD machine.
>>
>> Below is the benchmark data on an AMD EPYC 9654, with and without the
>> always_inline attribute.
>>
>> - With always_inline
>>
>> Ringbuf, multi-producer contention
>> ==================================
>> rb-libbpf nr_prod 1 13.070 ± 0.158M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 2 15.440 ± 0.017M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 3 7.860 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 4 6.444 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 8 3.788 ± 0.005M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 12 2.802 ± 0.007M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 16 2.560 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 20 2.227 ± 0.006M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 24 2.141 ± 0.007M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 28 1.960 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 32 1.913 ± 0.004M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 36 1.854 ± 0.004M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 40 1.818 ± 0.004M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 44 1.779 ± 0.004M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 48 1.758 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 52 1.812 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>>
>> - Without always_inline
>>
>> Ringbuf, multi-producer contention
>> ==================================
>> rb-libbpf nr_prod 1 10.550 ± 0.032M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 2 14.661 ± 0.024M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 3 7.616 ± 0.002M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 4 6.476 ± 0.002M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 8 3.806 ± 0.004M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 12 2.814 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 16 2.608 ± 0.004M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 20 2.337 ± 0.005M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 24 2.270 ± 0.004M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 28 1.977 ± 0.004M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 32 1.921 ± 0.004M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 36 1.862 ± 0.002M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 40 1.827 ± 0.004M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 44 1.912 ± 0.002M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 48 1.860 ± 0.002M/s (drops 0.000 ± 0.000M/s)
>> rb-libbpf nr_prod 52 1.824 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>>
>> When nr_prod=1, the performance regression is significant, dropping from
>> 13.070 ± 0.158 M/s with always_inline to 10.550 ± 0.032 M/s without it.
>>
>> However, since the half-round wakeup logic will be removed in the next
>> version, the changes to bpf_ringbuf_commit, including always_inline, will
>> also be removed.
>
> Ok, thanks for the data! Please send an updated version of the code,
> and let's have another round of review, thanks!
>
Sure, I’ll send the next version later this week.
>>
>>> [...]
>>
>