linux-kernel - Re: [PATCH bpf-next 1/4] bpf: Add overwrite mode for bpf ring buffer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <27a11141-8e0d-498a-bc27-318d108161c8@jordanrome.com>
Date: Wed, 13 Aug 2025 09:22:57 -0400
From: Jordan Rome <linux@...danrome.com>
To: Xu Kuohai <xukuohai@...weicloud.com>,
 Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: bpf <bpf@...r.kernel.org>,
 "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest@...r.kernel.org>,
 LKML <linux-kernel@...r.kernel.org>, Alexei Starovoitov <ast@...nel.org>,
 Daniel Borkmann <daniel@...earbox.net>, Andrii Nakryiko <andrii@...nel.org>,
 Martin KaFai Lau <martin.lau@...ux.dev>, Eduard Zingerman
 <eddyz87@...il.com>, Yonghong Song <yhs@...com>, Song Liu <song@...nel.org>,
 John Fastabend <john.fastabend@...il.com>, KP Singh <kpsingh@...nel.org>,
 Stanislav Fomichev <sdf@...gle.com>, Hao Luo <haoluo@...gle.com>,
 Jiri Olsa <jolsa@...nel.org>, Mykola Lysenko <mykolal@...com>,
 Shuah Khan <shuah@...nel.org>, Stanislav Fomichev <sdf@...ichev.me>,
 Willem de Bruijn <willemb@...gle.com>, Jason Xing
 <kerneljasonxing@...il.com>, Paul Chaignon <paul.chaignon@...il.com>,
 Tao Chen <chen.dylane@...ux.dev>, Kumar Kartikeya Dwivedi
 <memxor@...il.com>, Martin Kelly <martin.kelly@...wdstrike.com>
Subject: Re: [PATCH bpf-next 1/4] bpf: Add overwrite mode for bpf ring buffer


On 8/12/25 12:02 AM, Xu Kuohai wrote:
> On 8/9/2025 5:39 AM, Alexei Starovoitov wrote:
>> On Sun, Aug 3, 2025 at 7:27 PM Xu Kuohai <xukuohai@...weicloud.com> 
>> wrote:
>>>
>>> From: Xu Kuohai <xukuohai@...wei.com>
>>>
>>> When the bpf ring buffer is full, new events can not be recorded util
>>> the consumer consumes some events to free space. This may cause 
>>> critical
>>> events to be discarded, such as in fault diagnostic, where recent 
>>> events
>>> are more critical than older ones.
>>>
>>> So add ovewrite mode for bpf ring buffer. In this mode, the new event
>>> overwrites the oldest event when the buffer is full.
>>>
>>> The scheme is as follows:
>>>
>>> 1. producer_pos tracks the next position to write new data. When there
>>>     is enough free space, producer simply moves producer_pos forward to
>>>     make space for the new event.
>>>
>>> 2. To avoid waiting for consumer to free space when the buffer is full,
>>>     a new variable overwrite_pos is introduced for producer. 
>>> overwrite_pos
>>>     tracks the next event to be overwritten (the oldest event 
>>> committed) in
>>>     the buffer. producer moves it forward to discard the oldest 
>>> events when
>>>     the buffer is full.
>>>
>>> 3. pending_pos tracks the oldest event under committing. producer 
>>> ensures
>>>     producers_pos never passes pending_pos when making space for new 
>>> events.
>>>     So multiple producers never write to the same position at the 
>>> same time.
>>>
>>> 4. producer wakes up consumer every half a round ahead to give it a 
>>> chance
>>>     to retrieve data. However, for an overwrite-mode ring buffer, users
>>>     typically only cares about the ring buffer snapshot before a 
>>> fault occurs.
>>>     In this case, the producer should commit data with 
>>> BPF_RB_NO_WAKEUP flag
>>>     to avoid unnecessary wakeups.
>>
>> If I understand it correctly the algorithm requires all events to be 
>> the same
>> size otherwise first overwrite might trash the header,
>> also the producers should use some kind of signaling to
>> timestamp each event otherwise it all will look out of order to the 
>> consumer.
>>
>> At the end it looks inferior to the existing perf ring buffer with 
>> overwrite.
>> Since in both cases the out of order needs to be dealt with
>> in post processing the main advantage of ring buf vs perf buf is gone.
>
> No, the advantage is not gone.
>
> The ring buffer is still shared by multiple producers. When an event 
> occurs,
> the producer queues up to acquire the spin lock of the ring buffer to 
> write
> event to it. So events in the ring buffer are always ordered, no out 
> of order
> occurs.
>
> And events are not required to be the same size. When an overwrite 
> happens,
> the events bing trashed are discared, and the overwrite_pos is moved 
> forward
> to skip these events until it reaches the first event that is not 
> trashed.
>
> To make it clear, here are some example diagrams.
>
> 1. Let's say we have a ring buffer with size 4096.
>
>    At first, {producer,overwrite,pending,consumer}_pos are all set to 0
>
>    0       512      1024    1536     2048     2560     3072 3584       
> 4096
> +-----------------------------------------------------------------------+
> | |
> | |
> | |
> +-----------------------------------------------------------------------+
>    ^
>    |
>    |
> producer_pos = 0
> overwrite_pos = 0
> pending_pos = 0
> consumer_pos = 0
>
> 2. Reserve event A, size 512.
>
>    There is enough free space, so A is allocated at offset 0 and 
> producer_pos
>    is moved to 512, the end of A. Since A is not submitted, the BUSY 
> bit is
>    set.
>
>    0       512      1024    1536     2048     2560     3072 3584       
> 4096
> +-----------------------------------------------------------------------+
>    | |                                                              |
>    |   A |                                                              |
>    | [BUSY] 
> |                                                              |
> +-----------------------------------------------------------------------+
>    ^        ^
>    |        |
>    |        |
>    |    producer_pos = 512
>    |
> overwrite_pos = 0
> pending_pos = 0
> consumer_pos = 0
>
>
> 3. Reserve event B, size 1024.
>
>    B is allocated at offset 512 with BUSY bit set, and producer_pos is 
> moved
>    to the end of B.
>
>    0       512      1024    1536     2048     2560     3072 3584       
> 4096
> +-----------------------------------------------------------------------+
>    |        | |                                            |
>    |   A    |        B |                                            |
>    | [BUSY] |      [BUSY] |                                            |
> +-----------------------------------------------------------------------+
>    ^                          ^
>    |                          |
>    |                          |
>    |                   producer_pos = 1536
>    |
> overwrite_pos = 0
> pending_pos = 0
> consumer_pos = 0
>
> 4. Reserve event C, size 2048.
>
>    C is allocated at offset 1536 and producer_pos becomes 3584.
>
>    0       512      1024    1536     2048     2560     3072 3584       
> 4096
> +-----------------------------------------------------------------------+
>    |        |                 | |        |
>    |    A   |        B        |                 C |        |
>    | [BUSY] |      [BUSY]     |               [BUSY] |        |
> +-----------------------------------------------------------------------+
>    ^ ^
>    | |
>    | |
>    | producer_pos = 3584
>    |
> overwrite_pos = 0
> pending_pos = 0
> consumer_pos = 0
>
> 5. Submit event A.
>
>    The BUSY bit of A is cleared. B becomes the oldest event under 
> writing, so
>    pending_pos is moved to 512, the start of B.
>
>    0       512      1024    1536     2048     2560     3072 3584       
> 4096
> +-----------------------------------------------------------------------+
>    |        |                 | |        |
>    |    A   |        B        |                 C |        |
>    |        |      [BUSY]     |               [BUSY] |        |
> +-----------------------------------------------------------------------+
>    ^        ^ ^
>    |        | |
>    |        | |
>    |   pending_pos = 512 producer_pos = 3584
>    |
> overwrite_pos = 0
> consumer_pos = 0
>
> 6. Submit event B.
>
>    The BUSY bit of B is cleared, and pending_pos is moved to the start 
> of C,
>    which is the oldest event under writing now.
>
>    0       512      1024    1536     2048     2560     3072 3584       
> 4096
> +-----------------------------------------------------------------------+
>    |        |                 | |        |
>    |    A   |        B        |                 C |        |
>    |        |                 |               [BUSY] |        |
> +-----------------------------------------------------------------------+
>    ^                          ^ ^
>    |                          | |
>    |                          | |
>    |                     pending_pos = 1536 producer_pos = 3584
>    |
> overwrite_pos = 0
> consumer_pos = 0
>
> 7. Reserve event D, size 1536 (3 * 512).
>
>    There are 2048 bytes not under writing between producer_pos and 
> pending_pos,
>    so D is allocated at offset 3584, and producer_pos is moved from 
> 3584 to
>    5120.
>
>    Since event D will overwrite all bytes of event A and the begining 
> 512 bytes
>    of event B, overwrite_pos is moved to the start of event C, the 
> oldest event
>    that is not overwritten.
>
>    0       512      1024    1536     2048     2560     3072 3584       
> 4096
> +-----------------------------------------------------------------------+
>    |                 |        | |        |
>    |      D End      |        |                 C | D Begin|
>    |      [BUSY]     |        |               [BUSY] | [BUSY] |
> +-----------------------------------------------------------------------+
>    ^                 ^        ^
>    |                 |        |
>    |                 |   pending_pos = 1536
>    |                 |   overwrite_pos = 1536
>    |                 |
>    |             producer_pos=5120
>    |
> consumer_pos = 0
>
> 8. Reserve event E, size 1024.
>
>    Though there are 512 bytes not under writing between producer_pos and
>    pending_pos, E can not be reserved, as it would overwrite the first 
> 512
>    bytes of event C, which is still under writing.
>
> 9. Submit event C and D.
>
>    pending_pos is moved to the end of D.
>
>    0       512      1024    1536     2048     2560     3072 3584       
> 4096
> +-----------------------------------------------------------------------+
>    |                 |        | |        |
>    |      D End      |        |                 C | D Begin|
>    |                 |        | |        |
> +-----------------------------------------------------------------------+
>    ^                 ^        ^
>    |                 |        |
>    |                 |   overwrite_pos = 1536
>    |                 |
>    |             producer_pos=5120
>    |             pending_pos=5120
>    |
> consumer_pos = 0

These diagrams are very helpful in terms of understanding the flow.
In part 7 when A is overwritten by D, why doesn't the consumer position 
move forward to
point to the beginning of C? If the ring buffer producer guarantees 
ordering of reserved
slots then C, in this case, is now the oldest reserved. This speaks to 
your second patch
where you say that the consumer resolves conflicts by discarding data 
that has been
overwritten but I feel like the simpler thing to do is just move the 
consumer position.