linux-kernel - Re: [PATCH bpf-next 1/4] bpf: Add overwrite mode for bpf ring buffer

Message-ID: <1f1d98bc-2243-44c9-94e3-3594d19ea313@huaweicloud.com>
Date: Thu, 14 Aug 2025 21:59:52 +0800
From: Xu Kuohai <xukuohai@...weicloud.com>
To: Jordan Rome <linux@...danrome.com>,
 Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: bpf <bpf@...r.kernel.org>,
 "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest@...r.kernel.org>,
 LKML <linux-kernel@...r.kernel.org>, Alexei Starovoitov <ast@...nel.org>,
 Daniel Borkmann <daniel@...earbox.net>, Andrii Nakryiko <andrii@...nel.org>,
 Martin KaFai Lau <martin.lau@...ux.dev>, Eduard Zingerman
 <eddyz87@...il.com>, Yonghong Song <yhs@...com>, Song Liu <song@...nel.org>,
 John Fastabend <john.fastabend@...il.com>, KP Singh <kpsingh@...nel.org>,
 Stanislav Fomichev <sdf@...gle.com>, Hao Luo <haoluo@...gle.com>,
 Jiri Olsa <jolsa@...nel.org>, Mykola Lysenko <mykolal@...com>,
 Shuah Khan <shuah@...nel.org>, Stanislav Fomichev <sdf@...ichev.me>,
 Willem de Bruijn <willemb@...gle.com>, Jason Xing
 <kerneljasonxing@...il.com>, Paul Chaignon <paul.chaignon@...il.com>,
 Tao Chen <chen.dylane@...ux.dev>, Kumar Kartikeya Dwivedi
 <memxor@...il.com>, Martin Kelly <martin.kelly@...wdstrike.com>
Subject: Re: [PATCH bpf-next 1/4] bpf: Add overwrite mode for bpf ring buffer

On 8/13/2025 9:22 PM, Jordan Rome wrote:
> 
> On 8/12/25 12:02 AM, Xu Kuohai wrote:
>> On 8/9/2025 5:39 AM, Alexei Starovoitov wrote:
>>> On Sun, Aug 3, 2025 at 7:27 PM Xu Kuohai <xukuohai@...weicloud.com> wrote:
>>>>
>>>> From: Xu Kuohai <xukuohai@...wei.com>
>>>>
>>>> When the bpf ring buffer is full, new events cannot be recorded until
>>>> the consumer consumes some events to free space. This may cause critical
>>>> events to be discarded, such as in fault diagnosis, where recent events
>>>> are more critical than older ones.
>>>>
>>>> So add overwrite mode for bpf ring buffer. In this mode, the new event
>>>> overwrites the oldest event when the buffer is full.
>>>>
>>>> The scheme is as follows:
>>>>
>>>> 1. producer_pos tracks the next position to write new data. When there
>>>>     is enough free space, producer simply moves producer_pos forward to
>>>>     make space for the new event.
>>>>
>>>> 2. To avoid waiting for consumer to free space when the buffer is full,
>>>>     a new variable overwrite_pos is introduced for producer. overwrite_pos
>>>>     tracks the next event to be overwritten (the oldest event committed) in
>>>>     the buffer. producer moves it forward to discard the oldest events when
>>>>     the buffer is full.
>>>>
>>>> 3. pending_pos tracks the oldest event still being committed. producer
>>>>     ensures producer_pos never passes pending_pos when making space for
>>>>     new events, so multiple producers never write to the same position at
>>>>     the same time.
>>>>
>>>> 4. producer wakes up consumer every half a round ahead to give it a chance
>>>>     to retrieve data. However, for an overwrite-mode ring buffer, users
>>>>     typically only care about the ring buffer snapshot before a fault occurs.
>>>>     In this case, the producer should commit data with BPF_RB_NO_WAKEUP flag
>>>>     to avoid unnecessary wakeups.
>>>
>>> If I understand it correctly the algorithm requires all events to be the same
>>> size otherwise first overwrite might trash the header,
>>> also the producers should use some kind of signaling to
>>> timestamp each event otherwise it all will look out of order to the consumer.
>>>
>>> At the end it looks inferior to the existing perf ring buffer with overwrite.
>>> Since in both cases the out of order needs to be dealt with
>>> in post processing the main advantage of ring buf vs perf buf is gone.
>>
>> No, the advantage is not gone.
>>
>> The ring buffer is still shared by multiple producers. When an event occurs,
>> the producer queues up to acquire the spin lock of the ring buffer to write
>> event to it. So events in the ring buffer are always ordered, no out of order
>> occurs.
>>
>> And events are not required to be the same size. When an overwrite happens,
>> the events being trashed are discarded, and the overwrite_pos is moved forward
>> to skip these events until it reaches the first event that is not trashed.
>>
>> To make it clear, here are some example diagrams.
>>
>> 1. Let's say we have a ring buffer with size 4096.
>>
>>    At first, {producer,overwrite,pending,consumer}_pos are all set to 0
>>
>>    0       512      1024    1536     2048     2560     3072     3584     4096
>>    +-----------------------------------------------------------------------+
>>    |                                                                       |
>>    |                                                                       |
>>    |                                                                       |
>>    +-----------------------------------------------------------------------+
>>    ^
>>    |
>>    |
>> producer_pos = 0
>> overwrite_pos = 0
>> pending_pos = 0
>> consumer_pos = 0
>>
>> 2. Reserve event A, size 512.
>>
>>    There is enough free space, so A is allocated at offset 0 and producer_pos
>>    is moved to 512, the end of A. Since A is not submitted, the BUSY bit is
>>    set.
>>
>>    0       512      1024    1536     2048     2560     3072     3584     4096
>>    +-----------------------------------------------------------------------+
>>    |        |                                                              |
>>    |   A    |                                                              |
>>    | [BUSY] |                                                              |
>>    +-----------------------------------------------------------------------+
>>    ^        ^
>>    |        |
>>    |        |
>>    |    producer_pos = 512
>>    |
>> overwrite_pos = 0
>> pending_pos = 0
>> consumer_pos = 0
>>
>>
>> 3. Reserve event B, size 1024.
>>
>>    B is allocated at offset 512 with BUSY bit set, and producer_pos is moved
>>    to the end of B.
>>
>>    0       512      1024    1536     2048     2560     3072     3584     4096
>>    +-----------------------------------------------------------------------+
>>    |        |                 |                                            |
>>    |   A    |        B        |                                            |
>>    | [BUSY] |      [BUSY]     |                                            |
>>    +-----------------------------------------------------------------------+
>>    ^                          ^
>>    |                          |
>>    |                          |
>>    |                   producer_pos = 1536
>>    |
>> overwrite_pos = 0
>> pending_pos = 0
>> consumer_pos = 0
>>
>> 4. Reserve event C, size 2048.
>>
>>    C is allocated at offset 1536 and producer_pos becomes 3584.
>>
>>    0       512      1024    1536     2048     2560     3072     3584     4096
>>    +-----------------------------------------------------------------------+
>>    |        |                 |                                   |        |
>>    |    A   |        B        |                 C                 |        |
>>    | [BUSY] |      [BUSY]     |               [BUSY]              |        |
>>    +-----------------------------------------------------------------------+
>>    ^                                                              ^
>>    |                                                              |
>>    |                                                              |
>>    |                                                     producer_pos = 3584
>>    |
>> overwrite_pos = 0
>> pending_pos = 0
>> consumer_pos = 0
>>
>> 5. Submit event A.
>>
>>    The BUSY bit of A is cleared. B becomes the oldest event under writing, so
>>    pending_pos is moved to 512, the start of B.
>>
>>    0       512      1024    1536     2048     2560     3072     3584     4096
>>    +-----------------------------------------------------------------------+
>>    |        |                 |                                   |        |
>>    |    A   |        B        |                 C                 |        |
>>    |        |      [BUSY]     |               [BUSY]              |        |
>>    +-----------------------------------------------------------------------+
>>    ^        ^                                                     ^
>>    |        |                                                     |
>>    |        |                                                     |
>>    |   pending_pos = 512                                 producer_pos = 3584
>>    |
>> overwrite_pos = 0
>> consumer_pos = 0
>>
>> 6. Submit event B.
>>
>>    The BUSY bit of B is cleared, and pending_pos is moved to the start of C,
>>    which is the oldest event under writing now.
>>
>>    0       512      1024    1536     2048     2560     3072     3584     4096
>>    +-----------------------------------------------------------------------+
>>    |        |                 |                                   |        |
>>    |    A   |        B        |                 C                 |        |
>>    |        |                 |               [BUSY]              |        |
>>    +-----------------------------------------------------------------------+
>>    ^                          ^                                   ^
>>    |                          |                                   |
>>    |                          |                                   |
>>    |                     pending_pos = 1536              producer_pos = 3584
>>    |
>> overwrite_pos = 0
>> consumer_pos = 0
>>
>> 7. Reserve event D, size 1536 (3 * 512).
>>
>>    There are 2048 bytes not under writing between producer_pos and pending_pos,
>>    so D is allocated at offset 3584, and producer_pos is moved from 3584 to
>>    5120.
>>
>>    Since event D will overwrite all bytes of event A and the beginning 512
>>    bytes of event B, overwrite_pos is moved to the start of event C, the
>>    oldest event that is not overwritten.
>>
>>    0       512      1024    1536     2048     2560     3072     3584     4096
>>    +-----------------------------------------------------------------------+
>>    |                 |        |                                   |        |
>>    |      D End      |        |                 C                 | D Begin|
>>    |      [BUSY]     |        |               [BUSY]              | [BUSY] |
>>    +-----------------------------------------------------------------------+
>>    ^                 ^        ^
>>    |                 |        |
>>    |                 |   pending_pos = 1536
>>    |                 |   overwrite_pos = 1536
>>    |                 |
>>    |             producer_pos=5120
>>    |
>> consumer_pos = 0
>>
>> 8. Reserve event E, size 1024.
>>
>>    Though there are 512 bytes not under writing between producer_pos and
>>    pending_pos, E can not be reserved, as it would overwrite the first 512
>>    bytes of event C, which is still under writing.
>>
>> 9. Submit events C and D.
>>
>>    pending_pos is moved to the end of D.
>>
>>    0       512      1024    1536     2048     2560     3072     3584     4096
>>    +-----------------------------------------------------------------------+
>>    |                 |        |                                   |        |
>>    |      D End      |        |                 C                 | D Begin|
>>    |                 |        |                                   |        |
>>    +-----------------------------------------------------------------------+
>>    ^                 ^        ^
>>    |                 |        |
>>    |                 |   overwrite_pos = 1536
>>    |                 |
>>    |             producer_pos=5120
>>    |             pending_pos=5120
>>    |
>> consumer_pos = 0
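
The walk-through above can be replayed with a small userspace toy model.
This is only a sketch under stated assumptions: struct rb_model,
rb_reserve(), rb_submit() and rb_replay() are hypothetical names made up
for illustration, event sizes are taken as whole sizes (no header
accounting), positions are kept as non-wrapped logical counters, and
pending_pos is advanced at submit time as drawn in the diagrams. It is
not the code from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define RB_SIZE    4096UL
#define MAX_EVENTS 16

/* Toy model only: names and layout are illustrative, not the patch's. */
struct event {
    unsigned long start;    /* logical (non-wrapped) start position */
    unsigned long len;
    bool busy;              /* reserved but not yet submitted */
};

struct rb_model {
    unsigned long producer_pos;
    unsigned long pending_pos;
    unsigned long overwrite_pos;
    struct event ev[MAX_EVENTS];    /* kept in reservation order */
    int nr;
};

/* Returns the index of the new event, or -1 if the reservation fails. */
static int rb_reserve(struct rb_model *rb, unsigned long len)
{
    unsigned long new_prod = rb->producer_pos + len;
    int i;

    if (len == 0 || len > RB_SIZE || rb->nr == MAX_EVENTS)
        return -1;
    /* Never wrap onto bytes that are still being written. */
    if (new_prod - rb->pending_pos > RB_SIZE)
        return -1;
    /* Discard every old event that the new one (partly) overwrites. */
    for (i = 0; i < rb->nr; i++) {
        struct event *e = &rb->ev[i];

        if (e->start == rb->overwrite_pos &&
            new_prod - rb->overwrite_pos > RB_SIZE)
            rb->overwrite_pos = e->start + e->len;
    }
    rb->ev[rb->nr] = (struct event){ rb->producer_pos, len, true };
    rb->producer_pos = new_prod;
    return rb->nr++;
}

static void rb_submit(struct rb_model *rb, int idx)
{
    int i;

    rb->ev[idx].busy = false;
    /* Advance pending_pos past leading events that are no longer busy. */
    for (i = 0; i < rb->nr; i++) {
        struct event *e = &rb->ev[i];

        if (e->start == rb->pending_pos && !e->busy)
            rb->pending_pos = e->start + e->len;
    }
}

/* Replays steps 2-9; returns 0 if the positions match the diagrams. */
static int rb_replay(void)
{
    struct rb_model rb = { 0 };
    int a = rb_reserve(&rb, 512);       /* step 2 */
    int b = rb_reserve(&rb, 1024);      /* step 3 */
    int c = rb_reserve(&rb, 2048);      /* step 4 */
    int d;

    rb_submit(&rb, a);                  /* step 5 */
    rb_submit(&rb, b);                  /* step 6 */
    d = rb_reserve(&rb, 1536);          /* step 7 */
    if (rb.producer_pos != 5120 || rb.overwrite_pos != 1536 ||
        rb.pending_pos != 1536)
        return 1;
    if (rb_reserve(&rb, 1024) != -1)    /* step 8: E must be refused */
        return 2;
    rb_submit(&rb, c);                  /* step 9 */
    rb_submit(&rb, d);
    return (rb.pending_pos == 5120 && rb.overwrite_pos == 1536) ? 0 : 3;
}
```

In this model, events of different sizes are handled by walking
overwrite_pos forward one whole event at a time, which is why A and B
are both dropped in step 7 even though only the first 512 bytes of B
are actually overwritten.
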
> 
> These diagrams are very helpful in terms of understanding the flow.
> In part 7 when A is overwritten by D, why doesn't the consumer position move forward to
> point to the beginning of C? If the ring buffer producer guarantees ordering of reserved
> slots then C, in this case, is now the oldest reserved. This speaks to your second patch
> where you say that the consumer resolves conflicts by discarding data that has been
> overwritten but I feel like the simpler thing to do is just move the consumer position.
> 

But the consumer may be ahead of overwrite_pos. In this case, moving
consumer_pos back to the oldest event is not correct, as the event has
already been consumed.
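
A minimal sketch of the distinction, using a hypothetical helper
(next_read_pos() is a name made up here for illustration, not something
from the patch): the read position only needs to jump forward when the
consumer has fallen behind overwrite_pos, and must be left alone when
the consumer is already ahead of it.

```c
#include <assert.h>

/*
 * Hypothetical helper, not from the patch: pick the position the
 * consumer should read from next, given logical (non-wrapped) counters.
 */
static unsigned long next_read_pos(unsigned long consumer_pos,
                                   unsigned long overwrite_pos)
{
    /* A signed difference keeps the comparison valid if counters wrap. */
    if ((long)(overwrite_pos - consumer_pos) > 0)
        return overwrite_pos;   /* data at consumer_pos was overwritten */
    return consumer_pos;        /* already past it: do not move back */
}
```

With the positions from step 7, a consumer still at 0 would continue
from 1536, while a consumer that had already read up to 3584 would stay
at 3584.
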