linux-kernel - Re: [PATCH bpf-next 1/4] bpf: Add overwrite mode for bpf ring buffer

Message-ID: <53c46f61-2901-4225-a6e7-a82c2e6663b9@huaweicloud.com>
Date: Tue, 12 Aug 2025 12:02:16 +0800
From: Xu Kuohai <xukuohai@...weicloud.com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: bpf <bpf@...r.kernel.org>,
 "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest@...r.kernel.org>,
 LKML <linux-kernel@...r.kernel.org>, Alexei Starovoitov <ast@...nel.org>,
 Daniel Borkmann <daniel@...earbox.net>, Andrii Nakryiko <andrii@...nel.org>,
 Martin KaFai Lau <martin.lau@...ux.dev>, Eduard Zingerman
 <eddyz87@...il.com>, Yonghong Song <yhs@...com>, Song Liu <song@...nel.org>,
 John Fastabend <john.fastabend@...il.com>, KP Singh <kpsingh@...nel.org>,
 Stanislav Fomichev <sdf@...gle.com>, Hao Luo <haoluo@...gle.com>,
 Jiri Olsa <jolsa@...nel.org>, Mykola Lysenko <mykolal@...com>,
 Shuah Khan <shuah@...nel.org>, Stanislav Fomichev <sdf@...ichev.me>,
 Willem de Bruijn <willemb@...gle.com>, Jason Xing
 <kerneljasonxing@...il.com>, Paul Chaignon <paul.chaignon@...il.com>,
 Tao Chen <chen.dylane@...ux.dev>, Kumar Kartikeya Dwivedi
 <memxor@...il.com>, Martin Kelly <martin.kelly@...wdstrike.com>
Subject: Re: [PATCH bpf-next 1/4] bpf: Add overwrite mode for bpf ring buffer

On 8/9/2025 5:39 AM, Alexei Starovoitov wrote:
> On Sun, Aug 3, 2025 at 7:27 PM Xu Kuohai <xukuohai@...weicloud.com> wrote:
>>
>> From: Xu Kuohai <xukuohai@...wei.com>
>>
>> When the bpf ring buffer is full, new events can not be recorded until
>> the consumer consumes some events to free space. This may cause critical
>> events to be discarded, such as in fault diagnostic, where recent events
>> are more critical than older ones.
>>
>> So add overwrite mode for bpf ring buffer. In this mode, the new event
>> overwrites the oldest event when the buffer is full.
>>
>> The scheme is as follows:
>>
>> 1. producer_pos tracks the next position to write new data. When there
>>     is enough free space, producer simply moves producer_pos forward to
>>     make space for the new event.
>>
>> 2. To avoid waiting for consumer to free space when the buffer is full,
>>     a new variable overwrite_pos is introduced for producer. overwrite_pos
>>     tracks the next event to be overwritten (the oldest event committed) in
>>     the buffer. producer moves it forward to discard the oldest events when
>>     the buffer is full.
>>
>> 3. pending_pos tracks the oldest event under committing. producer ensures
>>     producer_pos never passes pending_pos when making space for new events.
>>     So multiple producers never write to the same position at the same time.
>>
>> 4. producer wakes up consumer every half a round ahead to give it a chance
>>     to retrieve data. However, for an overwrite-mode ring buffer, users
>>     typically only care about the ring buffer snapshot before a fault occurs.
>>     In this case, the producer should commit data with BPF_RB_NO_WAKEUP flag
>>     to avoid unnecessary wakeups.
> 
> If I understand it correctly the algorithm requires all events to be the same
> size otherwise first overwrite might trash the header,
> also the producers should use some kind of signaling to
> timestamp each event otherwise it all will look out of order to the consumer.
> 
> At the end it looks inferior to the existing perf ring buffer with overwrite.
> Since in both cases the out of order needs to be dealt with
> in post processing the main advantage of ring buf vs perf buf is gone.

No, the advantage is not gone.

The ring buffer is still shared by multiple producers. When an event occurs,
the producer queues up for the ring buffer's spin lock before writing the
event. So events in the ring buffer are always ordered, and no out-of-order
events occur.
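
To illustrate the ordering argument, here is a minimal user-space sketch, not
the kernel code. It assumes a hypothetical struct rb_prod and uses a C11
atomic_flag as a stand-in for the ring buffer's spin lock. Because start
positions are handed out while the lock is held, they increase in the same
order in which producers acquired the lock.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical producer-side state; names are for illustration only. */
struct rb_prod {
	atomic_flag lock;       /* stand-in for the ring buffer spin lock */
	uint64_t producer_pos;  /* next logical position to write */
};

/*
 * Reserve len bytes and return the logical start position of the record.
 * Positions are assigned under the lock, so they are strictly ordered
 * by lock acquisition order.
 */
static uint64_t reserve_ordered(struct rb_prod *rb, uint32_t len)
{
	uint64_t start;

	while (atomic_flag_test_and_set_explicit(&rb->lock, memory_order_acquire))
		; /* spin until the lock is free */

	start = rb->producer_pos;
	rb->producer_pos += len;

	atomic_flag_clear_explicit(&rb->lock, memory_order_release);
	return start;
}
```

With producer_pos starting at 0, reserving 512 bytes and then 1024 bytes
returns 0 and 512, and leaves producer_pos at 1536.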

And events are not required to be the same size. When an overwrite happens,
the events being trashed are discarded, and overwrite_pos is moved forward
to skip these events until it reaches the first event that is not trashed.
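
A minimal sketch of that skipping step, under simplifying assumptions that
are mine and not taken from the patch: each record starts with a 4-byte
total length (the real header layout differs), lengths are multiples of 8,
and overwrite_pos is advanced before the new data is written, so the old
headers are still readable. struct rb_model, put_len() and get_len() are
hypothetical helpers.

```c
#include <stdint.h>
#include <string.h>

#define RB_SIZE 4096u            /* ring size, a power of two */
#define RB_MASK (RB_SIZE - 1)

/* Hypothetical model of the ring; positions are logical (never wrapped). */
struct rb_model {
	uint64_t overwrite_pos;  /* start of the oldest intact record */
	uint8_t data[RB_SIZE];
};

/* Store the total record length at the start of the record. */
static void put_len(struct rb_model *rb, uint64_t pos, uint32_t len)
{
	memcpy(&rb->data[pos & RB_MASK], &len, sizeof(len));
}

/* Load the total record length stored at the start of the record. */
static uint32_t get_len(const struct rb_model *rb, uint64_t pos)
{
	uint32_t len;

	memcpy(&len, &rb->data[pos & RB_MASK], sizeof(len));
	return len;
}

/*
 * A record starting at pos is (at least partly) trashed once
 * new_prod_pos has moved more than RB_SIZE bytes past pos.
 * Skip whole records until the first one that stays intact.
 * Assumes every header holds a valid, non-zero length.
 */
static void advance_overwrite_pos(struct rb_model *rb, uint64_t new_prod_pos)
{
	while (rb->overwrite_pos + RB_SIZE < new_prod_pos)
		rb->overwrite_pos += get_len(rb, rb->overwrite_pos);
}
```

With records of 512, 1024 and 2048 bytes at positions 0, 512 and 1536,
moving the producer to 5120 skips the first two records, one of them only
partly overwritten, and leaves overwrite_pos at 1536.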

To make it clear, here are some example diagrams.

1. Let's say we have a ring buffer with size 4096.

    At first, {producer,overwrite,pending,consumer}_pos are all set to 0.

    0       512      1024    1536     2048     2560     3072     3584       4096
    +-----------------------------------------------------------------------+
    |                                                                       |
    |                                                                       |
    |                                                                       |
    +-----------------------------------------------------------------------+
    ^
    |
    |
producer_pos = 0
overwrite_pos = 0
pending_pos = 0
consumer_pos = 0

2. Reserve event A, size 512.

    There is enough free space, so A is allocated at offset 0 and producer_pos
    is moved to 512, the end of A. Since A is not submitted, the BUSY bit is
    set.

    0       512      1024    1536     2048     2560     3072     3584       4096
    +-----------------------------------------------------------------------+
    |        |                                                              |
    |   A    |                                                              |
    | [BUSY] |                                                              |
    +-----------------------------------------------------------------------+
    ^        ^
    |        |
    |        |
    |    producer_pos = 512
    |
overwrite_pos = 0
pending_pos = 0
consumer_pos = 0


3. Reserve event B, size 1024.

    B is allocated at offset 512 with BUSY bit set, and producer_pos is moved
    to the end of B.

    0       512      1024    1536     2048     2560     3072     3584       4096
    +-----------------------------------------------------------------------+
    |        |                 |                                            |
    |   A    |        B        |                                            |
    | [BUSY] |      [BUSY]     |                                            |
    +-----------------------------------------------------------------------+
    ^                          ^
    |                          |
    |                          |
    |                   producer_pos = 1536
    |
overwrite_pos = 0
pending_pos = 0
consumer_pos = 0

4. Reserve event C, size 2048.

    C is allocated at offset 1536 and producer_pos becomes 3584.

    0       512      1024    1536     2048     2560     3072     3584       4096
    +-----------------------------------------------------------------------+
    |        |                 |                                   |        |
    |    A   |        B        |                 C                 |        |
    | [BUSY] |      [BUSY]     |               [BUSY]              |        |
    +-----------------------------------------------------------------------+
    ^                                                              ^
    |                                                              |
    |                                                              |
    |                                                    producer_pos = 3584
    |
overwrite_pos = 0
pending_pos = 0
consumer_pos = 0

5. Submit event A.

    The BUSY bit of A is cleared. B becomes the oldest event under writing, so
    pending_pos is moved to 512, the start of B.

    0       512      1024    1536     2048     2560     3072     3584       4096
    +-----------------------------------------------------------------------+
    |        |                 |                                   |        |
    |    A   |        B        |                 C                 |        |
    |        |      [BUSY]     |               [BUSY]              |        |
    +-----------------------------------------------------------------------+
    ^        ^                                                     ^
    |        |                                                     |
    |        |                                                     |
    |   pending_pos = 512                                  producer_pos = 3584
    |
overwrite_pos = 0
consumer_pos = 0

6. Submit event B.

    The BUSY bit of B is cleared, and pending_pos is moved to the start of C,
    which is the oldest event under writing now.

    0       512      1024    1536     2048     2560     3072     3584       4096
    +-----------------------------------------------------------------------+
    |        |                 |                                   |        |
    |    A   |        B        |                 C                 |        |
    |        |                 |               [BUSY]              |        |
    +-----------------------------------------------------------------------+
    ^                          ^                                   ^
    |                          |                                   |
    |                          |                                   |
    |                     pending_pos = 1536               producer_pos = 3584
    |
overwrite_pos = 0
consumer_pos = 0
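
Steps 5 and 6 can be modelled with a small sketch. It assumes a
hypothetical array of record descriptors ordered by start position; the
real buffer keeps the BUSY bit in each record header instead, and the patch
may update pending_pos at a different point than this model does.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record descriptor, for illustration only. */
struct rec {
	uint64_t start;  /* logical start position */
	uint32_t len;    /* total length in bytes */
	bool busy;       /* reserved but not yet submitted */
};

/*
 * Clear the BUSY flag of record idx and return the new pending_pos:
 * the start of the oldest record that is still busy, or producer_pos
 * if no record is busy. recs[] must be ordered by start position.
 */
static uint64_t submit_and_update_pending(struct rec *recs, size_t n,
					  size_t idx, uint64_t producer_pos)
{
	recs[idx].busy = false;

	for (size_t i = 0; i < n; i++)
		if (recs[i].busy)
			return recs[i].start;

	return producer_pos;
}
```

For A, B and C as above with producer_pos = 3584, submitting A yields
pending_pos = 512 and then submitting B yields 1536. If B were submitted
before A, pending_pos would stay at 0 until A is submitted.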

7. Reserve event D, size 1536 (3 * 512).

    There are 2048 bytes not under writing between producer_pos and pending_pos,
    so D is allocated at offset 3584, and producer_pos is moved from 3584 to
    5120.

    Since event D will overwrite all bytes of event A and the first 512 bytes
    of event B, overwrite_pos is moved to the start of event C, the oldest event
    that is not overwritten.

    0       512      1024    1536     2048     2560     3072     3584       4096
    +-----------------------------------------------------------------------+
    |                 |        |                                   |        |
    |      D End      |        |                 C                 | D Begin|
    |      [BUSY]     |        |               [BUSY]              | [BUSY] |
    +-----------------------------------------------------------------------+
    ^                 ^        ^
    |                 |        |
    |                 |   pending_pos = 1536
    |                 |   overwrite_pos = 1536
    |                 |
    |             producer_pos=5120
    |
consumer_pos = 0

8. Reserve event E, size 1024.

    Only 512 bytes are not under writing between producer_pos and
    pending_pos, so E can not be reserved: it would overwrite the first 512
    bytes of event C, which is still under writing.
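
    The check behind steps 7 and 8 can be sketched as below. This is a
    simplified model with hypothetical names (struct rb_pos, try_reserve);
    the exact boundary condition used in the patch may differ.

```c
#include <stdbool.h>
#include <stdint.h>

#define RB_SIZE 4096u  /* ring size in bytes */

/* Hypothetical position state; all positions are logical (never wrapped). */
struct rb_pos {
	uint64_t producer_pos;  /* next position to write */
	uint64_t pending_pos;   /* oldest record still under writing */
};

/*
 * Try to reserve len bytes. The reservation is refused if the new
 * record would reach into a record that is still under writing,
 * i.e. if producer_pos would end up more than RB_SIZE bytes ahead
 * of pending_pos. On success the start position is stored in *start.
 */
static bool try_reserve(struct rb_pos *rb, uint32_t len, uint64_t *start)
{
	uint64_t new_prod_pos = rb->producer_pos + len;

	if (new_prod_pos - rb->pending_pos > RB_SIZE)
		return false;

	*start = rb->producer_pos;
	rb->producer_pos = new_prod_pos;
	return true;
}
```

    With producer_pos = 3584 and pending_pos = 1536, reserving D (1536
    bytes) succeeds and moves producer_pos to 5120. Reserving E (1024
    bytes) then fails, because 6144 - 1536 = 4608 exceeds 4096.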

9. Submit events C and D.

    pending_pos is moved to the end of D.

    0       512      1024    1536     2048     2560     3072     3584       4096
    +-----------------------------------------------------------------------+
    |                 |        |                                   |        |
    |      D End      |        |                 C                 | D Begin|
    |                 |        |                                   |        |
    +-----------------------------------------------------------------------+
    ^                 ^        ^
    |                 |        |
    |                 |   overwrite_pos = 1536
    |                 |
    |             producer_pos=5120
    |             pending_pos=5120
    |
consumer_pos = 0