linux-kernel - Re: [PATCH bpf-next v3 1/3] bpf: Add overwrite mode for BPF ring buffer

Message-ID: <CAEf4BzZqHo0kOa1Zc-syy9GZHUhEHEK0_0zLxFFpMhSZUc2_Qg@mail.gmail.com>
Date: Mon, 27 Oct 2025 19:47:52 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Xu Kuohai <xukuohai@...weicloud.com>
Cc: bpf@...r.kernel.org, linux-kselftest@...r.kernel.org, 
	linux-kernel@...r.kernel.org, Alexei Starovoitov <ast@...nel.org>, 
	Daniel Borkmann <daniel@...earbox.net>, Andrii Nakryiko <andrii@...nel.org>, 
	Martin KaFai Lau <martin.lau@...ux.dev>, Eduard Zingerman <eddyz87@...il.com>, Yonghong Song <yhs@...com>, 
	Song Liu <song@...nel.org>
Subject: Re: [PATCH bpf-next v3 1/3] bpf: Add overwrite mode for BPF ring buffer

On Fri, Oct 17, 2025 at 9:04 PM Xu Kuohai <xukuohai@...weicloud.com> wrote:
>
> From: Xu Kuohai <xukuohai@...wei.com>
>
> When the BPF ring buffer is full, a new event cannot be recorded until one
> or more old events are consumed to make enough space for it. In cases such
> as fault diagnostics, where recent events are more useful than older ones,
> this mechanism may lead to critical events being lost.
>
> So add overwrite mode for BPF ring buffer to address it. In this mode, the
> new event overwrites the oldest event when the buffer is full.
>
> The basic idea is as follows:
>
> 1. producer_pos tracks the next position at which a new event is recorded.
>    When there is enough free space, the producer simply advances producer_pos
>    to make space for the new event.
>
> 2. To avoid waiting for consumer when the buffer is full, a new variable,
>    overwrite_pos, is introduced for producer. It points to the oldest event
>    committed in the buffer. It is advanced by producer to discard one or more
>    oldest events to make space for the new event when the buffer is full.
>
> 3. pending_pos tracks the oldest event to be committed. pending_pos is never
>    passed by producer_pos, so multiple producers never write to the same
>    position at the same time.
>
> The following example diagrams show how it works in a 4096-byte ring buffer.
>
> 1. At first, {producer,overwrite,pending,consumer}_pos are all set to 0.
>
>    0       512      1024    1536     2048     2560     3072     3584       4096
>    +-----------------------------------------------------------------------+
>    |                                                                       |
>    |                                                                       |
>    |                                                                       |
>    +-----------------------------------------------------------------------+
>    ^
>    |
>    |
> producer_pos = 0
> overwrite_pos = 0
> pending_pos = 0
> consumer_pos = 0
>
> 2. Now reserve a 512-byte event A.
>
>    There is enough free space, so A is allocated at offset 0. And producer_pos
>    is advanced to 512, the end of A. Since A is not submitted, the BUSY bit is
>    set.
>
>    0       512      1024    1536     2048     2560     3072     3584       4096
>    +-----------------------------------------------------------------------+
>    |        |                                                              |
>    |   A    |                                                              |
>    | [BUSY] |                                                              |
>    +-----------------------------------------------------------------------+
>    ^        ^
>    |        |
>    |        |
>    |    producer_pos = 512
>    |
> overwrite_pos = 0
> pending_pos = 0
> consumer_pos = 0
>
> 3. Reserve event B, size 1024.
>
>    B is allocated at offset 512 with BUSY bit set, and producer_pos is advanced
>    to the end of B.
>
>    0       512      1024    1536     2048     2560     3072     3584       4096
>    +-----------------------------------------------------------------------+
>    |        |                 |                                            |
>    |   A    |        B        |                                            |
>    | [BUSY] |      [BUSY]     |                                            |
>    +-----------------------------------------------------------------------+
>    ^                          ^
>    |                          |
>    |                          |
>    |                   producer_pos = 1536
>    |
> overwrite_pos = 0
> pending_pos = 0
> consumer_pos = 0
>
> 4. Reserve event C, size 2048.
>
>    C is allocated at offset 1536, and producer_pos is advanced to 3584.
>
>    0       512      1024    1536     2048     2560     3072     3584       4096
>    +-----------------------------------------------------------------------+
>    |        |                 |                                   |        |
>    |    A   |        B        |                 C                 |        |
>    | [BUSY] |      [BUSY]     |               [BUSY]              |        |
>    +-----------------------------------------------------------------------+
>    ^                                                              ^
>    |                                                              |
>    |                                                              |
>    |                                                    producer_pos = 3584
>    |
> overwrite_pos = 0
> pending_pos = 0
> consumer_pos = 0
>
> 5. Submit event A.
>
>    The BUSY bit of A is cleared. B becomes the oldest event to be committed, so
>    pending_pos is advanced to 512, the start of B.
>
>    0       512      1024    1536     2048     2560     3072     3584       4096
>    +-----------------------------------------------------------------------+
>    |        |                 |                                   |        |
>    |    A   |        B        |                 C                 |        |
>    |        |      [BUSY]     |               [BUSY]              |        |
>    +-----------------------------------------------------------------------+
>    ^        ^                                                     ^
>    |        |                                                     |
>    |        |                                                     |
>    |   pending_pos = 512                                  producer_pos = 3584
>    |
> overwrite_pos = 0
> consumer_pos = 0
>
> 6. Submit event B.
>
>    The BUSY bit of B is cleared, and pending_pos is advanced to the start of C,
>    which is now the oldest event to be committed.
>
>    0       512      1024    1536     2048     2560     3072     3584       4096
>    +-----------------------------------------------------------------------+
>    |        |                 |                                   |        |
>    |    A   |        B        |                 C                 |        |
>    |        |                 |               [BUSY]              |        |
>    +-----------------------------------------------------------------------+
>    ^                          ^                                   ^
>    |                          |                                   |
>    |                          |                                   |
>    |                     pending_pos = 1536               producer_pos = 3584
>    |
> overwrite_pos = 0
> consumer_pos = 0
>
> 7. Reserve event D, size 1536 (3 * 512).
>
>    There are 2048 bytes not being written between producer_pos (currently 3584)
>    and pending_pos, so D is allocated at offset 3584, and producer_pos is advanced
>    by 1536 (from 3584 to 5120).
>
>    Since event D will overwrite all bytes of event A and the first 512 bytes of
>    event B, overwrite_pos is advanced to the start of event C, the oldest event
>    that is not overwritten.
>
>    0       512      1024    1536     2048     2560     3072     3584       4096
>    +-----------------------------------------------------------------------+
>    |                 |        |                                   |        |
>    |      D End      |        |                 C                 | D Begin|
>    |      [BUSY]     |        |               [BUSY]              | [BUSY] |
>    +-----------------------------------------------------------------------+
>    ^                 ^        ^
>    |                 |        |
>    |                 |   pending_pos = 1536
>    |                 |   overwrite_pos = 1536
>    |                 |
>    |             producer_pos=5120
>    |
> consumer_pos = 0
>
> 8. Reserve event E, size 1024.
>
>    Although there are 512 bytes not being written between producer_pos and
>    pending_pos, E cannot be reserved, as it would overwrite the first 512
>    bytes of event C, which is still being written.
>
> 9. Submit events C and D.
>
>    pending_pos is advanced to the end of D.
>
>    0       512      1024    1536     2048     2560     3072     3584       4096
>    +-----------------------------------------------------------------------+
>    |                 |        |                                   |        |
>    |      D End      |        |                 C                 | D Begin|
>    |                 |        |                                   |        |
>    +-----------------------------------------------------------------------+
>    ^                 ^        ^
>    |                 |        |
>    |                 |   overwrite_pos = 1536
>    |                 |
>    |             producer_pos=5120
>    |             pending_pos=5120
>    |
> consumer_pos = 0
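
For readers following the walkthrough, here is a minimal user-space sketch of how overwrite_pos could advance in step 7. It is only a model under stated assumptions, not the code from this patch: model_advance_overwrite() and MODEL_RB_SIZE are hypothetical names, record lengths are passed in as an array instead of being read from record headers, and header overhead and rounding are ignored, as in the diagrams.

```c
#include <stddef.h>

#define MODEL_RB_SIZE 4096UL	/* ring size used in the example diagrams */

/*
 * rec_lens[] holds the lengths of the records that start at over_pos, oldest
 * first. Skip whole records until the span [over_pos, new_prod_pos) fits in
 * the ring again, and return the new overwrite position.
 */
static unsigned long model_advance_overwrite(const unsigned long *rec_lens,
					     size_t nr_recs,
					     unsigned long over_pos,
					     unsigned long new_prod_pos)
{
	size_t i = 0;

	while (i < nr_recs && new_prod_pos - over_pos > MODEL_RB_SIZE)
		over_pos += rec_lens[i++];

	return over_pos;
}
```

With record lengths {512, 1024, 2048} for A, B and C, over_pos = 0 and new_prod_pos = 5120, the model skips A and B and returns 1536, the start of C, which matches the step 7 diagram.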
>
> The performance data for overwrite mode will be provided in a follow-up
> patch that adds overwrite-mode benchmarks.
>
> A sample of performance data for non-overwrite mode, collected on an x86_64
> CPU and an arm64 CPU, before and after this patch, is shown below. As we can
> see, no obvious performance regression occurs.
>
> - x86_64 (AMD EPYC 9654)
>
> Before:
>
> Ringbuf, multi-producer contention
> ==================================
> rb-libbpf nr_prod 1  11.623 ± 0.027M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 2  15.812 ± 0.014M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 3  7.871 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 4  6.703 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 8  2.896 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 12 2.054 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 16 1.864 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 20 1.580 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 24 1.484 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 28 1.369 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 32 1.316 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 36 1.272 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 40 1.239 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 44 1.226 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 48 1.213 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 52 1.193 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>
> After:
>
> Ringbuf, multi-producer contention
> ==================================
> rb-libbpf nr_prod 1  11.845 ± 0.036M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 2  15.889 ± 0.006M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 3  8.155 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 4  6.708 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 8  2.918 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 12 2.065 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 16 1.870 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 20 1.582 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 24 1.482 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 28 1.372 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 32 1.323 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 36 1.264 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 40 1.236 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 44 1.209 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 48 1.189 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 52 1.165 ± 0.002M/s (drops 0.000 ± 0.000M/s)
>
> - arm64 (HiSilicon Kunpeng 920)
>
> Before:
>
> Ringbuf, multi-producer contention
> ==================================
> rb-libbpf nr_prod 1  11.310 ± 0.623M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 2  9.947 ± 0.004M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 3  6.634 ± 0.011M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 4  4.502 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 8  3.888 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 12 3.372 ± 0.005M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 16 3.189 ± 0.010M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 20 2.998 ± 0.006M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 24 3.086 ± 0.018M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 28 2.845 ± 0.004M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 32 2.815 ± 0.008M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 36 2.771 ± 0.009M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 40 2.814 ± 0.011M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 44 2.752 ± 0.006M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 48 2.695 ± 0.006M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 52 2.710 ± 0.006M/s (drops 0.000 ± 0.000M/s)
>
> After:
>
> Ringbuf, multi-producer contention
> ==================================
> rb-libbpf nr_prod 1  11.283 ± 0.550M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 2  9.993 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 3  6.898 ± 0.006M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 4  5.257 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 8  3.830 ± 0.005M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 12 3.528 ± 0.013M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 16 3.265 ± 0.018M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 20 2.990 ± 0.007M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 24 2.929 ± 0.014M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 28 2.898 ± 0.010M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 32 2.818 ± 0.006M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 36 2.789 ± 0.012M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 40 2.770 ± 0.006M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 44 2.651 ± 0.007M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 48 2.669 ± 0.005M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 52 2.695 ± 0.009M/s (drops 0.000 ± 0.000M/s)
>
> Signed-off-by: Xu Kuohai <xukuohai@...wei.com>
> ---
>  include/uapi/linux/bpf.h       |   4 ++
>  kernel/bpf/ringbuf.c           | 109 +++++++++++++++++++++++++++------
>  tools/include/uapi/linux/bpf.h |   4 ++
>  3 files changed, 98 insertions(+), 19 deletions(-)
>

[...]

> @@ -72,6 +73,8 @@ struct bpf_ringbuf {
>          */
>         unsigned long consumer_pos __aligned(PAGE_SIZE);
>         unsigned long producer_pos __aligned(PAGE_SIZE);
> +       /* points to the record right after the last overwritten one */
> +       unsigned long overwrite_pos;

I moved this after pending_pos. All of these fields are exposed to
user space, so I didn't want to shift pending_pos unnecessarily.
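
To illustrate the layout point, a user-space sketch of the reordered tail of the struct follows. It is a model only: struct rb_layout_model and MODEL_PAGE_SIZE are made-up names, the fields that precede consumer_pos in the real struct are omitted, and a 4096-byte page is assumed.

```c
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096	/* assumed page size for this sketch */

/* Model of the user-visible tail of struct bpf_ringbuf, new field last. */
struct rb_layout_model {
	unsigned long consumer_pos __attribute__((aligned(MODEL_PAGE_SIZE)));
	unsigned long producer_pos __attribute__((aligned(MODEL_PAGE_SIZE)));
	unsigned long pending_pos;
	unsigned long overwrite_pos;	/* added after pending_pos */
	char data[] __attribute__((aligned(MODEL_PAGE_SIZE)));
};

/* pending_pos keeps the offset it had before overwrite_pos was added. */
_Static_assert(offsetof(struct rb_layout_model, pending_pos) ==
	       offsetof(struct rb_layout_model, producer_pos) +
	       sizeof(unsigned long),
	       "pending_pos must stay right after producer_pos");
```

With this ordering, anything that already reads pending_pos at its current offset within the producer page keeps working, and overwrite_pos takes the next free slot.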

>         unsigned long pending_pos;
>         char data[] __aligned(PAGE_SIZE);
>  };
> @@ -166,7 +169,7 @@ static void bpf_ringbuf_notify(struct irq_work *work)
>   * considering that the maximum value of data_sz is (4GB - 1), there
>   * will be no overflow, so just note the size limit in the comments.
>   */
> -static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node)
> +static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node, bool overwrite_mode)
>  {
>         struct bpf_ringbuf *rb;
>
> @@ -183,17 +186,25 @@ static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node)
>         rb->consumer_pos = 0;
>         rb->producer_pos = 0;
>         rb->pending_pos = 0;
> +       rb->overwrite_mode = overwrite_mode;
>
>         return rb;
>  }
>
>  static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
>  {
> +       bool overwrite_mode = false;
>         struct bpf_ringbuf_map *rb_map;
>
>         if (attr->map_flags & ~RINGBUF_CREATE_FLAG_MASK)
>                 return ERR_PTR(-EINVAL);
>
> +       if (attr->map_flags & BPF_F_RB_OVERWRITE) {
> +               if (attr->map_type == BPF_MAP_TYPE_USER_RINGBUF)

This seemed error-prone if we ever add another ringbuf type (unlikely,
but still), so I inverted the check to allow BPF_F_RB_OVERWRITE only
for BPF_MAP_TYPE_RINGBUF. We should be as strict as possible by
default.
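
Roughly, the inverted check has the shape sketched below. This is a standalone illustration, not the applied hunk: the MODEL_* constants are placeholders with made-up values (the real ones live in include/uapi/linux/bpf.h), and model_check_overwrite() is a hypothetical helper returning a plain errno instead of ERR_PTR().

```c
#include <errno.h>
#include <stdbool.h>

/* Placeholder values for illustration only, not the UAPI constants. */
enum model_map_type {
	MODEL_MAP_TYPE_RINGBUF = 1,
	MODEL_MAP_TYPE_USER_RINGBUF = 2,
};
#define MODEL_F_RB_OVERWRITE (1U << 0)

/* Returns 0 and sets *overwrite_mode, or -EINVAL if the flag is rejected. */
static int model_check_overwrite(enum model_map_type type, unsigned int flags,
				 bool *overwrite_mode)
{
	*overwrite_mode = false;

	if (flags & MODEL_F_RB_OVERWRITE) {
		/* allowlist: any type other than the plain ringbuf is refused */
		if (type != MODEL_MAP_TYPE_RINGBUF)
			return -EINVAL;
		*overwrite_mode = true;
	}

	return 0;
}
```

Written as an allowlist, a ringbuf type added later is rejected until someone explicitly opts it in, instead of silently inheriting overwrite mode.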

> +                       return ERR_PTR(-EINVAL);
> +               overwrite_mode = true;
> +       }
> +
>         if (attr->key_size || attr->value_size ||
>             !is_power_of_2(attr->max_entries) ||
>             !PAGE_ALIGNED(attr->max_entries))
> @@ -205,7 +216,7 @@ static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
>
>         bpf_map_init_from_attr(&rb_map->map, attr);
>
> -       rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node);
> +       rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node, overwrite_mode);
>         if (!rb_map->rb) {
>                 bpf_map_area_free(rb_map);
>                 return ERR_PTR(-ENOMEM);
> @@ -293,13 +304,25 @@ static int ringbuf_map_mmap_user(struct bpf_map *map, struct vm_area_struct *vma
>         return remap_vmalloc_range(vma, rb_map->rb, vma->vm_pgoff + RINGBUF_PGOFF);
>  }
>
> +/* Return an estimate of the available data in the ring buffer.

Fixed up the comment style.

[...]

>  static u32 ringbuf_total_data_sz(const struct bpf_ringbuf *rb)
> @@ -402,11 +425,41 @@ bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr)
>         return (void*)((addr & PAGE_MASK) - off);
>  }
>
> +static bool bpf_ringbuf_has_space(const struct bpf_ringbuf *rb,
> +                                 unsigned long new_prod_pos,
> +                                 unsigned long cons_pos,
> +                                 unsigned long pend_pos)
> +{
> +       /* no space if oldest not yet committed record until the newest
> +        * record span more than (ringbuf_size - 1).
> +        */

Same here. Keep in mind that we now use the kernel-wide comment style,
with /* on its own line. I fixed up all the other places as well.

> +       if (new_prod_pos - pend_pos > rb->mask)
> +               return false;
> +
> +       /* ok, we have space in overwrite mode */
> +       if (unlikely(rb->overwrite_mode))
> +               return true;
> +
> +       /* no space if producer position advances more than (ringbuf_size - 1)
> +        * ahead of consumer position when not in overwrite mode.
> +        */
> +       if (new_prod_pos - cons_pos > rb->mask)
> +               return false;
> +
> +       return true;
> +}
> +
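
As a sanity check, the space test above can be replayed in user space against steps 7 and 8 of the commit message. The sketch below mirrors the quoted logic, but struct rb_space_model and model_has_space() are stand-in names, and record header overhead is ignored, as in the walkthrough.

```c
#include <stdbool.h>

/* Stand-in for the two struct bpf_ringbuf fields the check reads. */
struct rb_space_model {
	unsigned long mask;		/* ring size - 1 */
	bool overwrite_mode;
};

static bool model_has_space(const struct rb_space_model *rb,
			    unsigned long new_prod_pos,
			    unsigned long cons_pos,
			    unsigned long pend_pos)
{
	/* never run over a record that is still being written */
	if (new_prod_pos - pend_pos > rb->mask)
		return false;

	/* in overwrite mode the consumer position does not limit producers */
	if (rb->overwrite_mode)
		return true;

	/* otherwise the producer may not run over unconsumed data */
	if (new_prod_pos - cons_pos > rb->mask)
		return false;

	return true;
}
```

With mask = 4095 and overwrite mode on, reserving D (new_prod_pos = 5120, pend_pos = 1536) passes because 5120 - 1536 = 3584, while reserving E (new_prod_pos = 6144) fails because 6144 - 1536 = 4608 exceeds the mask. With overwrite mode off and cons_pos = 0, D is rejected as well.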

[...]