Message-ID: <CAEf4BzZqHo0kOa1Zc-syy9GZHUhEHEK0_0zLxFFpMhSZUc2_Qg@mail.gmail.com>
Date: Mon, 27 Oct 2025 19:47:52 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Xu Kuohai <xukuohai@...weicloud.com>
Cc: bpf@...r.kernel.org, linux-kselftest@...r.kernel.org,
linux-kernel@...r.kernel.org, Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, Andrii Nakryiko <andrii@...nel.org>,
Martin KaFai Lau <martin.lau@...ux.dev>, Eduard Zingerman <eddyz87@...il.com>, Yonghong Song <yhs@...com>,
Song Liu <song@...nel.org>
Subject: Re: [PATCH bpf-next v3 1/3] bpf: Add overwrite mode for BPF ring buffer

On Fri, Oct 17, 2025 at 9:04 PM Xu Kuohai <xukuohai@...weicloud.com> wrote:
>
> From: Xu Kuohai <xukuohai@...wei.com>
>
> When the BPF ring buffer is full, a new event cannot be recorded until one
> or more old events are consumed to make enough space for it. In cases such
> as fault diagnostics, where recent events are more useful than older ones,
> this mechanism may lead to critical events being lost.
>
> So add overwrite mode for BPF ring buffer to address it. In this mode, the
> new event overwrites the oldest event when the buffer is full.
>
> The basic idea is as follows:
>
> 1. producer_pos tracks the next position to record new event. When there
> is enough free space, producer_pos is simply advanced by producer to
> make space for the new event.
>
> 2. To avoid waiting for consumer when the buffer is full, a new variable,
> overwrite_pos, is introduced for producer. It points to the oldest event
> committed in the buffer. It is advanced by producer to discard one or more
> oldest events to make space for the new event when the buffer is full.
>
> 3. pending_pos tracks the oldest event to be committed. pending_pos is never
> passed by producer_pos, so multiple producers never write to the same
> position at the same time.
>
> The following example diagrams show how it works in a 4096-byte ring buffer.
>
> 1. At first, {producer,overwrite,pending,consumer}_pos are all set to 0.
>
> 0 512 1024 1536 2048 2560 3072 3584 4096
> +-----------------------------------------------------------------------+
> | |
> | |
> | |
> +-----------------------------------------------------------------------+
> ^
> |
> |
> producer_pos = 0
> overwrite_pos = 0
> pending_pos = 0
> consumer_pos = 0
>
> 2. Now reserve a 512-byte event A.
>
> There is enough free space, so A is allocated at offset 0. And producer_pos
> is advanced to 512, the end of A. Since A is not submitted, the BUSY bit is
> set.
>
> 0 512 1024 1536 2048 2560 3072 3584 4096
> +-----------------------------------------------------------------------+
> | | |
> | A | |
> | [BUSY] | |
> +-----------------------------------------------------------------------+
> ^ ^
> | |
> | |
> | producer_pos = 512
> |
> overwrite_pos = 0
> pending_pos = 0
> consumer_pos = 0
>
> 3. Reserve event B, size 1024.
>
> B is allocated at offset 512 with BUSY bit set, and producer_pos is advanced
> to the end of B.
>
> 0 512 1024 1536 2048 2560 3072 3584 4096
> +-----------------------------------------------------------------------+
> | | | |
> | A | B | |
> | [BUSY] | [BUSY] | |
> +-----------------------------------------------------------------------+
> ^ ^
> | |
> | |
> | producer_pos = 1536
> |
> overwrite_pos = 0
> pending_pos = 0
> consumer_pos = 0
>
> 4. Reserve event C, size 2048.
>
> C is allocated at offset 1536, and producer_pos is advanced to 3584.
>
> 0 512 1024 1536 2048 2560 3072 3584 4096
> +-----------------------------------------------------------------------+
> | | | | |
> | A | B | C | |
> | [BUSY] | [BUSY] | [BUSY] | |
> +-----------------------------------------------------------------------+
> ^ ^
> | |
> | |
> | producer_pos = 3584
> |
> overwrite_pos = 0
> pending_pos = 0
> consumer_pos = 0
>
> 5. Submit event A.
>
> The BUSY bit of A is cleared. B becomes the oldest event to be committed, so
> pending_pos is advanced to 512, the start of B.
>
> 0 512 1024 1536 2048 2560 3072 3584 4096
> +-----------------------------------------------------------------------+
> | | | | |
> | A | B | C | |
> | | [BUSY] | [BUSY] | |
> +-----------------------------------------------------------------------+
> ^ ^ ^
> | | |
> | | |
> | pending_pos = 512 producer_pos = 3584
> |
> overwrite_pos = 0
> consumer_pos = 0
>
> 6. Submit event B.
>
> The BUSY bit of B is cleared, and pending_pos is advanced to the start of C,
> which is now the oldest event to be committed.
>
> 0 512 1024 1536 2048 2560 3072 3584 4096
> +-----------------------------------------------------------------------+
> | | | | |
> | A | B | C | |
> | | | [BUSY] | |
> +-----------------------------------------------------------------------+
> ^ ^ ^
> | | |
> | | |
> | pending_pos = 1536 producer_pos = 3584
> |
> overwrite_pos = 0
> consumer_pos = 0
>
> 7. Reserve event D, size 1536 (3 * 512).
>
> There are 2048 bytes not being written between producer_pos (currently 3584)
> and pending_pos, so D is allocated at offset 3584, and producer_pos is advanced
> by 1536 (from 3584 to 5120).
>
> Since event D will overwrite all bytes of event A and the first 512 bytes of
> event B, overwrite_pos is advanced to the start of event C, the oldest event
> that is not overwritten.
>
> 0 512 1024 1536 2048 2560 3072 3584 4096
> +-----------------------------------------------------------------------+
> | | | | |
> | D End | | C | D Begin|
> | [BUSY] | | [BUSY] | [BUSY] |
> +-----------------------------------------------------------------------+
> ^ ^ ^
> | | |
> | | pending_pos = 1536
> | | overwrite_pos = 1536
> | |
> | producer_pos=5120
> |
> consumer_pos = 0
>
> 8. Reserve event E, size 1024.
>
> Although there are 512 bytes not being written between producer_pos and
> pending_pos, E cannot be reserved, as it would overwrite the first 512
> bytes of event C, which is still being written.
>
> 9. Submit event C and D.
>
> pending_pos is advanced to the end of D.
>
> 0 512 1024 1536 2048 2560 3072 3584 4096
> +-----------------------------------------------------------------------+
> | | | | |
> | D End | | C | D Begin|
> | | | | |
> +-----------------------------------------------------------------------+
> ^ ^ ^
> | | |
> | | overwrite_pos = 1536
> | |
> | producer_pos=5120
> | pending_pos=5120
> |
> consumer_pos = 0
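
The walkthrough above can be replayed with a small user-space model. This is only a sketch under stated assumptions: rb_sim, rb_reserve and rb_commit are hypothetical names, record sizes are treated as total sizes (headers and rounding ignored), positions are logical and never wrap, the overwrite loop reuses the same "> mask" comparison as the quoted space check, and pending_pos is advanced at submit time as the description says, which may differ from where the kernel actually does it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RB_SIZE  4096UL
#define RB_MASK  (RB_SIZE - 1)
#define MAX_RECS 16

struct rec {
	unsigned long start;	/* logical start position */
	unsigned long len;	/* total record size (model assumption) */
	bool busy;		/* reserved but not yet submitted */
};

struct rb_sim {
	unsigned long producer_pos;
	unsigned long overwrite_pos;
	unsigned long pending_pos;
	struct rec recs[MAX_RECS];
	int nr;
};

/* Find the record that starts at logical position pos, or NULL. */
static struct rec *rb_find(struct rb_sim *s, unsigned long pos)
{
	for (int i = 0; i < s->nr; i++)
		if (s->recs[i].start == pos)
			return &s->recs[i];
	return NULL;
}

/* Reserve len bytes; returns a record index, or -1 if there is no space. */
static int rb_reserve(struct rb_sim *s, unsigned long len)
{
	unsigned long new_prod = s->producer_pos + len;

	assert(len > 0 && len <= RB_MASK);
	/* Never run over a record that is still being written. */
	if (new_prod - s->pending_pos > RB_MASK || s->nr == MAX_RECS)
		return -1;

	/* Discard whole records from the oldest end until the new one fits. */
	while (new_prod - s->overwrite_pos > RB_MASK) {
		struct rec *old = rb_find(s, s->overwrite_pos);

		assert(old);
		s->overwrite_pos += old->len;
	}

	s->recs[s->nr] = (struct rec){ s->producer_pos, len, true };
	s->producer_pos = new_prod;
	return s->nr++;
}

/* Submit a record and move pending_pos past every leading committed record. */
static void rb_commit(struct rb_sim *s, int idx)
{
	s->recs[idx].busy = false;
	while (s->pending_pos < s->producer_pos) {
		struct rec *r = rb_find(s, s->pending_pos);

		if (!r || r->busy)
			break;
		s->pending_pos += r->len;
	}
}
```

Replaying steps 2 to 7 with this model (reserve 512, 1024 and 2048 bytes, submit the first two, then reserve 1536 bytes) leaves producer_pos at 5120 and overwrite_pos at 1536. A further 1024-byte reservation is refused while C is still busy, and submitting C and D moves pending_pos to 5120.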
>
> The performance data for overwrite mode will be provided in a follow-up
> patch that adds overwrite-mode benchmarks.
>
> A sample of performance data for non-overwrite mode, collected on an x86_64
> CPU and an arm64 CPU, before and after this patch, is shown below. As we can
> see, no obvious performance regression occurs.
>
> - x86_64 (AMD EPYC 9654)
>
> Before:
>
> Ringbuf, multi-producer contention
> ==================================
> rb-libbpf nr_prod 1 11.623 ± 0.027M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 2 15.812 ± 0.014M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 3 7.871 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 4 6.703 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 8 2.896 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 12 2.054 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 16 1.864 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 20 1.580 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 24 1.484 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 28 1.369 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 32 1.316 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 36 1.272 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 40 1.239 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 44 1.226 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 48 1.213 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 52 1.193 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>
> After:
>
> Ringbuf, multi-producer contention
> ==================================
> rb-libbpf nr_prod 1 11.845 ± 0.036M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 2 15.889 ± 0.006M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 3 8.155 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 4 6.708 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 8 2.918 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 12 2.065 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 16 1.870 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 20 1.582 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 24 1.482 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 28 1.372 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 32 1.323 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 36 1.264 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 40 1.236 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 44 1.209 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 48 1.189 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 52 1.165 ± 0.002M/s (drops 0.000 ± 0.000M/s)
>
> - arm64 (HiSilicon Kunpeng 920)
>
> Before:
>
> Ringbuf, multi-producer contention
> ==================================
> rb-libbpf nr_prod 1 11.310 ± 0.623M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 2 9.947 ± 0.004M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 3 6.634 ± 0.011M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 4 4.502 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 8 3.888 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 12 3.372 ± 0.005M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 16 3.189 ± 0.010M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 20 2.998 ± 0.006M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 24 3.086 ± 0.018M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 28 2.845 ± 0.004M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 32 2.815 ± 0.008M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 36 2.771 ± 0.009M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 40 2.814 ± 0.011M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 44 2.752 ± 0.006M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 48 2.695 ± 0.006M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 52 2.710 ± 0.006M/s (drops 0.000 ± 0.000M/s)
>
> After:
>
> Ringbuf, multi-producer contention
> ==================================
> rb-libbpf nr_prod 1 11.283 ± 0.550M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 2 9.993 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 3 6.898 ± 0.006M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 4 5.257 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 8 3.830 ± 0.005M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 12 3.528 ± 0.013M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 16 3.265 ± 0.018M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 20 2.990 ± 0.007M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 24 2.929 ± 0.014M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 28 2.898 ± 0.010M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 32 2.818 ± 0.006M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 36 2.789 ± 0.012M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 40 2.770 ± 0.006M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 44 2.651 ± 0.007M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 48 2.669 ± 0.005M/s (drops 0.000 ± 0.000M/s)
> rb-libbpf nr_prod 52 2.695 ± 0.009M/s (drops 0.000 ± 0.000M/s)
>
> Signed-off-by: Xu Kuohai <xukuohai@...wei.com>
> ---
> include/uapi/linux/bpf.h | 4 ++
> kernel/bpf/ringbuf.c | 109 +++++++++++++++++++++++++++------
> tools/include/uapi/linux/bpf.h | 4 ++
> 3 files changed, 98 insertions(+), 19 deletions(-)
>
[...]
> @@ -72,6 +73,8 @@ struct bpf_ringbuf {
> */
> unsigned long consumer_pos __aligned(PAGE_SIZE);
> unsigned long producer_pos __aligned(PAGE_SIZE);
> + /* points to the record right after the last overwritten one */
> + unsigned long overwrite_pos;

I moved this after pending_pos. All of these fields are exposed to user
space, so I didn't want to shift pending_pos unnecessarily.

> unsigned long pending_pos;
> char data[] __aligned(PAGE_SIZE);
> };
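
To illustrate why the field order matters for anything that reads the producer page by offset, here is a minimal user-space sketch. It assumes 4 KiB pages and models only the position fields; layout_posted, layout_moved and PROD_PAGE_OFF are hypothetical names, and the leading fields and data[] of the real struct are left out.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096	/* assumption: 4 KiB pages */

/* Layout as posted in v3: overwrite_pos sits before pending_pos. */
struct layout_posted {
	_Alignas(MODEL_PAGE_SIZE) unsigned long consumer_pos;
	_Alignas(MODEL_PAGE_SIZE) unsigned long producer_pos;
	unsigned long overwrite_pos;
	unsigned long pending_pos;
};

/* Layout with overwrite_pos moved after pending_pos. */
struct layout_moved {
	_Alignas(MODEL_PAGE_SIZE) unsigned long consumer_pos;
	_Alignas(MODEL_PAGE_SIZE) unsigned long producer_pos;
	unsigned long pending_pos;
	unsigned long overwrite_pos;
};

/* Offset of a field relative to producer_pos, the start of the producer page. */
#define PROD_PAGE_OFF(type, field) \
	(offsetof(type, field) - offsetof(type, producer_pos))

/* Appending the new field keeps pending_pos right behind producer_pos. */
static_assert(PROD_PAGE_OFF(struct layout_moved, pending_pos) ==
	      sizeof(unsigned long), "pending_pos keeps its old offset");

/* Inserting it in the middle pushes pending_pos one slot further out. */
static_assert(PROD_PAGE_OFF(struct layout_posted, pending_pos) ==
	      2 * sizeof(unsigned long), "pending_pos is shifted");
```

Both assertions are checked at compile time, so the sketch only needs to build to show the difference between the two orderings.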
> @@ -166,7 +169,7 @@ static void bpf_ringbuf_notify(struct irq_work *work)
> * considering that the maximum value of data_sz is (4GB - 1), there
> * will be no overflow, so just note the size limit in the comments.
> */
> -static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node)
> +static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node, bool overwrite_mode)
> {
> struct bpf_ringbuf *rb;
>
> @@ -183,17 +186,25 @@ static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node)
> rb->consumer_pos = 0;
> rb->producer_pos = 0;
> rb->pending_pos = 0;
> + rb->overwrite_mode = overwrite_mode;
>
> return rb;
> }
>
> static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
> {
> + bool overwrite_mode = false;
> struct bpf_ringbuf_map *rb_map;
>
> if (attr->map_flags & ~RINGBUF_CREATE_FLAG_MASK)
> return ERR_PTR(-EINVAL);
>
> + if (attr->map_flags & BPF_F_RB_OVERWRITE) {
> + if (attr->map_type == BPF_MAP_TYPE_USER_RINGBUF)

This seemed error-prone if we ever add another ringbuf type (unlikely,
but still), so I inverted the check to allow BPF_F_RB_OVERWRITE only
for BPF_MAP_TYPE_RINGBUF. We should be as strict as possible by
default.

> + return ERR_PTR(-EINVAL);
> + overwrite_mode = true;
> + }
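
The difference between rejecting one known-bad type and allowing one known-good type can be sketched in user space as follows. The enum values, MODEL_F_RB_OVERWRITE and model_check_overwrite are placeholders invented for illustration; this is not the code that was applied, and the real constants live in the uapi header.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Placeholder values for illustration only. */
enum model_map_type {
	MODEL_MAP_TYPE_RINGBUF = 1,
	MODEL_MAP_TYPE_USER_RINGBUF = 2,
};
#define MODEL_F_RB_OVERWRITE (1U << 0)

/*
 * Allow-list form: the overwrite flag is accepted for exactly one map type,
 * so any type added later is rejected until it is explicitly allowed.
 */
static int model_check_overwrite(enum model_map_type type, unsigned int flags,
				 bool *overwrite_mode)
{
	assert(overwrite_mode);
	*overwrite_mode = false;

	if (flags & MODEL_F_RB_OVERWRITE) {
		if (type != MODEL_MAP_TYPE_RINGBUF)
			return -EINVAL;
		*overwrite_mode = true;
	}
	return 0;
}
```

With the deny-list form from the posted patch, a hypothetical third ring buffer type would silently accept the flag. With the allow-list form it gets -EINVAL.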
> +
> if (attr->key_size || attr->value_size ||
> !is_power_of_2(attr->max_entries) ||
> !PAGE_ALIGNED(attr->max_entries))
> @@ -205,7 +216,7 @@ static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
>
> bpf_map_init_from_attr(&rb_map->map, attr);
>
> - rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node);
> + rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node, overwrite_mode);
> if (!rb_map->rb) {
> bpf_map_area_free(rb_map);
> return ERR_PTR(-ENOMEM);
> @@ -293,13 +304,25 @@ static int ringbuf_map_mmap_user(struct bpf_map *map, struct vm_area_struct *vma
> return remap_vmalloc_range(vma, rb_map->rb, vma->vm_pgoff + RINGBUF_PGOFF);
> }
>
> +/* Return an estimate of the available data in the ring buffer.

Fixed up the comment style.

[...]
> static u32 ringbuf_total_data_sz(const struct bpf_ringbuf *rb)
> @@ -402,11 +425,41 @@ bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr)
> return (void*)((addr & PAGE_MASK) - off);
> }
>
> +static bool bpf_ringbuf_has_space(const struct bpf_ringbuf *rb,
> + unsigned long new_prod_pos,
> + unsigned long cons_pos,
> + unsigned long pend_pos)
> +{
> + /* no space if oldest not yet committed record until the newest
> + * record span more than (ringbuf_size - 1).
> + */

Same here: keep in mind that we now use the kernel-wide comment style,
with the opening /* on its own line. Fixed up all other places as well.

> + if (new_prod_pos - pend_pos > rb->mask)
> + return false;
> +
> + /* ok, we have space in overwrite mode */
> + if (unlikely(rb->overwrite_mode))
> + return true;
> +
> + /* no space if producer position advances more than (ringbuf_size - 1)
> + * ahead of consumer position when not in overwrite mode.
> + */
> + if (new_prod_pos - cons_pos > rb->mask)
> + return false;
> +
> + return true;
> +}
> +
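
The check can be exercised in user space with the numbers from steps 7 and 8 of the commit message. A minimal sketch, assuming a 4096-byte buffer (mask = 4095) and record sizes taken as totals; rb_model, rb_model_init and rb_model_has_space are hypothetical stand-ins that copy the logic of the quoted function, not the kernel code itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model holding only the fields the check reads. */
struct rb_model {
	unsigned long mask;	/* ring size - 1 */
	bool overwrite_mode;
};

static struct rb_model rb_model_init(unsigned long size, bool overwrite_mode)
{
	/* The ring size must be a power of two for the mask to work. */
	assert(size && (size & (size - 1)) == 0);
	return (struct rb_model){ .mask = size - 1,
				  .overwrite_mode = overwrite_mode };
}

static bool rb_model_has_space(const struct rb_model *rb,
			       unsigned long new_prod_pos,
			       unsigned long cons_pos,
			       unsigned long pend_pos)
{
	/* The new record must not reach back to an uncommitted record. */
	if (new_prod_pos - pend_pos > rb->mask)
		return false;

	/* In overwrite mode the consumer position is not a limit. */
	if (rb->overwrite_mode)
		return true;

	/* Otherwise the producer may not get a full ring ahead of the consumer. */
	if (new_prod_pos - cons_pos > rb->mask)
		return false;

	return true;
}
```

For event D, rb_model_has_space(&rb, 5120, 0, 1536) is true in overwrite mode because 5120 - 1536 = 3584 fits within the mask. For event E, rb_model_has_space(&rb, 6144, 0, 1536) is false because 6144 - 1536 = 4608 exceeds it. With overwrite mode off, D is refused as well, since 5120 - 0 is larger than 4095.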
[...]