Message-ID: <d6a2da12-6442-4a8e-a5dc-6f8af5a5178c@kernel.org>
Date: Wed, 11 Sep 2024 10:32:56 +0200
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Daniel Xu <dxu@...uu.xyz>, Andrii Nakryiko <andrii.nakryiko@...il.com>
Cc: Alexei Starovoitov <alexei.starovoitov@...il.com>,
Eduard Zingerman <eddyz87@...il.com>, Andrii Nakryiko <andrii@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, Alexei Starovoitov <ast@...nel.org>,
Shuah Khan <shuah@...nel.org>, John Fastabend <john.fastabend@...il.com>,
Martin KaFai Lau <martin.lau@...ux.dev>, Song Liu <song@...nel.org>,
Yonghong Song <yonghong.song@...ux.dev>, KP Singh <kpsingh@...nel.org>,
Stanislav Fomichev <sdf@...ichev.me>, Hao Luo <haoluo@...gle.com>,
Jiri Olsa <jolsa@...nel.org>, Mykola Lysenko <mykolal@...com>,
LKML <linux-kernel@...r.kernel.org>,
"bpf@...r.kernel.org" <bpf@...r.kernel.org>,
"open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest@...r.kernel.org>,
Kernel Team <kernel-team@...a.com>
Subject: Re: [PATCH bpf-next] bpf: ringbuf: Support consuming
BPF_MAP_TYPE_RINGBUF from prog
On 11/09/2024 06.43, Daniel Xu wrote:
> [cc Jesper]
>
> On Tue, Sep 10, 2024, at 8:31 PM, Daniel Xu wrote:
>> On Tue, Sep 10, 2024 at 05:39:55PM GMT, Andrii Nakryiko wrote:
>>> On Tue, Sep 10, 2024 at 4:44 PM Daniel Xu <dxu@...uu.xyz> wrote:
>>>>
>>>> On Tue, Sep 10, 2024 at 03:21:04PM GMT, Andrii Nakryiko wrote:
>>>>> On Tue, Sep 10, 2024 at 3:16 PM Daniel Xu <dxu@...uu.xyz> wrote:
>>>>>>
[...cut...]
>>> Can you give us a bit more details on what
>>> you are trying to achieve?
>>
>> BPF cpumap, under the hood, has one MPSC ring buffer (ptr_ring) for each
>> entry in the cpumap. When a prog redirects to an entry in the cpumap,
>> the machinery queues up the xdp frame onto the destination CPU ptr_ring.
>> This can occur on any CPU, thus multi-producer. On the processing side,
>> there is only the kthread created by the cpumap entry and bound to the
>> specific CPU that is consuming entries. So single consumer.
>>
An important detail: to get multi-producer (MP) to scale, the CPUMAP does
bulk enqueue into the ptr_ring. It stores the xdp_frames in a per-CPU
array and does the flush/enqueue as part of xdp_do_flush(). Because
I was afraid of this adding latency, I chose to also flush every 8
frames (CPU_MAP_BULK_SIZE).
Looking at the code, I see this is also explained in a comment:
/* General idea: XDP packets getting XDP redirected to another CPU,
* will maximum be stored/queued for one driver ->poll() call. It is
* guaranteed that queueing the frame and the flush operation happen on
* same CPU. Thus, cpu_map_flush operation can deduct via this_cpu_ptr()
* which queue in bpf_cpu_map_entry contains packets.
*/
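
For reference, the per-CPU staging structure in kernel/bpf/cpumap.c
looks roughly like this (paraphrased from the kernel source; the exact
layout can differ between kernel versions):

#define CPU_MAP_BULK_SIZE 8  /* flush threshold, one cacheline worth */

struct xdp_bulk_queue {
	void *q[CPU_MAP_BULK_SIZE];     /* staged xdp_frame pointers */
	struct list_head flush_node;    /* linked into per-CPU flush list */
	struct bpf_cpu_map_entry *obj;  /* owning cpumap entry */
	unsigned int count;             /* number of staged frames */
};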
>> The goal is to track the latency overhead added by the ptr_ring and
>> the kthread (versus softirq, where there is less overhead). Ideally we
>> want p50, p90, p95, and p99 percentiles.
>>
I'm very interested in this use case of understanding the latency of
CPUMAP.
I'm a fan of latency histograms, which I turn into heatmaps in grafana.
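
For completeness, the building block I use for such histograms is a
log2-bucketed per-CPU array. A minimal sketch (the map name and bucket
count are my own, not from the patch):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define MAX_SLOTS 36  /* log2 buckets, covers ~1ns up to ~64s */

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, MAX_SLOTS);
	__type(key, __u32);
	__type(value, __u64);
} latency_hist SEC(".maps");

static __always_inline void record_latency_ns(__u64 delta_ns)
{
	__u32 slot = 0;
	__u64 *cnt;

	/* slot n counts deltas in [2^n, 2^(n+1)) ns; slot 0 is [0, 2) */
	while (slot < MAX_SLOTS - 1 && delta_ns >= (2ULL << slot))
		slot++;

	cnt = bpf_map_lookup_elem(&latency_hist, &slot);
	if (cnt)
		(*cnt)++;
}

char LICENSE[] SEC("license") = "GPL";

Userspace can then dump the map periodically and feed the buckets into
grafana as a heatmap.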
>> To do this, we need to track every single entry's enqueue time as well
>> as its dequeue time - events that occur in the tail of the distribution
>> are quite important.
>>
>> Since ptr_ring is also a ring buffer, I thought it would be easy,
>> reliable, and fast to just create a "shadow" ring buffer. Every time the
>> producer enqueues entries, I'd enqueue the same number of current
>> timestamps onto the shadow RB. Same thing on the consumer side, except
>> dequeue and calculate the timestamp delta.
>>
This idea seems like overkill and will likely produce unreliable
results; e.g. the overhead of the additional ring buffer will itself
affect the measurements.
>> I was originally planning on writing my own lockless ring buffer in pure
>> BPF (b/c spinlocks cannot be used w/ tracepoints yet) but was hoping I
>> could avoid that with this patch.
>
> [...]
>
> Alternatively, could add a u64 timestamp to xdp_frame, which makes all
> this tracking inline (and thus more reliable). But I'm not sure how precious
> the space in that struct is - I see some references online saying most drivers
> save 128B headroom. I also see:
>
> #define XDP_PACKET_HEADROOM 256
>
I like the inline idea. I would suggest adding a u64 timestamp into the
XDP-metadata area (ctx->data_meta, code example[1]) when XDP runs in
RX-NAPI. Then, on the remote CPU, you can run another CPUMAP-XDP
program that picks up this timestamp and calculates a delta from a
"now" timestamp.
[1]
https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L62-L77
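
A minimal sketch of that idea (program/map names are mine, and the CPU
index 0 is a placeholder; assumes the driver supports XDP metadata, and
reuses the includes plus record_latency_ns() from the histogram sketch
above):

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

SEC("xdp")
int xdp_rx_stamp(struct xdp_md *ctx)
{
	__u64 *ts;

	/* Grow the metadata area by 8 bytes (negative delta grows it) */
	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*ts)))
		return XDP_PASS;  /* driver lacks metadata support */

	ts = (void *)(long)ctx->data_meta;
	if ((void *)(ts + 1) > (void *)(long)ctx->data)
		return XDP_PASS;  /* bounds check for the verifier */

	*ts = bpf_ktime_get_ns();
	return bpf_redirect_map(&cpu_map, 0, 0);
}

/* Runs on the remote CPU; the metadata travels with the xdp_frame */
SEC("xdp/cpumap")
int xdp_remote_delta(struct xdp_md *ctx)
{
	__u64 *ts = (void *)(long)ctx->data_meta;

	if ((void *)(ts + 1) <= (void *)(long)ctx->data)
		record_latency_ns(bpf_ktime_get_ns() - *ts);
	return XDP_PASS;
}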
> Could probably amortize the timestamp read by setting it in
> bq_flush_to_queue().
To amortize, consider that you might not need to timestamp EVERY packet
to get sufficient statistics on the latency.
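
E.g. a cheap way to sample (the 1/256 rate is an arbitrary choice of
mine):

static __always_inline bool should_sample(void)
{
	/* timestamp roughly 1 in 256 packets */
	return (bpf_get_prandom_u32() & 0xff) == 0;
}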
Regarding bq_flush_to_queue() and the enqueue tracepoint:
trace_xdp_cpumap_enqueue(rcpu->map_id, processed, drops, to_cpu)
I have an idea for you on how to measure the latency overhead from XDP
RX-processing to when the enqueue "flush" happens. It is a little tricky
to explain, so I will outline the steps.
1. The XDP bpf_prog stores a timestamp in a per-CPU array,
   unless the timestamp is already set.
2. A bpf_prog attached to trace_xdp_cpumap_enqueue reads the per-CPU
   timestamp, calculates the latency diff, and clears the timestamp.
This measures the latency overhead of the bulk enqueue. (Notice: only
the first XDP-redirected frame after a bq_flush_to_queue() will set the
timestamp.) This per-CPU store should work, as this all runs under the
same RX-NAPI "poll" execution.
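
A minimal sketch of steps 1-2 (map/program names are mine; needs
<bpf/bpf_tracing.h> on top of the earlier includes, and reuses
record_latency_ns() from the histogram sketch):

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} first_enq_ts SEC(".maps");

/* Step 1: call from the XDP prog before redirecting into the cpumap */
static __always_inline void mark_first_redirect(void)
{
	__u32 key = 0;
	__u64 *ts = bpf_map_lookup_elem(&first_enq_ts, &key);

	if (ts && *ts == 0)  /* only the first frame since last flush */
		*ts = bpf_ktime_get_ns();
}

/* Step 2: measure and clear when the bulk enqueue gets flushed */
SEC("tp_btf/xdp_cpumap_enqueue")
int BPF_PROG(on_cpumap_enqueue, int map_id, unsigned int processed,
	     unsigned int drops, int to_cpu)
{
	__u32 key = 0;
	__u64 *ts = bpf_map_lookup_elem(&first_enq_ts, &key);

	if (ts && *ts) {
		record_latency_ns(bpf_ktime_get_ns() - *ts);
		*ts = 0;
	}
	return 0;
}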
This latency overhead of bulk enqueue will (unfortunately) also
count/measure the XDP_PASS packets that get processed by the normal
netstack. So, watch out for this; e.g. you could keep counters of XDP
actions (e.g. XDP_PASS) as part of step 1, and have statistics for the
cases where XDP_PASS interfered.
--Jesper