Message-Id: <20191113204737.31623-1-bjorn.topel@gmail.com>
Date: Wed, 13 Nov 2019 21:47:33 +0100
From: Björn Töpel <bjorn.topel@...il.com>
To: netdev@...r.kernel.org, ast@...nel.org, daniel@...earbox.net
Cc: Björn Töpel <bjorn.topel@...il.com>,
bpf@...r.kernel.org, magnus.karlsson@...il.com,
magnus.karlsson@...el.com, jonathan.lemon@...il.com
Subject: [RFC PATCH bpf-next 0/4] Introduce xdp_call.h and the BPF dispatcher
This RFC(!) introduces the BPF dispatcher and xdp_call.h, a mechanism
to avoid the retpoline overhead by text-poking/rewriting indirect
calls into direct calls.
The ideas build on Alexei's V3 of the BPF trampoline work, namely:
* Use the existing BPF JIT infrastructure to generate code
* Use bpf_arch_text_poke() to modify the kernel text
To try the series out, you'll need V3 of the BPF trampoline work [1].
The main idea: instead of making an indirect call, each XDP call-site
calls into the JITed dispatch table, which in turn calls the XDP
programs directly. In pseudo code, this is something similar to:
unsigned int do_call(struct bpf_prog *prog, struct xdp_buff *xdp)
{
	if (prog == PROG1)
		return call_direct_PROG1(xdp);
	if (prog == PROG2)
		return call_direct_PROG2(xdp);
	return indirect_call(prog, xdp);
}
The current dispatcher supports four entries. It could support more,
but I don't know if it's really practical (...and I was lazy -- more
than 4 entries meant moving to >1B Jcc. :-P). The dispatcher is
re-generated for each new XDP program/entry. The upper limit of four
in this series means that if six i40e netdevs have an XDP program
running, the fifth and sixth will be using indirect calls.
Now to the performance numbers. I ran this on my 3 GHz Skylake, with
64B UDP packets sent to the i40e at ~40 Mpps.
Benchmark:
# ./xdp_rxq_info --dev enp134s0f0 --action XDP_DROP
1. Baseline: 26.0 Mpps
2. Dispatcher, 1 entry: 35.5 Mpps (+36.5%)
3. Dispatcher, 4 entries: 32.9 Mpps (+26.5%)
4. Dispatcher, 5 entries: 24.2 Mpps (-6.9%)
In scenario 4 the benchmark uses the dispatcher, but the table is
full, so the caller pays for the dispatching *and* the retpoline.
Is this a good idea? The performance is nice! Can it be done in a
better way? Useful for other BPF programs? I would love your input!
Thanks!
Björn
[1] https://patchwork.ozlabs.org/cover/1191672/
Björn Töpel (4):
bpf: teach bpf_arch_text_poke() jumps
bpf: introduce BPF dispatcher
xdp: introduce xdp_call
i40e: start using xdp_call.h
arch/x86/net/bpf_jit_comp.c | 130 ++++++++++++-
drivers/net/ethernet/intel/i40e/i40e_main.c | 5 +
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 5 +-
drivers/net/ethernet/intel/i40e/i40e_xsk.c | 5 +-
include/linux/bpf.h | 3 +
include/linux/xdp_call.h | 49 +++++
kernel/bpf/Makefile | 1 +
kernel/bpf/dispatcher.c | 197 ++++++++++++++++++++
8 files changed, 388 insertions(+), 7 deletions(-)
create mode 100644 include/linux/xdp_call.h
create mode 100644 kernel/bpf/dispatcher.c
--
2.20.1