[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20221104032532.1615099-1-sdf@google.com>
Date: Thu, 3 Nov 2022 20:25:18 -0700
From: Stanislav Fomichev <sdf@...gle.com>
To: bpf@...r.kernel.org
Cc: ast@...nel.org, daniel@...earbox.net, andrii@...nel.org,
martin.lau@...ux.dev, song@...nel.org, yhs@...com,
john.fastabend@...il.com, kpsingh@...nel.org, sdf@...gle.com,
haoluo@...gle.com, jolsa@...nel.org,
David Ahern <dsahern@...il.com>,
Jakub Kicinski <kuba@...nel.org>,
Willem de Bruijn <willemb@...gle.com>,
Jesper Dangaard Brouer <brouer@...hat.com>,
Anatoly Burakov <anatoly.burakov@...el.com>,
Alexander Lobakin <alexandr.lobakin@...el.com>,
Magnus Karlsson <magnus.karlsson@...il.com>,
Maryam Tahhan <mtahhan@...hat.com>, xdp-hints@...-project.net,
netdev@...r.kernel.org
Subject: [RFC bpf-next v2 00/14] xdp: hints via kfuncs
Changes since the original RFC:
- 'struct bpf_patch' to more easily manage insn arrays
While working on longer unrolled functions I've bumped into verifier's
insn_buf[16]. Instead of increasing it, I've added a simple
abstraction that handles resizing.
*insn++ = BPF_XXX_XXX();
*insn++ = BPF_YYY_YYY();
vs
bpf_patch_append(patch,
BPF_XXX_XXX(),
BPF_YYY_YYY(),
);
There are also some tricks where we resolve BPF_JMP_IMM(op, dst, imm,
S16_MAX) to the real end of the program; mostly to make it easy to generate
the following:
if (some_condition) {
if (some_other_condition) {
// insns
}
}
- Drop xdp_buff->priv in favor of container_of (Jakub)
The drivers might need to maintain more data, so instead of
constraining them to a single ->priv pointer in xdp_buff, use container_of
and require the users to define the outer struct. xdp_buff should
be the first member.
Each driver now has two patches: the first one introduces new struct
wrapper; the second one does the actual work.
- bpf_xdp_metadata_export_to_skb (Toke)
To solve the case where we're constructing skb from a redirected
frame. bpf_xdp_metadata_export_to_skb is an unrolled kfunc
that prepares 'struct xdp_to_skb_metadata' in the metadata.
The layout is randomized and it has a randomized magic number
to make sure userspace doesn't depend on it. I'm not sure how
strict we should be here? Toke/Jesper seem to be ok with
userspace depending on this but as long as they read the
layout via BTF, so maybe having a fixed magic is ok/better her?
Later, at skb_metadata_set time, we call into
skb_metadata_import_from_xdp that tries to parse this fixed format
and extract relevant metadata.
Note that because we are unrolling bpf_xdp_metadata_export_to_skb,
we have to constrain other kfuncs to R2-R5 only; I've added the
reasoning to the comments. It might be better idea to do a real
kfunc call (but we have to generate this kfunc anyway)?
- helper to make it easier to call kernel code from unrolled kfuncs
Since we're unrolling, we're running in a somewhat constrained
environment. R6-R9 belong to the real callee, so we can't use them
to stash our R1-R5. We also can't use stack in any way. Another
constraint comes from bpf_xdp_metadata_export_to_skb which
might call several metadata kfuncs and wants its R1 to be preserved.
Thus, we add xdp_kfunc_call_preserving_r1, which generates the
bytecode to preserve r1 somewhere in the user-supplied context.
Again, we have to do this because we unroll bpf_xdp_metadata_export_to_skb.
- mlx4/bnxt/ice examples (Jesper)
ice is the only one where it seems feasible to unroll. The code is
untested, but at least it shows that it's somewhat possible to
get to the timestamp in our R2-R5-constrained environment.
mlx4/bnxt do spinlocks/seqlocks, so I'm just calling into the kernel
code.
- Unroll default implementation to return false/0/NULL
Original RFC left default kfuncs calls when the driver doesn't
do the unrolling. Here, instead, we unroll to single 'return 0'
instruction.
- Drop prog_ifindex libbpf patch, use bpf_program__set_ifindex instead (Andrii)
- Keep returning metadata by value instead of using a pointer
I've tried using the pointer, it works currently, but it requires extra
argument per commit eb1f7f71c126 ("bpf/verifier: allow kfunc to return
an allocated mem"). The requirement can probably be lifted for our
case, but not sure it's necessary for now.
While adding rx_timestamp support for the drivers, it turns out
we never really return the raw pointer to the descriptor field. We read
the descriptor field, do shifts/multiplies, convert to kernel clock,
etc/etc. So returning a value instead of a pointer seems more logical,
at least for the rx timestamp case. For the other types of metadata,
we might reconsider.
- rename bpf_xdp_metadata_have_<xxx> to bpf_xdp_metadata_<xxx>_supported
Spotted in one of Toke's emails. Seems like it better conveys
the intent that it actually tests that the device supports the
metadata, not that the particular packet has the metadata.
The following is unchanged since the original RFC:
This is an RFC for the alternative approach suggested by Martin and
Jakub. I've tried to CC most of the people from the original discussion,
feel free to add more if you think I've missed somebody.
Summary:
- add new BPF_F_XDP_HAS_METADATA program flag and abuse
attr->prog_ifindex to pass target device ifindex at load time
- at load time, find appropriate ndo_unroll_kfunc and call
it to unroll/inline kfuncs; kfuncs have the default "safe"
implementation if unrolling is not supported by a particular
device
- rewrite xskxceiver test to use C bpf program and extend
it to export rx_timestamp (plus add rx timestamp to veth driver)
I've intentionally kept it small and hacky to see whether the approach is
workable or not.
Pros:
- we avoid BTF complexity; the BPF programs themselves are now responsible
for agreeing on the metadata layout with the AF_XDP consumer
- the metadata is free if not used
- the metadata should, in theory, be cheap if used; kfuncs should be
unrolled to the same code as if the metadata was pre-populated and
passed with a BTF id
- it's not all or nothing; users can use small subset of metadata which
is more efficient than the BTF id approach where all metadata has to be
exposed for every frame (and selectively consumed by the users)
Cons:
- forwarding has to be handled explicitly; the BPF programs have to
agree on the metadata layout (IOW, the forwarding program
has to be aware of the final AF_XDP consumer metadata layout)
- TX picture is not clear; but it's not clear with BTF ids as well;
I think we've agreed that just reusing whatever we have at RX
won't fly at TX; seems like TX XDP program might be the answer
here? (with a set of another tx kfuncs to "expose" bpf/af_xdp metatata
back into the kernel)
Cc: John Fastabend <john.fastabend@...il.com>
Cc: David Ahern <dsahern@...il.com>
Cc: Martin KaFai Lau <martin.lau@...ux.dev>
Cc: Jakub Kicinski <kuba@...nel.org>
Cc: Willem de Bruijn <willemb@...gle.com>
Cc: Jesper Dangaard Brouer <brouer@...hat.com>
Cc: Anatoly Burakov <anatoly.burakov@...el.com>
Cc: Alexander Lobakin <alexandr.lobakin@...el.com>
Cc: Magnus Karlsson <magnus.karlsson@...il.com>
Cc: Maryam Tahhan <mtahhan@...hat.com>
Cc: xdp-hints@...-project.net
Cc: netdev@...r.kernel.org
Stanislav Fomichev (14):
bpf: Introduce bpf_patch
bpf: Support inlined/unrolled kfuncs for xdp metadata
veth: Introduce veth_xdp_buff wrapper for xdp_buff
veth: Support rx timestamp metadata for xdp
selftests/bpf: Verify xdp_metadata xdp->af_xdp path
xdp: Carry over xdp metadata into skb context
selftests/bpf: Verify xdp_metadata xdp->skb path
bpf: Helper to simplify calling kernel routines from unrolled kfuncs
ice: Introduce ice_xdp_buff wrapper for xdp_buff
ice: Support rx timestamp metadata for xdp
mlx4: Introduce mlx4_xdp_buff wrapper for xdp_buff
mxl4: Support rx timestamp metadata for xdp
bnxt: Introduce bnxt_xdp_buff wrapper for xdp_buff
bnxt: Support rx timestamp metadata for xdp
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 73 +++-
drivers/net/ethernet/intel/ice/ice.h | 5 +
drivers/net/ethernet/intel/ice/ice_main.c | 1 +
drivers/net/ethernet/intel/ice/ice_txrx.c | 105 ++++-
.../net/ethernet/mellanox/mlx4/en_netdev.c | 1 +
drivers/net/ethernet/mellanox/mlx4/en_rx.c | 66 ++-
drivers/net/veth.c | 89 ++--
include/linux/bpf.h | 1 +
include/linux/bpf_patch.h | 29 ++
include/linux/btf.h | 1 +
include/linux/btf_ids.h | 4 +
include/linux/mlx4/device.h | 7 +
include/linux/netdevice.h | 5 +
include/linux/skbuff.h | 4 +
include/net/xdp.h | 41 ++
include/uapi/linux/bpf.h | 5 +
kernel/bpf/Makefile | 2 +-
kernel/bpf/bpf_patch.c | 81 ++++
kernel/bpf/syscall.c | 28 +-
kernel/bpf/verifier.c | 85 ++++
net/core/dev.c | 7 +
net/core/skbuff.c | 25 ++
net/core/xdp.c | 165 +++++++-
tools/include/uapi/linux/bpf.h | 5 +
tools/testing/selftests/bpf/Makefile | 2 +-
.../selftests/bpf/prog_tests/xdp_metadata.c | 394 ++++++++++++++++++
.../selftests/bpf/progs/xdp_metadata.c | 78 ++++
27 files changed, 1244 insertions(+), 65 deletions(-)
create mode 100644 include/linux/bpf_patch.h
create mode 100644 kernel/bpf/bpf_patch.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
create mode 100644 tools/testing/selftests/bpf/progs/xdp_metadata.c
--
2.38.1.431.g37b22c650d-goog
Powered by blists - more mailing lists