Message-ID: <CAEf4BzYo4um9WHCgJvNVwLPVS-vmEORPgKBBKN-Gmbe=fVgS+Q@mail.gmail.com>
Date: Wed, 5 Oct 2022 17:22:27 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Daniel Borkmann <daniel@...earbox.net>
Cc: bpf@...r.kernel.org, razor@...ckwall.org, ast@...nel.org,
andrii@...nel.org, martin.lau@...ux.dev, john.fastabend@...il.com,
joannelkoong@...il.com, memxor@...il.com, toke@...hat.com,
joe@...ium.io, netdev@...r.kernel.org
Subject: Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc
BPF programs
On Tue, Oct 4, 2022 at 4:12 PM Daniel Borkmann <daniel@...earbox.net> wrote:
>
> This work refactors and adds a lightweight extension to the tc BPF ingress
> and egress data path side to allow attaching BPF programs via an fd-based
> attach / detach API. The main goal behind this work, which we also presented
> at LPC [0] this year, is to eventually add support for BPF links for tc BPF
> programs in a second step; this prep work is thus required for the latter,
> which allows for a model of safe ownership and program detachment. Given the
> vast rise in tc BPF users in cloud native / Kubernetes environments, this
> becomes necessary to avoid hard-to-debug incidents caused either by stale
> leftover programs or by 3rd party applications stepping on each other's toes.
> Further details on the BPF link rationale are in the next patch.
>
> For the current tc framework, there is no change in behavior, and this patch
> does not touch tc core kernel APIs. The gist of this patch is that the
> ingress and egress hooks get a lightweight, qdisc-less extension for BPF to
> attach its tc BPF programs, in other words, a minimal tc-layer entry point
> for BPF. As part of the feedback from LPC, there was a suggestion to provide
> a name for this infrastructure to more easily differentiate between the
> classic cls_bpf attachment and the fd-based API. For most, the XDP vs tc
> layer split is already the default mental model for the pkt processing
> pipeline. We refactored this with an internal xtc prefix, aka 'express
> traffic control', in order to avoid deviating too far from that model
> ('express' given its more lightweight/faster entry point).
>
> For the ingress and egress xtc points, the device holds a cache-friendly
> array of programs. Same as with classic tc, programs are attached with a prio
> that can be specified or auto-allocated through an idr, and the program
> return code determines whether to continue in the pipeline or to terminate
> processing. With the TC_ACT_UNSPEC code, processing continues (as is the case
> today). The goal was maximum compatibility with existing tc BPF programs, so
> they don't need to be adapted. Compatibility to call into classic
> tcf_classify() is also provided in order to allow successive migration, or
> for both to cleanly co-exist where needed, given it is one logical layer. The
> fd-based API is behind a static key, so that when unused the code is not
> entered. The struct xtc_entry's program array is currently static, but could
> be made dynamic if necessary at some point in the future. Desire has also
> been expressed for future work to adapt a similar framework for XDP to allow
> multi-attach from the in-kernel side, too.
>
> Tested with the tc-testing selftest suite, which fully passes, as well as
> with the tc BPF tests from the BPF CI.
>
> [0] https://lpc.events/event/16/contributions/1353/
>
> Co-developed-by: Nikolay Aleksandrov <razor@...ckwall.org>
> Signed-off-by: Nikolay Aleksandrov <razor@...ckwall.org>
> Signed-off-by: Daniel Borkmann <daniel@...earbox.net>
> ---
>  MAINTAINERS                    |   4 +-
>  include/linux/bpf.h            |   1 +
>  include/linux/netdevice.h      |  14 +-
>  include/linux/skbuff.h         |   4 +-
>  include/net/sch_generic.h      |   2 +-
>  include/net/xtc.h              | 181 ++++++++++++++++++++++
>  include/uapi/linux/bpf.h       |  35 ++++-
>  kernel/bpf/Kconfig             |   1 +
>  kernel/bpf/Makefile            |   1 +
>  kernel/bpf/net.c               | 274 +++++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c           |  24 ++-
>  net/Kconfig                    |   5 +
>  net/core/dev.c                 | 262 +++++++++++++++++++------------
>  net/core/filter.c              |   4 +-
>  net/sched/Kconfig              |   4 +-
>  net/sched/sch_ingress.c        |  48 +++++-
>  tools/include/uapi/linux/bpf.h |  35 ++++-
>  17 files changed, 769 insertions(+), 130 deletions(-)
> create mode 100644 include/net/xtc.h
> create mode 100644 kernel/bpf/net.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index e55a4d47324c..bb63d8d000ea 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3850,13 +3850,15 @@ S:	Maintained
>  F:	kernel/trace/bpf_trace.c
>  F:	kernel/bpf/stackmap.c
>
> -BPF [NETWORKING] (tc BPF, sock_addr)
> +BPF [NETWORKING] (xtc & tc BPF, sock_addr)
>  M:	Martin KaFai Lau <martin.lau@...ux.dev>
>  M:	Daniel Borkmann <daniel@...earbox.net>
>  R:	John Fastabend <john.fastabend@...il.com>
>  L:	bpf@...r.kernel.org
>  L:	netdev@...r.kernel.org
>  S:	Maintained
> +F:	include/net/xtc.h
> +F:	kernel/bpf/net.c
>  F:	net/core/filter.c
>  F:	net/sched/act_bpf.c
>  F:	net/sched/cls_bpf.c
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 9e7d46d16032..71e5f43db378 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1473,6 +1473,7 @@ struct bpf_prog_array_item {
>  	union {
>  		struct bpf_cgroup_storage *cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE];
>  		u64 bpf_cookie;
> +		u32 bpf_priority;
So this looks unfortunate and unnecessary. You are basically saying no
BPF cookie for this new TC/XTC/TCX thingy. But there is no need, we
already reserve 2 * 8 bytes for cgroup_storage, so make bpf_cookie and
bpf_priority co-exist:

	union {
		struct bpf_cgroup_storage *cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE];
		struct {
			u64 bpf_cookie;
			u32 bpf_priority;
		};
	};

or is there some problem with that?
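Quick sanity check on the size math (a standalone sketch, not from the
patch; it assumes MAX_BPF_CGROUP_STORAGE_TYPE == 2 and 64-bit pointers),
the pair fits into the 16 bytes already reserved for cgroup_storage:

	#include <stdint.h>

	struct item_layout {
		union {
			/* stand-in for the two cgroup_storage pointers */
			void *cgroup_storage[2];
			struct {
				uint64_t bpf_cookie;
				uint32_t bpf_priority;
			};
		};
	};

	/* the u64+u32 pair pads to 16 bytes, same as the pointer array,
	 * so bpf_prog_array_item would not grow
	 */
	_Static_assert(sizeof(struct item_layout) == 2 * sizeof(void *),
		       "union must not grow");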
>  	};
>  };
>
[...]
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 51b9aa640ad2..de1f5546bcfe 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1025,6 +1025,8 @@ enum bpf_attach_type {
>  	BPF_PERF_EVENT,
>  	BPF_TRACE_KPROBE_MULTI,
>  	BPF_LSM_CGROUP,
> +	BPF_NET_INGRESS,
> +	BPF_NET_EGRESS,
I can bikeshed as well :) Shouldn't these be TC/TCX-specific attach
types? Wouldn't BPF_[X]TC[X]_INGRESS/BPF_[X]TC[X]_EGRESS be more
appropriate? Because when you think about it, XDP is also NET, right,
so I find NET really meaning TC a bit confusing.
>  	__MAX_BPF_ATTACH_TYPE
>  };
>
> @@ -1399,14 +1401,20 @@ union bpf_attr {
>  	};
>
>  	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
> -		__u32		target_fd;	/* container object to attach to */
> +		union {
> +			__u32	target_fd;	/* container object to attach to */
> +			__u32	target_ifindex;	/* target ifindex */
> +		};
this makes total sense (target can be FD or ifindex, we have that in
LINK_CREATE as well)
>  		__u32		attach_bpf_fd;	/* eBPF program to attach */
>  		__u32		attach_type;
>  		__u32		attach_flags;
> -		__u32		replace_bpf_fd;	/* previously attached eBPF
> +		union {
> +			__u32	attach_priority;
> +			__u32	replace_bpf_fd;	/* previously attached eBPF
>  						 * program to replace if
>  						 * BPF_F_REPLACE is used
>  						 */
> +		};
But this union seems 1) unnecessary (we don't have to save those 4
bytes), and 2) wouldn't it make sense to support replace_bpf_fd with
BPF_F_REPLACE (at a given prio, if I understand correctly)? It's an
equivalent situation to what we had in cgroup programs land before we
got bpf_link. So staying consistent makes sense, unless I missed
something?
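To illustrate with a rough sketch (against this series' proposed uapi,
so target_ifindex, attach_priority, and BPF_NET_INGRESS are assumptions
that may still change), replace-at-prio would need both fields at once,
which the union prevents:

	#include <linux/bpf.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* swap the program at a given priority, cgroup-style: old_fd is
	 * the program we expect at that slot, new_fd replaces it
	 */
	static int xtc_replace(int ifindex, int new_fd, int old_fd,
			       unsigned int prio)
	{
		union bpf_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.target_ifindex = ifindex;
		attr.attach_bpf_fd = new_fd;
		attr.attach_type = BPF_NET_INGRESS;
		attr.attach_flags = BPF_F_REPLACE;
		attr.attach_priority = prio;	/* these two must not   */
		attr.replace_bpf_fd = old_fd;	/* alias for this to work */

		return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
	}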
>  	};
>
>  	struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
> @@ -1452,7 +1460,10 @@ union bpf_attr {
>  	} info;
>
>  	struct { /* anonymous struct used by BPF_PROG_QUERY command */
> -		__u32		target_fd;	/* container object to query */
> +		union {
> +			__u32	target_fd;	/* container object to query */
> +			__u32	target_ifindex;	/* target ifindex */
> +		};
>  		__u32		attach_type;
>  		__u32		query_flags;
>  		__u32		attach_flags;
> @@ -6038,6 +6049,19 @@ struct bpf_sock_tuple {
>  	};
>  };
>
> +/* (Simplified) user return codes for tc prog type.
> + * A valid tc program must return one of these defined values. All other
> + * return codes are reserved for future use. Must remain compatible with
> + * their TC_ACT_* counterparts. For compatibility in behavior, unknown
> + * return codes are mapped to TC_NEXT.
> + */
> +enum tc_action_base {
> +	TC_NEXT		= -1,
> +	TC_PASS		= 0,
> +	TC_DROP		= 2,
> +	TC_REDIRECT	= 7,
> +};
> +
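Side note, mostly for readers following along: a minimal tc prog using
these codes could look like the sketch below (compiled against this
series' uapi header, not part of the patch); since the values match
TC_ACT_*, existing programs keep working as-is:

	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	SEC("tc")
	int xtc_sample(struct __sk_buff *skb)
	{
		/* TC_DROP == TC_ACT_SHOT (2), TC_PASS == TC_ACT_OK (0) */
		if (skb->len > 1500)
			return TC_DROP;
		return TC_PASS;
	}

	char LICENSE[] SEC("license") = "GPL";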
>  struct bpf_xdp_sock {
>  	__u32 queue_id;
>  };
> @@ -6804,6 +6828,11 @@ struct bpf_flow_keys {
>  	__be32 flow_label;
>  };
>
> +struct bpf_query_info {
this is something that's returned from the BPF_PROG_QUERY command,
right? Shouldn't it be called bpf_prog_query_info or something like
that? Just "query_info" is very generic, IMO, but if we are sure that
there will never be any other "QUERY" command, I guess it might be
fine.
> +	__u32 prog_id;
> +	__u32 prio;
> +};
> +
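And for completeness, a hedged sketch of the query side (this hunk
doesn't show how the bpf_query_info array is handed back; I'm assuming
here that the query struct's existing prog_ids buffer is reused to
carry it, which may well not match the series):

	#include <linux/bpf.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* fetch the (prog_id, prio) pairs attached at the ingress hook
	 * of an ifindex; *cnt is in/out: buffer size in, entries out
	 */
	static int xtc_query(int ifindex, struct bpf_query_info *infos,
			     unsigned int *cnt)
	{
		union bpf_attr attr;
		int err;

		memset(&attr, 0, sizeof(attr));
		attr.query.target_ifindex = ifindex;
		attr.query.attach_type = BPF_NET_INGRESS;
		/* assumption: prog_ids doubles as the bpf_query_info array */
		attr.query.prog_ids = (__u64)(unsigned long)infos;
		attr.query.prog_cnt = *cnt;

		err = syscall(__NR_bpf, BPF_PROG_QUERY, &attr, sizeof(attr));
		if (!err)
			*cnt = attr.query.prog_cnt;
		return err;
	}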
>  struct bpf_func_info {
>  	__u32 insn_off;
>  	__u32 type_id;
[...]