Message-ID: <CAEf4BzZK7gkQ2B3XguXjs2SALLmV54WXCdGPhkh6VE3s0J-WVg@mail.gmail.com>
Date: Thu, 8 Jun 2023 14:20:00 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Daniel Borkmann <daniel@...earbox.net>
Cc: ast@...nel.org, andrii@...nel.org, martin.lau@...ux.dev,
razor@...ckwall.org, sdf@...gle.com, john.fastabend@...il.com,
kuba@...nel.org, dxu@...uu.xyz, joe@...ium.io, toke@...nel.org,
davem@...emloft.net, bpf@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [PATCH bpf-next v2 2/7] bpf: Add fd-based tcx multi-prog infra
with link support
On Wed, Jun 7, 2023 at 12:27 PM Daniel Borkmann <daniel@...earbox.net> wrote:
>
> This work refactors and adds a lightweight extension ("tcx") to the tc BPF
> ingress and egress data path side for allowing BPF program management based
> on fds via bpf() syscall through the newly added generic multi-prog API.
> The main goal behind this work which we also presented at LPC [0] last year
> and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
> BPF link functionality for tc BPF programs, which allows for a model of safe
> ownership and program detachment.
>
> Given the rise in tc BPF users in cloud native environments, this becomes
> necessary to avoid hard-to-debug incidents either through stale leftover
> programs or 3rd party applications accidentally stepping on each other's toes.
> As a recap, a BPF link represents the attachment of a BPF program to a BPF
> hook point. The BPF link holds a single reference to keep the BPF program
> alive. Moreover, hook points do not reference a BPF link, only the
> application's fd or pinning does. A BPF link holds metadata specific to the
> attachment and implements operations for link creation, (atomic) BPF program
> update, detachment and introspection. The motivation for BPF links for tc
> BPF programs is multi-fold, for example:
>
> - From Meta: "It's especially important for applications that are deployed
> fleet-wide and that don't "control" hosts they are deployed to. If such
> application crashes and no one notices and does anything about that, BPF
> program will keep running draining resources or even just, say, dropping
> packets. We at FB had outages due to such permanent BPF attachment
> semantics. With fd-based BPF link we are getting a framework, which allows
> safe, auto-detachable behavior by default, unless application explicitly
> opts in by pinning the BPF link." [1]
>
> - On the Cilium side, the tc BPF programs we attach to host-facing veth
> devices and physical devices build the core datapath for Kubernetes Pods,
> and they implement forwarding, load-balancing, policy, EDT-management, etc,
> within BPF. Currently there is no concept of 'safe' ownership; e.g., we
> recently experienced hard-to-debug issues in a user's staging environment
> where another Kubernetes application using tc BPF attached to the same
> prio/handle of cls_bpf accidentally wiped all Cilium-based BPF programs
> from underneath it. The goal is to establish a clear/safe ownership model
> via links which cannot accidentally be overridden. [0,2]
>
> BPF links for tc can co-exist with non-link attachments, and the semantics
> are also in line with XDP links: BPF links cannot replace other BPF links,
> BPF links cannot replace non-BPF links, non-BPF links cannot replace BPF
> links, and lastly only non-BPF links can replace non-BPF links. In the case
> of Cilium, this would solve the mentioned issue of a safe ownership model,
> as 3rd party applications would not be able to accidentally wipe Cilium
> programs, even if they are not BPF link aware.
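Side note for context: the "(atomic) BPF program update" operation from the
recap means the link owner can still swap its own program in place through
the link fd, even though outsiders cannot replace the attachment. A minimal
sketch with the existing generic libbpf API, nothing tcx-specific assumed:

    #include <bpf/bpf.h>

    static int swap_prog(int link_fd, int new_prog_fd)
    {
            /* atomically replaces the program behind the link;
             * ownership of the attachment itself is unchanged
             */
            return bpf_link_update(link_fd, new_prog_fd, NULL);
    }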
>
> Earlier attempts [4] tried to integrate BPF links into the core tc machinery
> to solve this for cls_bpf, which was intrusive to the generic tc kernel API,
> with extensions only specific to cls_bpf, and suboptimal/complex since
> cls_bpf could also be wiped from the qdisc. Locking a tc BPF program in
> place this way gets into layering hacks given the two object models are
> vastly different.
>
> We instead implemented the tcx (tc 'express') layer, which is an fd-based tc
> BPF attach API, so that the BPF link implementation blends in naturally,
> similar to other link types which are fd-based, and without the need to
> change core tc internal APIs. BPF programs for tc can then be successively
> migrated from classic cls_bpf to the new tc BPF link without needing to
> change the program's source code; adapting the BPF loader mechanics for
> attaching is sufficient.
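To illustrate how small that loader-side delta could be, a hedged sketch
assuming libbpf gains support for the new attach type as part of this
series (the helper below is made up for illustration):

    #include <bpf/bpf.h>
    #include <net/if.h>

    /* sketch only: the BPF program is unchanged, the loader swaps the
     * classic cls_bpf/netlink attach path for an fd-based link create
     */
    static int attach_tcx_ingress(int prog_fd, const char *ifname)
    {
            int ifindex = if_nametoindex(ifname);

            if (!ifindex)
                    return -1;
            /* ifindex lands in link_create.target_ifindex via the union */
            return bpf_link_create(prog_fd, ifindex, BPF_TCX_INGRESS, NULL);
    }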
>
> For the current tc framework, there is no change in behavior, and this work
> does not touch core tc kernel APIs. The gist of this patch is that the
> ingress and egress hooks gain a lightweight, qdisc-less extension for BPF to
> attach its tc BPF programs, in other words, a minimal entry point for tc
> BPF. The name tcx was suggested in discussions of earlier revisions of this
> work as a good fit, and to more easily differentiate between the classic
> cls_bpf attachment and the fd-based one.
>
> For the ingress and egress tcx points, the device holds a cache-friendly
> array with program pointers which is separated from control plane
> (slow-path) data. Earlier versions of this work used priority to determine
> ordering and to express dependencies, similar to classic tc, but reviewers
> challenged that something more future-proof with a better user experience
> is required. This resulted in the design and development of the generic
> attach/detach/query API for multi-progs. See the prior patch with its
> discussion on the API design. tcx is the first user, and later we plan to
> integrate others as well; for example, one candidate is multi-prog support
> for XDP, which would benefit from it and have the same 'look and feel' from
> an API perspective.
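To make the fast-path/slow-path split concrete, a hypothetical sketch of
the kind of layout described (names and sizes are illustrative, not the
patch's actual definitions):

    /* flat, cache-friendly array of programs consumed on the fast
     * path; control-plane bookkeeping (locks, revision, ...) is kept
     * out of these cache lines
     */
    struct tcx_prog_item {
            struct bpf_prog *prog;
            u32 flags;
    };

    struct tcx_prog_array {
            struct tcx_prog_item items[64];  /* static bound */
    };

    struct tcx_entry_sketch {
            struct tcx_prog_array __rcu *fastpath;
            /* slow-path/control-plane state lives separately */
    };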
>
> The goal with tcx is to have maximum compatibility with existing tc BPF
> programs, so they don't need to be rewritten specifically. Compatibility to
> call into classic tcf_classify() is also provided in order to allow
> successive migration, or for both to cleanly co-exist where needed, given
> it's all one logical tc layer. tcx supports the simplified return code
> TCX_NEXT, which is non-terminating (go to the next program), and the
> terminating ones TCX_PASS, TCX_DROP and TCX_REDIRECT. The fd-based API is
> behind a static key, so that when unused the code is not entered. The
> struct tcx_entry's program array is currently static, but could be made
> dynamic if necessary at a point in the future. The a/b pair swap design has
> been chosen so that detachment requires no allocations, which could
> otherwise fail. The work has been tested with the tc-testing selftest
> suite, which passes in full, as well as with the tc BPF tests from the BPF
> CI, and with Cilium's L4LB.
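For a sense of what programs see, a minimal sketch of a program using the
simplified return codes (assuming the TCX_* values from this series' uapi
additions are available through linux/bpf.h; the verdict logic is
illustrative only):

    // SPDX-License-Identifier: GPL-2.0
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("tc")
    int tcx_sketch(struct __sk_buff *skb)
    {
            if (skb->len == 0)
                    return TCX_DROP;  /* terminating verdict */
            return TCX_NEXT;          /* fall through to next program */
    }

    char __license[] SEC("license") = "GPL";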
>
> Kudos also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
> of this work.
>
> [0] https://lpc.events/event/16/contributions/1353/
> [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com/
> [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
> [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
> [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com/
>
> Signed-off-by: Daniel Borkmann <daniel@...earbox.net>
> ---
> MAINTAINERS | 4 +-
> include/linux/netdevice.h | 15 +-
> include/linux/skbuff.h | 4 +-
> include/net/sch_generic.h | 2 +-
> include/net/tcx.h | 157 +++++++++++++++
> include/uapi/linux/bpf.h | 35 +++-
> kernel/bpf/Kconfig | 1 +
> kernel/bpf/Makefile | 1 +
> kernel/bpf/syscall.c | 95 +++++++--
> kernel/bpf/tcx.c | 347 +++++++++++++++++++++++++++++++++
> net/Kconfig | 5 +
> net/core/dev.c | 267 +++++++++++++++----------
> net/core/filter.c | 4 +-
> net/sched/Kconfig | 4 +-
> net/sched/sch_ingress.c | 45 ++++-
> tools/include/uapi/linux/bpf.h | 35 +++-
> 16 files changed, 877 insertions(+), 144 deletions(-)
> create mode 100644 include/net/tcx.h
> create mode 100644 kernel/bpf/tcx.c
>
[...]
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 207f8a37b327..e7584e24bc83 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1035,6 +1035,8 @@ enum bpf_attach_type {
> BPF_TRACE_KPROBE_MULTI,
> BPF_LSM_CGROUP,
> BPF_STRUCT_OPS,
> + BPF_TCX_INGRESS,
> + BPF_TCX_EGRESS,
> __MAX_BPF_ATTACH_TYPE
> };
>
> @@ -1052,7 +1054,7 @@ enum bpf_link_type {
> BPF_LINK_TYPE_KPROBE_MULTI = 8,
> BPF_LINK_TYPE_STRUCT_OPS = 9,
> BPF_LINK_TYPE_NETFILTER = 10,
> -
> + BPF_LINK_TYPE_TCX = 11,
> MAX_BPF_LINK_TYPE,
> };
>
> @@ -1559,13 +1561,13 @@ union bpf_attr {
> __u32 map_fd; /* struct_ops to attach */
> };
> union {
> - __u32 target_fd; /* object to attach to */
> - __u32 target_ifindex; /* target ifindex */
> + __u32 target_fd; /* target object to attach to or ... */
> + __u32 target_ifindex; /* target ifindex */
> };
> __u32 attach_type; /* attach type */
> __u32 flags; /* extra flags */
> union {
> - __u32 target_btf_id; /* btf_id of target to attach to */
> + __u32 target_btf_id; /* btf_id of target to attach to */
nit: should this part be in patch 1?
> struct {
> __aligned_u64 iter_info; /* extra bpf_iter_link_info */
> __u32 iter_info_len; /* iter_info length */
> @@ -1599,6 +1601,13 @@ union bpf_attr {
> __s32 priority;
> __u32 flags;
> } netfilter;
> + struct {
> + union {
> + __u32 relative_fd;
> + __u32 relative_id;
> + };
> + __u32 expected_revision;
> + } tcx;
> };
> } link_create;
>
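As a usage sketch of the new fields via the raw syscall, based only on the
uapi visible in this hunk (error handling elided; treating
expected_revision == 0 as "don't check" is an assumption from the prior
patch's API discussion):

    #include <linux/bpf.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int tcx_link_create(int prog_fd, int ifindex, __u32 revision)
    {
            union bpf_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.link_create.prog_fd = prog_fd;
            attr.link_create.target_ifindex = ifindex;
            attr.link_create.attach_type = BPF_TCX_INGRESS;
            /* fail the attach if the prog array changed underneath us */
            attr.link_create.tcx.expected_revision = revision;
            return syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));
    }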
[...]
> +int tcx_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> + struct net *net = current->nsproxy->net_ns;
> + struct bpf_link_primer link_primer;
> + struct net_device *dev;
> + struct tcx_link *link;
> + int fd, err;
> +
> + dev = dev_get_by_index(net, attr->link_create.target_ifindex);
> + if (!dev)
> + return -EINVAL;
> + link = kzalloc(sizeof(*link), GFP_USER);
> + if (!link) {
> + err = -ENOMEM;
> + goto out_put;
> + }
> +
> + bpf_link_init(&link->link, BPF_LINK_TYPE_TCX, &tcx_link_lops, prog);
> + link->location = attr->link_create.attach_type;
> + link->flags = attr->link_create.flags & (BPF_F_FIRST | BPF_F_LAST);
> + link->dev = dev;
> +
> + err = bpf_link_prime(&link->link, &link_primer);
> + if (err) {
> + kfree(link);
> + goto out_put;
> + }
> + rtnl_lock();
> + err = tcx_link_prog_attach(&link->link, attr->link_create.flags,
> + attr->link_create.tcx.relative_fd,
> + attr->link_create.tcx.expected_revision);
> + if (!err)
> + fd = bpf_link_settle(&link_primer);
Why this early settle? It makes the error handling logic more convoluted.
Maybe leave link->dev as-is and let bpf_link_cleanup() handle dev_put(dev)?
Can it be just:

	err = tcx_link_prog_attach(...);
	rtnl_unlock();
	if (err) {
		link->dev = NULL;
		bpf_link_cleanup(&link_primer);
		goto out_put;
	}
	dev_put(dev);
	return bpf_link_settle(&link_primer);

?
> + rtnl_unlock();
> + if (err) {
> + link->dev = NULL;
> + bpf_link_cleanup(&link_primer);
> + goto out_put;
> + }
> + dev_put(dev);
> + return fd;
> +out_put:
> + dev_put(dev);
> + return err;
> +}
[...]