Message-ID: <ZLV29eeWRETGkE02@google.com>
Date: Mon, 17 Jul 2023 10:14:29 -0700
From: Stanislav Fomichev <sdf@...gle.com>
To: Daniel Borkmann <daniel@...earbox.net>
Cc: ast@...nel.org, andrii@...nel.org, martin.lau@...ux.dev,
razor@...ckwall.org, john.fastabend@...il.com, kuba@...nel.org, dxu@...uu.xyz,
joe@...ium.io, toke@...nel.org, davem@...emloft.net, bpf@...r.kernel.org,
netdev@...r.kernel.org
Subject: Re: [PATCH bpf-next v5 1/8] bpf: Add generic attach/detach/query API
for multi-progs
On 07/14, Daniel Borkmann wrote:
> This adds a generic layer called bpf_mprog which can be reused by different
> attachment layers to enable multi-program attachment and dependency resolution.
> In-kernel users of bpf_mprog don't need to care about the dependency
> resolution internals; they can just consume it with a few API calls.
>
> The initial idea of having a generic API sparked out of a discussion [0] on an
> earlier revision of this work, where tc's priority was reused and exposed via
> BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
> to classic tc BPF. The feedback was that priority provides a bad user
> experience and is hard to use [1], e.g.:
>
> I cannot help but feel that priority logic copy-paste from old tc, netfilter
> and friends is done because "that's how things were done in the past". [...]
> Priority gets exposed everywhere in uapi all the way to bpftool when it's
> right there for users to understand. And that's the main problem with it.
>
> The user doesn't want to and doesn't need to be aware of it, but uapi forces
> them to pick the priority. [...] Your cover letter [0] example proves that in
> real life different services pick the same priority. They simply don't know
> any better. Priority is an unnecessary magic that apps _have_ to pick, so
> they just copy-paste and everyone ends up using the same.
>
> The course of the discussion showed more and more the need for a generic,
> reusable API where the "same look and feel" can be applied to various other
> program types beyond just tc BPF. For example, XDP today does not have multi-
> program support in the kernel, and there was also interest in this API for
> improving management of cgroup program types. Such a common multi-program
> management concept is useful for BPF management daemons or user space BPF
> applications coordinating internally about their attachments.
>
> Both from the Cilium and Meta side [2], we've collected the following requirements
> for a generic attach/detach/query API for multi-progs, which have been implemented
> as part of this work:
>
> - Support prog-based attach/detach and link API
> - Dependency directives (can also be combined):
> - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
> - BPF_F_ID flag as {fd,id} toggle; the rationale for id is that a user
> space application does not need CAP_SYS_ADMIN to retrieve foreign fds
> via bpf_*_get_fd_by_id()
> - BPF_F_LINK flag as {prog,link} toggle
> - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
> BPF_F_AFTER will just append for attaching
> - Enforced only at attach time
> - BPF_F_REPLACE with replace_bpf_fd which can be prog, links have their
> own infra for replacing their internal prog
> - If no flags are set, then it's default append behavior for attaching
> - Internal revision counter and optionally being able to pass expected_revision
> - User space application can query current state with revision, and pass it
> along for attachment to assert current state before doing updates
> - Query also gets extension for link_ids array and link_attach_flags:
> - prog_ids are always filled with program IDs
> - link_ids are filled with link IDs when link was used, otherwise 0
> - {prog,link}_attach_flags for holding {prog,link}-specific flags
> - Must be easy to integrate/reuse for in-kernel users
>
> The uapi-side changes needed for supporting bpf_mprog are rather minimal,
> consisting of the addition of the attachment flags and revision counter, and
> expanding the existing union with the relative_{fd,id} member.
>
> The bpf_mprog framework consists of a bpf_mprog_entry object which holds
> an array of bpf_mprog_fp (fast-path structure). The bpf_mprog_cp (control-path
> structure) is part of bpf_mprog_bundle. Both have been separated so that the
> fast path gets efficient packing of bpf_prog pointers for maximum cache
> efficiency. Also, an array has been chosen instead of a linked list or other
> structures to remove unnecessary indirections for a fast entry point into
> BPF from tc.
>
> The bpf_mprog_entry comes as a pair via bpf_mprog_bundle so that in case of
> updates the peer bpf_mprog_entry is populated and then just swapped, which
> avoids additional allocations that could otherwise fail, for example, in the
> detach case. The bpf_mprog_{fp,cp} arrays are currently static, but they could
> be converted to dynamic allocation if necessary at some point in the future.
> Locking is deferred to the in-kernel user of bpf_mprog; for example, in the case
> of tcx, which uses this API in the next patch, it piggybacks on rtnl.
>
> An extensive test suite covering all aspects of this API for both prog-based
> attach/detach and the link API comes as BPF selftests in this series.
>
> Thanks also to Andrii Nakryiko for early API discussions wrt Meta's BPF prog
> management.
>
> [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net
> [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
> [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
>
> Signed-off-by: Daniel Borkmann <daniel@...earbox.net>
> ---
> MAINTAINERS | 1 +
> include/linux/bpf_mprog.h | 318 ++++++++++++++++++++++++
> include/uapi/linux/bpf.h | 36 ++-
> kernel/bpf/Makefile | 2 +-
> kernel/bpf/mprog.c | 439 +++++++++++++++++++++++++++++++++
> tools/include/uapi/linux/bpf.h | 36 ++-
> 6 files changed, 815 insertions(+), 17 deletions(-)
> create mode 100644 include/linux/bpf_mprog.h
> create mode 100644 kernel/bpf/mprog.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index d1fc5d26c699..4f3b8139c6ff 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3684,6 +3684,7 @@ F: include/linux/filter.h
> F: include/linux/tnum.h
> F: kernel/bpf/core.c
> F: kernel/bpf/dispatcher.c
> +F: kernel/bpf/mprog.c
> F: kernel/bpf/syscall.c
> F: kernel/bpf/tnum.c
> F: kernel/bpf/trampoline.c
> diff --git a/include/linux/bpf_mprog.h b/include/linux/bpf_mprog.h
> new file mode 100644
> index 000000000000..6feefec43422
> --- /dev/null
> +++ b/include/linux/bpf_mprog.h
> @@ -0,0 +1,318 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2023 Isovalent */
> +#ifndef __BPF_MPROG_H
> +#define __BPF_MPROG_H
> +
> +#include <linux/bpf.h>
> +
> +/* bpf_mprog framework:
> + *
> + * bpf_mprog is a generic layer for multi-program attachment. In-kernel users
> + * of bpf_mprog don't need to care about the dependency resolution
> + * internals; they can just consume it with a few API calls. Currently available
> + * dependency directives are BPF_F_{BEFORE,AFTER} which enable insertion of
> + * a BPF program or BPF link relative to an existing BPF program or BPF link
> + * inside the multi-program array, as well as prepend and append behavior if
> + * no relative object was specified; see the corresponding selftests for
> + * concrete examples (e.g. tc_links and tc_opts test cases of test_progs).
> + *
> + * Usage of bpf_mprog_{attach,detach,query}() core APIs with pseudo code:
> + *
> + * Attach case:
> + *
> + * struct bpf_mprog_entry *entry, *entry_new;
> + * int ret;
> + *
> + * // bpf_mprog user-side lock
> + * // fetch active @entry from attach location
> + * [...]
> + * ret = bpf_mprog_attach(entry, &entry_new, [...]);
> + * if (!ret) {
> + * if (entry != entry_new) {
> + * // swap @entry to @entry_new at attach location
> + * // ensure there are no inflight users of @entry:
> + * synchronize_rcu();
> + * }
> + * bpf_mprog_commit(entry);
> + * } else {
> + * // error path, bail out, propagate @ret
> + * }
> + * // bpf_mprog user-side unlock
> + *
> + * Detach case:
> + *
> + * struct bpf_mprog_entry *entry, *entry_new;
> + * int ret;
> + *
> + * // bpf_mprog user-side lock
> + * // fetch active @entry from attach location
> + * [...]
> + * ret = bpf_mprog_detach(entry, &entry_new, [...]);
> + * if (!ret) {
> + * // all (*) marked is optional and depends on the use-case
> + * // whether bpf_mprog_bundle should be freed or not
> + * if (!bpf_mprog_total(entry_new)) (*)
> + * entry_new = NULL (*)
> + * // swap @entry to @entry_new at attach location
> + * // ensure there are no inflight users of @entry:
> + * synchronize_rcu();
> + * bpf_mprog_commit(entry);
> + * if (!entry_new) (*)
> + * // free bpf_mprog_bundle (*)
> + * } else {
> + * // error path, bail out, propagate @ret
> + * }
> + * // bpf_mprog user-side unlock
> + *
> + * Query case:
> + *
> + * struct bpf_mprog_entry *entry;
> + * int ret;
> + *
> + * // bpf_mprog user-side lock
> + * // fetch active @entry from attach location
> + * [...]
> + * ret = bpf_mprog_query(attr, uattr, entry);
> + * // bpf_mprog user-side unlock
> + *
> + * Data/fast path:
> + *
> + * struct bpf_mprog_entry *entry;
> + * struct bpf_mprog_fp *fp;
> + * struct bpf_prog *prog;
> + * int ret = [...];
> + *
> + * rcu_read_lock();
> + * // fetch active @entry from attach location
> + * [...]
> + * bpf_mprog_foreach_prog(entry, fp, prog) {
> + * ret = bpf_prog_run(prog, [...]);
> + * // process @ret from program
> + * }
> + * [...]
> + * rcu_read_unlock();
> + *
> + * bpf_mprog locking considerations:
> + *
> + * bpf_mprog_{attach,detach,query}() must be protected by an external lock
> + * (like RTNL in case of tcx).
> + *
> + * bpf_mprog_entry pointer can be an __rcu annotated pointer (in case of tcx
> + * the netdevice has tcx_ingress and tcx_egress __rcu pointer) which gets
> + * updated via rcu_assign_pointer() pointing to the active bpf_mprog_entry of
> + * the bpf_mprog_bundle.
> + *
> + * Fast path accesses the active bpf_mprog_entry within RCU critical section
> + * (in case of tcx it runs in NAPI which provides RCU protection there,
> + * other users might need explicit rcu_read_lock()). The bpf_mprog_commit()
> + * assumes that for the old bpf_mprog_entry there are no inflight users
> + * anymore.
> + *
> + * The READ_ONCE()/WRITE_ONCE() pairing for bpf_mprog_fp's prog access is for
> + * the replacement case where we don't swap the bpf_mprog_entry.
> + */
> +
> +#define bpf_mprog_foreach_tuple(entry, fp, cp, t) \
> + for (fp = &entry->fp_items[0], cp = &entry->parent->cp_items[0];\
> + ({ \
> + t.prog = READ_ONCE(fp->prog); \
> + t.link = cp->link; \
> + t.prog; \
> + }); \
> + fp++, cp++)
> +
> +#define bpf_mprog_foreach_prog(entry, fp, p) \
> + for (fp = &entry->fp_items[0]; \
> + (p = READ_ONCE(fp->prog)); \
> + fp++)
> +
> +#define BPF_MPROG_MAX 64
> +
> +struct bpf_mprog_fp {
> + struct bpf_prog *prog;
> +};
> +
> +struct bpf_mprog_cp {
> + struct bpf_link *link;
> +};
> +
> +struct bpf_mprog_entry {
> + struct bpf_mprog_fp fp_items[BPF_MPROG_MAX];
> + struct bpf_mprog_bundle *parent;
> +};
> +
> +struct bpf_mprog_bundle {
> + struct bpf_mprog_entry a;
> + struct bpf_mprog_entry b;
> + struct bpf_mprog_cp cp_items[BPF_MPROG_MAX];
> + struct bpf_prog *ref;
> + atomic64_t revision;
> + u32 count;
> +};
> +
> +struct bpf_tuple {
> + struct bpf_prog *prog;
> + struct bpf_link *link;
> +};
> +
> +static inline struct bpf_mprog_entry *
> +bpf_mprog_peer(const struct bpf_mprog_entry *entry)
> +{
> + if (entry == &entry->parent->a)
> + return &entry->parent->b;
> + else
> + return &entry->parent->a;
> +}
> +
> +static inline void bpf_mprog_bundle_init(struct bpf_mprog_bundle *bundle)
> +{
> + BUILD_BUG_ON(sizeof(bundle->a.fp_items[0]) > sizeof(u64));
> + BUILD_BUG_ON(ARRAY_SIZE(bundle->a.fp_items) !=
> + ARRAY_SIZE(bundle->cp_items));
> +
> + memset(bundle, 0, sizeof(*bundle));
> + atomic64_set(&bundle->revision, 1);
> + bundle->a.parent = bundle;
> + bundle->b.parent = bundle;
> +}
> +
> +static inline void bpf_mprog_inc(struct bpf_mprog_entry *entry)
> +{
> + entry->parent->count++;
> +}
> +
> +static inline void bpf_mprog_dec(struct bpf_mprog_entry *entry)
> +{
> + entry->parent->count--;
> +}
> +
> +static inline int bpf_mprog_max(void)
> +{
> + return ARRAY_SIZE(((struct bpf_mprog_entry *)NULL)->fp_items) - 1;
> +}
> +
> +static inline int bpf_mprog_total(struct bpf_mprog_entry *entry)
> +{
> + int total = entry->parent->count;
> +
> + WARN_ON_ONCE(total > bpf_mprog_max());
> + return total;
> +}
> +
> +static inline bool bpf_mprog_exists(struct bpf_mprog_entry *entry,
> + struct bpf_prog *prog)
> +{
> + const struct bpf_mprog_fp *fp;
> + const struct bpf_prog *tmp;
> +
> + bpf_mprog_foreach_prog(entry, fp, tmp) {
> + if (tmp == prog)
> + return true;
> + }
> + return false;
> +}
> +
> +static inline void bpf_mprog_mark_for_release(struct bpf_mprog_entry *entry,
> + struct bpf_tuple *tuple)
> +{
> + WARN_ON_ONCE(entry->parent->ref);
> + if (!tuple->link)
> + entry->parent->ref = tuple->prog;
> +}
> +
> +static inline void bpf_mprog_complete_release(struct bpf_mprog_entry *entry)
> +{
> + /* In the non-link case prog deletions can only drop the reference
> + * to the prog after the bpf_mprog_entry got swapped and the
> + * bpf_mprog ensured that there are no inflight users anymore.
> + *
> + * Paired with bpf_mprog_mark_for_release().
> + */
> + if (entry->parent->ref) {
> + bpf_prog_put(entry->parent->ref);
> + entry->parent->ref = NULL;
> + }
> +}
> +
> +static inline void bpf_mprog_revision_new(struct bpf_mprog_entry *entry)
> +{
> + atomic64_inc(&entry->parent->revision);
> +}
> +
> +static inline void bpf_mprog_commit(struct bpf_mprog_entry *entry)
> +{
> + bpf_mprog_complete_release(entry);
> + bpf_mprog_revision_new(entry);
> +}
> +
> +static inline u64 bpf_mprog_revision(struct bpf_mprog_entry *entry)
> +{
> + return atomic64_read(&entry->parent->revision);
> +}
> +
> +static inline void bpf_mprog_entry_copy(struct bpf_mprog_entry *dst,
> + struct bpf_mprog_entry *src)
> +{
> + memcpy(dst->fp_items, src->fp_items, sizeof(src->fp_items));
> +}
> +
> +static inline void bpf_mprog_entry_grow(struct bpf_mprog_entry *entry, int idx)
> +{
> + int total = bpf_mprog_total(entry);
> +
> + memmove(entry->fp_items + idx + 1,
> + entry->fp_items + idx,
> + (total - idx) * sizeof(struct bpf_mprog_fp));
> +
> + memmove(entry->parent->cp_items + idx + 1,
> + entry->parent->cp_items + idx,
> + (total - idx) * sizeof(struct bpf_mprog_cp));
> +}
> +
> +static inline void bpf_mprog_entry_shrink(struct bpf_mprog_entry *entry, int idx)
> +{
> + /* Total array size is needed in this case to ensure the NULL
> + * entry is copied at the end.
> + */
> + int total = ARRAY_SIZE(entry->fp_items);
> +
> + memmove(entry->fp_items + idx,
> + entry->fp_items + idx + 1,
> + (total - idx - 1) * sizeof(struct bpf_mprog_fp));
> +
> + memmove(entry->parent->cp_items + idx,
> + entry->parent->cp_items + idx + 1,
> + (total - idx - 1) * sizeof(struct bpf_mprog_cp));
> +}
> +
> +static inline void bpf_mprog_read(struct bpf_mprog_entry *entry, u32 idx,
> + struct bpf_mprog_fp **fp,
> + struct bpf_mprog_cp **cp)
> +{
> + *fp = &entry->fp_items[idx];
> + *cp = &entry->parent->cp_items[idx];
> +}
> +
> +static inline void bpf_mprog_write(struct bpf_mprog_fp *fp,
> + struct bpf_mprog_cp *cp,
> + struct bpf_tuple *tuple)
> +{
> + WRITE_ONCE(fp->prog, tuple->prog);
> + cp->link = tuple->link;
> +}
> +
> +int bpf_mprog_attach(struct bpf_mprog_entry *entry,
> + struct bpf_mprog_entry **entry_new,
> + struct bpf_prog *prog_new, struct bpf_link *link,
> + struct bpf_prog *prog_old,
> + u32 flags, u32 id_or_fd, u64 revision);
> +
> +int bpf_mprog_detach(struct bpf_mprog_entry *entry,
> + struct bpf_mprog_entry **entry_new,
> + struct bpf_prog *prog, struct bpf_link *link,
> + u32 flags, u32 id_or_fd, u64 revision);
> +
> +int bpf_mprog_query(const union bpf_attr *attr, union bpf_attr __user *uattr,
> + struct bpf_mprog_entry *entry);
> +
> +#endif /* __BPF_MPROG_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 600d0caebbd8..1c166870cdf3 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1113,7 +1113,12 @@ enum bpf_perf_event_type {
> */
> #define BPF_F_ALLOW_OVERRIDE (1U << 0)
> #define BPF_F_ALLOW_MULTI (1U << 1)
> +/* Generic attachment flags. */
> #define BPF_F_REPLACE (1U << 2)
> +#define BPF_F_BEFORE (1U << 3)
> +#define BPF_F_AFTER (1U << 4)
> +#define BPF_F_ID (1U << 5)
> +#define BPF_F_LINK BPF_F_LINK /* 1 << 13 */
>
> /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
> * verifier will perform strict alignment checking as if the kernel
> @@ -1444,14 +1449,19 @@ union bpf_attr {
> };
>
> struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
> - __u32 target_fd; /* container object to attach to */
> - __u32 attach_bpf_fd; /* eBPF program to attach */
> + union {
> + __u32 target_fd; /* target object to attach to or ... */
> + __u32 target_ifindex; /* target ifindex */
> + };
> + __u32 attach_bpf_fd;
> __u32 attach_type;
> __u32 attach_flags;
> - __u32 replace_bpf_fd; /* previously attached eBPF
> - * program to replace if
> - * BPF_F_REPLACE is used
> - */
> + __u32 replace_bpf_fd;
> + union {
> + __u32 relative_fd;
> + __u32 relative_id;
> + };
> + __u64 expected_revision;
> };
>
> struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
> @@ -1497,16 +1507,26 @@ union bpf_attr {
> } info;
>
> struct { /* anonymous struct used by BPF_PROG_QUERY command */
> - __u32 target_fd; /* container object to query */
> + union {
> + __u32 target_fd; /* target object to query or ... */
> + __u32 target_ifindex; /* target ifindex */
> + };
> __u32 attach_type;
> __u32 query_flags;
> __u32 attach_flags;
> __aligned_u64 prog_ids;
> - __u32 prog_cnt;
> + union {
> + __u32 prog_cnt;
> + __u32 count;
> + };
> + __u32 :32;
> /* output: per-program attach_flags.
> * not allowed to be set during effective query.
> */
> __aligned_u64 prog_attach_flags;
> + __aligned_u64 link_ids;
> + __aligned_u64 link_attach_flags;
> + __u64 revision;
> } query;
>
> struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 1d3892168d32..1bea2eb912cd 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -12,7 +12,7 @@ obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list
> obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
> obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
> obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o
> -obj-$(CONFIG_BPF_SYSCALL) += disasm.o
> +obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o
> obj-$(CONFIG_BPF_JIT) += trampoline.o
> obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
> obj-$(CONFIG_BPF_JIT) += dispatcher.o
> diff --git a/kernel/bpf/mprog.c b/kernel/bpf/mprog.c
> new file mode 100644
> index 000000000000..f6c0b015114f
> --- /dev/null
> +++ b/kernel/bpf/mprog.c
> @@ -0,0 +1,439 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2023 Isovalent */
> +
> +#include <linux/bpf.h>
> +#include <linux/bpf_mprog.h>
> +
> +static int bpf_mprog_link(struct bpf_tuple *tuple,
> + u32 id_or_fd, u32 flags,
> + enum bpf_prog_type type)
> +{
> + bool id = flags & BPF_F_ID;
> + struct bpf_link *link;
> +
> + if (id)
> + link = bpf_link_by_id(id_or_fd);
> + else
> + link = bpf_link_get_from_fd(id_or_fd);
> + if (IS_ERR(link))
> + return PTR_ERR(link);
> + if (type && link->prog->type != type) {
> + bpf_link_put(link);
> + return -EINVAL;
> + }
> +
> + tuple->link = link;
> + tuple->prog = link->prog;
> + return 0;
> +}
> +
> +static int bpf_mprog_prog(struct bpf_tuple *tuple,
> + u32 id_or_fd, u32 flags,
> + enum bpf_prog_type type)
> +{
> + bool id = flags & BPF_F_ID;
> + struct bpf_prog *prog;
> +
> + if (id)
> + prog = bpf_prog_by_id(id_or_fd);
> + else
> + prog = bpf_prog_get(id_or_fd);
> + if (IS_ERR(prog)) {
> + if (!id_or_fd && !id)
> + return 0;
Andrii was asking whether we need to explicitly reject zero in [0], and
we've been chatting with Alexei about the same in [1]. So are we trying
to be more flexible here on purpose, or should we outright return -EINVAL
for a 0 id_or_fd? (Or am I misreading this part?)

The rest looks great, thanks for the docs indeed!
0: https://lore.kernel.org/bpf/CAEf4Bza_X30yLPm0Lhy2c-u1Qw1Ci9AVoy5jo_XXCaT9zz+3jg@mail.gmail.com/
1: https://lore.kernel.org/bpf/CAKH8qBsr5vYijQSVv0EO8TF7zfoAdAaWC8jpVKK_nGSgAoyiQg@mail.gmail.com/#t