Message-ID: <921201ff-8c8d-b523-5df6-3326f6cd0fd9@fb.com>
Date: Mon, 23 Dec 2019 19:57:48 +0000
From: Yonghong Song <yhs@...com>
To: Martin Lau <kafai@...com>,
"bpf@...r.kernel.org" <bpf@...r.kernel.org>
CC: Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
David Miller <davem@...emloft.net>,
Kernel Team <Kernel-team@...com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [PATCH bpf-next v2 06/11] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS
On 12/20/19 10:26 PM, Martin KaFai Lau wrote:
> The patch introduces BPF_MAP_TYPE_STRUCT_OPS. The map value
> is a kernel struct whose func ptrs are implemented in bpf progs.
> This new map is the interface to register/unregister/introspect
> a bpf implemented kernel struct.
>
> The kernel struct is actually embedded inside another new struct
> (called the "value" struct in the code). For example,
> "struct tcp_congestion_ops" is embedded in:
> struct bpf_struct_ops_tcp_congestion_ops {
> 	refcount_t refcnt;
> 	enum bpf_struct_ops_state state;
> 	struct tcp_congestion_ops data; /* <-- kernel subsystem struct here */
> }
> The map value is "struct bpf_struct_ops_tcp_congestion_ops".
> The "bpftool map dump" will then be able to show the
> state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g.
> number of tcp_sock in the tcp_congestion_ops case). This "value" struct
> is created automatically by a macro. Having a separate "value" struct
> will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
> "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
> initialization works before registering the struct_ops to the kernel
> subsystem). The libbpf will take care of finding and populating the
> "struct bpf_struct_ops_XYZ" from "struct XYZ".
>
> Register a struct_ops to a kernel subsystem:
> 1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
> 2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
> set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
> running kernel.
> Instead of reusing the attr->btf_value_type_id,
> btf_vmlinux_value_type_id is added such that attr->btf_fd can still be
> used as the "user" btf which could store other useful sysadmin/debug
> info that may be introduced in the future,
> e.g. creation-date/compiler-details/map-creator...etc.
> 3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
> in the running kernel btf. Populate the value of this object.
> The function ptr should be populated with the prog fds.
> 4. Call BPF_MAP_UPDATE with the object created in (3) as
> the map value. The key is always "0".
>
> During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
> args as an array of u64 is generated. BPF_MAP_UPDATE also allows
> the specific struct_ops to do some final checks in "st_ops->init_member()"
> (e.g. ensure all mandatory func ptrs are implemented).
> If everything looks good, it will register this kernel struct
> to the kernel subsystem. The map will not allow further update
> from this point.
>
> Unregister a struct_ops from the kernel subsystem:
> BPF_MAP_DELETE with key "0".
>
> Introspect a struct_ops:
> BPF_MAP_LOOKUP_ELEM with key "0". The map value returned will
> have the prog _id_ populated in the func ptr fields.
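To make sure I follow the intended flow end-to-end, here is a minimal
userspace sketch of the whole lifecycle (raw bpf(2) syscall, error
handling omitted; it needs the uapi header from this series, and
VALUE_SZ/value_type_id are placeholders for what userspace derives
from the vmlinux BTF):

	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/bpf.h>

	#define VALUE_SZ 256	/* placeholder: vt->size from vmlinux BTF */

	int main(void)
	{
		union bpf_attr attr = {};
		char value[VALUE_SZ] = {};	/* layout per vmlinux BTF */
		__u32 value_type_id = 0;	/* placeholder: btf id of the "value" struct */
		__u32 key = 0;
		int map_fd;

		/* (2) create the map, typed by the vmlinux "value" struct */
		attr.map_type = BPF_MAP_TYPE_STRUCT_OPS;
		attr.key_size = sizeof(key);
		attr.value_size = VALUE_SZ;
		attr.max_entries = 1;
		attr.btf_vmlinux_value_type_id = value_type_id;
		map_fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));

		/* (3) fill the func-ptr members of value[] with prog fds at
		 * their BTF-described offsets, then (4) register:
		 */
		attr = (union bpf_attr){};
		attr.map_fd = map_fd;
		attr.key = (__u64)(unsigned long)&key;
		attr.value = (__u64)(unsigned long)value;
		syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));

		/* introspect: func-ptr fields come back holding prog ids */
		syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));

		/* unregister */
		syscall(__NR_bpf, BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
		return 0;
	}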
>
> The map value state (enum bpf_struct_ops_state) will transit from:
> INIT (map created) =>
> INUSE (map updated, i.e. reg) =>
> TOBEFREE (map value deleted, i.e. unreg)
>
> The kernel subsystem needs to call bpf_struct_ops_get() and
> bpf_struct_ops_put() to manage the "refcnt" in the
> "struct bpf_struct_ops_XYZ". This patch uses a separate refcnt
> for the purpose of tracking the subsystem usage. Another approach
> is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
> the subsystem's usage by doing map->refcnt - map->usercnt to filter out
> the map-fd/pinned-map usage. However, that will also tie down the
> future semantics of map->refcnt and map->usercnt.
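For context, I assume bpf_struct_ops_get()/_put() end up as thin
wrappers around this embedded refcnt, roughly like below (my reading
of the intent, not the actual patch text):

	bool bpf_struct_ops_get(void *kdata)
	{
		struct bpf_struct_ops_value *kvalue;

		/* kdata points at the embedded kernel struct; step back
		 * to the common value header that carries the refcnt
		 */
		kvalue = container_of(kdata, struct bpf_struct_ops_value, data);

		return refcount_inc_not_zero(&kvalue->refcnt);
	}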
>
> The very first subsystem's refcnt (during reg()) holds one
> count to map->refcnt. When the very last subsystem's refcnt
> is gone, it will also release the map->refcnt. All bpf_prog will be
> freed when the map->refcnt reaches 0 (i.e. during map_free()).
>
> Here is what the bpftool map command output will look like:
> [root@...h-fb-vm1 bpf]# bpftool map show
> 6: struct_ops name dctcp flags 0x0
> key 4B value 256B max_entries 1 memlock 4096B
> btf_id 6
> [root@...h-fb-vm1 bpf]# bpftool map dump id 6
> [{
> "value": {
> "refcnt": {
> "refs": {
> "counter": 1
> }
> },
> "state": 1,
The bpftool dump with "state" 1 is a little bit cryptic.
Since this is common to all struct_ops maps, can we make it
explicit, e.g., print the enum value names like INIT/INUSE/TOBEFREE?
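Even a local fallback table would help (just a sketch; ideally the
names would come from the enum's BTF):

	/* sketch: translate the raw state value to its enum name */
	static const char * const state_name[] = {
		[0] = "INIT",		/* BPF_STRUCT_OPS_STATE_INIT */
		[1] = "INUSE",		/* BPF_STRUCT_OPS_STATE_INUSE */
		[2] = "TOBEFREE",	/* BPF_STRUCT_OPS_STATE_TOBEFREE */
	};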
> "data": {
> "list": {
> "next": 0,
> "prev": 0
> },
> "key": 0,
> "flags": 2,
> "init": 24,
> "release": 0,
> "ssthresh": 25,
> "cong_avoid": 30,
> "set_state": 27,
> "cwnd_event": 28,
> "in_ack_event": 26,
> "undo_cwnd": 29,
> "pkts_acked": 0,
> "min_tso_segs": 0,
> "sndbuf_expand": 0,
> "cong_control": 0,
> "get_info": 0,
> "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
> ],
> "owner": 0
> }
> }
> }
> ]
>
> Misc Notes:
> * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
> It does an in-place update on "*value" instead of returning a pointer
> to syscall.c. Otherwise, it would need a separate copy of the "zero"
> value for the BPF_STRUCT_OPS_STATE_INIT state to avoid races.
>
> * The bpf_struct_ops_map_delete_elem() is also called without
> preempt_disable() from map_delete_elem(). This is because
> "->unreg()" may require a sleepable context, e.g.
> "tcp_unregister_congestion_control()".
>
> * "const" is added to some of the existing "struct btf_func_model *"
> function arg to avoid a compiler warning caused by this patch.
>
> Signed-off-by: Martin KaFai Lau <kafai@...com>
> ---
> arch/x86/net/bpf_jit_comp.c | 11 +-
> include/linux/bpf.h | 49 +++-
> include/linux/bpf_types.h | 3 +
> include/linux/btf.h | 13 +
> include/uapi/linux/bpf.h | 7 +-
> kernel/bpf/bpf_struct_ops.c | 468 +++++++++++++++++++++++++++++++++++-
> kernel/bpf/btf.c | 20 +-
> kernel/bpf/map_in_map.c | 3 +-
> kernel/bpf/syscall.c | 49 ++--
> kernel/bpf/trampoline.c | 5 +-
> kernel/bpf/verifier.c | 5 +
> 11 files changed, 593 insertions(+), 40 deletions(-)
>
[...]
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c1eeb3e0e116..38059880963e 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -136,6 +136,7 @@ enum bpf_map_type {
> BPF_MAP_TYPE_STACK,
> BPF_MAP_TYPE_SK_STORAGE,
> BPF_MAP_TYPE_DEVMAP_HASH,
> + BPF_MAP_TYPE_STRUCT_OPS,
> };
>
> /* Note that tracing related programs such as
> @@ -398,6 +399,10 @@ union bpf_attr {
> __u32 btf_fd; /* fd pointing to a BTF type data */
> __u32 btf_key_type_id; /* BTF type_id of the key */
> __u32 btf_value_type_id; /* BTF type_id of the value */
> + __u32 btf_vmlinux_value_type_id;/* BTF type_id of a kernel-
> + * struct stored as the
> + * map value
> + */
> };
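As a side note for userspace consumers: this new id can be discovered
from the vmlinux BTF, e.g. with libbpf (sketch, error handling
omitted):

	/* btf previously parsed from /sys/kernel/btf/vmlinux */
	__s32 value_type_id;

	value_type_id = btf__find_by_name_kind(btf,
					       "bpf_struct_ops_tcp_congestion_ops",
					       BTF_KIND_STRUCT);
	/* then: attr.btf_vmlinux_value_type_id = value_type_id */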
>
> struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
> @@ -3350,7 +3355,7 @@ struct bpf_map_info {
> __u32 map_flags;
> char name[BPF_OBJ_NAME_LEN];
> __u32 ifindex;
> - __u32 :32;
> + __u32 btf_vmlinux_value_type_id;
> __u64 netns_dev;
> __u64 netns_ino;
> __u32 btf_id;
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index c9f81bd1df83..fb9a0b3e4580 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -10,8 +10,66 @@
> #include <linux/seq_file.h>
> #include <linux/refcount.h>
>
> +enum bpf_struct_ops_state {
> + BPF_STRUCT_OPS_STATE_INIT,
> + BPF_STRUCT_OPS_STATE_INUSE,
> + BPF_STRUCT_OPS_STATE_TOBEFREE,
> +};
> +
> +#define BPF_STRUCT_OPS_COMMON_VALUE \
> + refcount_t refcnt; \
> + enum bpf_struct_ops_state state
> +
> +struct bpf_struct_ops_value {
> + BPF_STRUCT_OPS_COMMON_VALUE;
> + char data[0] ____cacheline_aligned_in_smp;
> +};
> +
> +struct bpf_struct_ops_map {
> + struct bpf_map map;
> + const struct bpf_struct_ops *st_ops;
> + /* protect map_update */
> + spinlock_t lock;
> + /* progs holds all the bpf_progs that are populated
> + * into the func ptrs of the kernel's struct
> + * (in kvalue.data).
> + */
> + struct bpf_prog **progs;
> + /* image is a page that holds all the trampolines
> + * that store the func args before calling the bpf_prog.
> + * A PAGE_SIZE "image" is enough to store all trampolines for
> + * "progs[]".
> + */
> + void *image;
> + /* uvalue->data stores the kernel struct
> + * (e.g. tcp_congestion_ops) that is more useful
> + * to userspace than the kvalue. For example,
> + * the bpf_prog's id is stored instead of the kernel
> + * address of a func ptr.
> + */
> + struct bpf_struct_ops_value *uvalue;
> + /* kvalue.data stores the actual kernel's struct
> + * (e.g. tcp_congestion_ops) that will be
> + * registered to the kernel subsystem.
> + */
> + struct bpf_struct_ops_value kvalue;
> +};
> +
> +#define VALUE_PREFIX "bpf_struct_ops_"
> +#define VALUE_PREFIX_LEN (sizeof(VALUE_PREFIX) - 1)
> +
> +/* bpf_struct_ops_##_name (e.g. bpf_struct_ops_tcp_congestion_ops) is
> + * the map's value exposed to userspace and its btf-type-id is
> + * stored at map->btf_vmlinux_value_type_id.
> + */
> #define BPF_STRUCT_OPS_TYPE(_name) \
> -extern struct bpf_struct_ops bpf_##_name;
> +extern struct bpf_struct_ops bpf_##_name; \
> + \
> +struct bpf_struct_ops_##_name { \
> + BPF_STRUCT_OPS_COMMON_VALUE; \
> + struct _name data ____cacheline_aligned_in_smp; \
> +};
> #include "bpf_struct_ops_types.h"
> #undef BPF_STRUCT_OPS_TYPE
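For readers following along: if I expand it correctly,
BPF_STRUCT_OPS_TYPE(tcp_congestion_ops) should then generate roughly:

	extern struct bpf_struct_ops bpf_tcp_congestion_ops;

	struct bpf_struct_ops_tcp_congestion_ops {
		refcount_t refcnt;
		enum bpf_struct_ops_state state;
		struct tcp_congestion_ops data ____cacheline_aligned_in_smp;
	};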
>
> @@ -35,19 +93,51 @@ const struct bpf_verifier_ops bpf_struct_ops_verifier_ops = {
> const struct bpf_prog_ops bpf_struct_ops_prog_ops = {
> };
>
> +static const struct btf_type *module_type;
> +
> void bpf_struct_ops_init(struct btf *_btf_vmlinux)
> {
> + s32 type_id, value_id, module_id;
> const struct btf_member *member;
> struct bpf_struct_ops *st_ops;
> struct bpf_verifier_log log = {};
> const struct btf_type *t;
> + char value_name[128];
> const char *mname;
> - s32 type_id;
> u32 i, j;
>
> + /* Ensure BTF type is emitted for "struct bpf_struct_ops_##_name" */
> +#define BPF_STRUCT_OPS_TYPE(_name) BTF_TYPE_EMIT(struct bpf_struct_ops_##_name);
> +#include "bpf_struct_ops_types.h"
> +#undef BPF_STRUCT_OPS_TYPE
This looks great!
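If I read it right, BTF_TYPE_EMIT() only needs to reference the type so
the compiler emits debug (and hence BTF) info for it, i.e. something as
small as:

	/* force the type to be described in DWARF/BTF; no runtime cost */
	#define BTF_TYPE_EMIT(type) ((void)(type *)0)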
> +
> + module_id = btf_find_by_name_kind(_btf_vmlinux, "module",
> + BTF_KIND_STRUCT);
> + if (module_id < 0) {
> + pr_warn("Cannot find struct module in btf_vmlinux\n");
> + return;
> + }
> + module_type = btf_type_by_id(_btf_vmlinux, module_id);
> +
> for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
> st_ops = bpf_struct_ops[i];
>
> + if (strlen(st_ops->name) + VALUE_PREFIX_LEN >=
> + sizeof(value_name)) {
> + pr_warn("struct_ops name %s is too long\n",
> + st_ops->name);
> + continue;
> + }
> + sprintf(value_name, "%s%s", VALUE_PREFIX, st_ops->name);
> +
> + value_id = btf_find_by_name_kind(_btf_vmlinux, value_name,
> + BTF_KIND_STRUCT);
> + if (value_id < 0) {
> + pr_warn("Cannot find struct %s in btf_vmlinux\n",
> + value_name);
> + continue;
> + }
> +
> type_id = btf_find_by_name_kind(_btf_vmlinux, st_ops->name,
> BTF_KIND_STRUCT);
> if (type_id < 0) {
> @@ -99,6 +189,9 @@ void bpf_struct_ops_init(struct btf *_btf_vmlinux)
> } else {
> st_ops->type_id = type_id;
> st_ops->type = t;
> + st_ops->value_id = value_id;
> + st_ops->value_type =
> + btf_type_by_id(_btf_vmlinux, value_id);
> }
> }
> }
> @@ -106,6 +199,22 @@ void bpf_struct_ops_init(struct btf *_btf_vmlinux)
>
> extern struct btf *btf_vmlinux;
>
[...]
> +
> +static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
> + void *value, u64 flags)
> +{
> + struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
> + const struct bpf_struct_ops *st_ops = st_map->st_ops;
> + struct bpf_struct_ops_value *uvalue, *kvalue;
> + const struct btf_member *member;
> + const struct btf_type *t = st_ops->type;
> + void *udata, *kdata;
> + int prog_fd, err = 0;
> + void *image;
> + u32 i;
> +
> + if (flags)
> + return -EINVAL;
> +
> + if (*(u32 *)key != 0)
> + return -E2BIG;
> +
> + uvalue = (struct bpf_struct_ops_value *)value;
> + if (uvalue->state || refcount_read(&uvalue->refcnt))
> + return -EINVAL;
> +
> + uvalue = (struct bpf_struct_ops_value *)st_map->uvalue;
> + kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue;
> +
> + spin_lock(&st_map->lock);
> +
> + if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) {
> + err = -EBUSY;
> + goto unlock;
> + }
> +
> + memcpy(uvalue, value, map->value_size);
> +
> + udata = &uvalue->data;
> + kdata = &kvalue->data;
> + image = st_map->image;
> +
> + for_each_member(i, t, member) {
> + const struct btf_type *mtype, *ptype;
> + struct bpf_prog *prog;
> + u32 moff;
> +
> + moff = btf_member_bit_offset(t, member) / 8;
> + mtype = btf_type_by_id(btf_vmlinux, member->type);
> + ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL);
> + if (ptype == module_type) {
> + *(void **)(kdata + moff) = BPF_MODULE_OWNER;
> + continue;
> + }
> +
> + err = st_ops->init_member(t, member, kdata, udata);
> + if (err < 0)
> + goto reset_unlock;
> +
> + /* The ->init_member() has handled this member */
> + if (err > 0)
> + continue;
> +
> + /* If st_ops->init_member does not handle it,
> + * we will only handle func ptrs and zeroed members
> + * here. Reject everything else.
> + */
> +
> + /* All non-func-ptr members must be 0 */
> + if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type,
> + NULL)) {
> + u32 msize;
> +
> + mtype = btf_resolve_size(btf_vmlinux, mtype,
> + &msize, NULL, NULL);
> + if (IS_ERR(mtype)) {
> + err = PTR_ERR(mtype);
> + goto reset_unlock;
> + }
> +
> + if (memchr_inv(udata + moff, 0, msize)) {
> + err = -EINVAL;
> + goto reset_unlock;
> + }
> +
> + continue;
> + }
> +
> + prog_fd = (int)(*(unsigned long *)(udata + moff));
> + /* Similar check as the attr->attach_prog_fd */
> + if (!prog_fd)
> + continue;
> +
> + prog = bpf_prog_get(prog_fd);
> + if (IS_ERR(prog)) {
> + err = PTR_ERR(prog);
> + goto reset_unlock;
> + }
> + st_map->progs[i] = prog;
> +
> + if (prog->type != BPF_PROG_TYPE_STRUCT_OPS ||
> + prog->aux->attach_btf_id != st_ops->type_id ||
> + prog->expected_attach_type != i) {
> + err = -EINVAL;
> + goto reset_unlock;
> + }
> +
> + err = arch_prepare_bpf_trampoline(image,
> + &st_ops->func_models[i], 0,
> + &prog, 1, NULL, 0, NULL);
> + if (err < 0)
> + goto reset_unlock;
> +
> + *(void **)(kdata + moff) = image;
> + image += err;
> +
> + /* put prog_id to udata */
> + *(unsigned long *)(udata + moff) = prog->aux->id;
> + }
Should we check that a `module` member was indeed found before
declaring success here?
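E.g., something along these lines (untested):

	bool has_module_owner = false;
	...
	for_each_member(i, t, member) {
		...
		if (ptype == module_type) {
			*(void **)(kdata + moff) = BPF_MODULE_OWNER;
			has_module_owner = true;
			continue;
		}
		...
	}

	if (!has_module_owner) {
		err = -EINVAL;
		goto reset_unlock;
	}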
> +
> + refcount_set(&kvalue->refcnt, 1);
> + bpf_map_inc(map);
> +
> + err = st_ops->reg(kdata);
> + if (!err) {
> + /* Pair with smp_load_acquire() during lookup */
> + smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_INUSE);
> + goto unlock;
> + }
> +
> + /* Error during st_ops->reg() */
> + bpf_map_put(map);
> +
> +reset_unlock:
> + bpf_struct_ops_map_put_progs(st_map);
> + memset(uvalue, 0, map->value_size);
> + memset(kvalue, 0, map->value_size);
> +
> +unlock:
> + spin_unlock(&st_map->lock);
> + return err;
> +}
> +
[...]
> +
> +static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
> +{
> + const struct bpf_struct_ops *st_ops;
> + size_t map_total_size, st_map_size;
> + struct bpf_struct_ops_map *st_map;
> + const struct btf_type *t, *vt;
> + struct bpf_map_memory mem;
> + struct bpf_map *map;
> + int err;
> +
> + if (!capable(CAP_SYS_ADMIN))
> + return ERR_PTR(-EPERM);
> +
> + st_ops = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id);
> + if (!st_ops)
> + return ERR_PTR(-ENOTSUPP);
> +
> + vt = st_ops->value_type;
> + if (attr->value_size != vt->size)
> + return ERR_PTR(-EINVAL);
> +
> + t = st_ops->type;
> +
> + st_map_size = sizeof(*st_map) +
> + /* kvalue stores the
> + * struct bpf_struct_ops_tcp_congestion_ops
> + */
> + (vt->size - sizeof(struct bpf_struct_ops_value));
> + map_total_size = st_map_size +
> + /* uvalue */
> + vt->size +
> + /* struct bpf_progs **progs */
> + btf_type_vlen(t) * sizeof(struct bpf_prog *);
> + err = bpf_map_charge_init(&mem, map_total_size);
> + if (err < 0)
> + return ERR_PTR(err);
> +
> + st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE);
> + if (!st_map) {
> + bpf_map_charge_finish(&mem);
> + return ERR_PTR(-ENOMEM);
> + }
> + st_map->st_ops = st_ops;
> + map = &st_map->map;
> +
> + st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE);
> + st_map->progs =
> + bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_prog *),
> + NUMA_NO_NODE);
> + /* Each trampoline costs < 64 bytes. Ensure one page
> + * is enough for max number of func ptrs.
> + */
> + BUILD_BUG_ON(PAGE_SIZE / 64 < BPF_STRUCT_OPS_MAX_NR_MEMBERS);
This may be true for x86 now, but it may not hold for other future
architectures. Not sure whether we should get the value from arch
callbacks, or just bail out during map update if the trampolines ever
grow beyond one page.
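The bail-out variant could be as simple as (untested sketch;
MAX_TRAMP_SIZE being whatever per-entry upper bound the arch
guarantees):

	/* before emitting the next trampoline in update_elem() */
	if (image + MAX_TRAMP_SIZE > st_map->image + PAGE_SIZE) {
		err = -E2BIG;
		goto reset_unlock;
	}
	err = arch_prepare_bpf_trampoline(image, ...);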
> + st_map->image = bpf_jit_alloc_exec(PAGE_SIZE);
> + if (!st_map->uvalue || !st_map->progs || !st_map->image) {
> + bpf_struct_ops_map_free(map);
> + bpf_map_charge_finish(&mem);
> + return ERR_PTR(-ENOMEM);
> + }
> +
> + spin_lock_init(&st_map->lock);
> + set_vm_flush_reset_perms(st_map->image);
> + set_memory_x((long)st_map->image, 1);
> + bpf_map_init_from_attr(map, attr);
> + bpf_map_charge_move(&map->memory, &mem);
> +
> + return map;
> +}
> +
> +const struct bpf_map_ops bpf_struct_ops_map_ops = {
> + .map_alloc_check = bpf_struct_ops_map_alloc_check,
> + .map_alloc = bpf_struct_ops_map_alloc,
> + .map_free = bpf_struct_ops_map_free,
> + .map_get_next_key = bpf_struct_ops_map_get_next_key,
> + .map_lookup_elem = bpf_struct_ops_map_lookup_elem,
> + .map_delete_elem = bpf_struct_ops_map_delete_elem,
> + .map_update_elem = bpf_struct_ops_map_update_elem,
> + .map_seq_show_elem = bpf_struct_ops_map_seq_show_elem,
> +};
[...]