netdev - Re: [Potential Spoof] [PATCH bpf-next 06/13] bpf: Introduce BPF_MAP_TYPE_STRUCT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6ca4e58d-4bb3-bb4b-0fe2-e10a29a53d1f@fb.com>
Date:   Tue, 17 Dec 2019 07:48:18 +0000
From:   Yonghong Song <yhs@...com>
To:     Martin Lau <kafai@...com>,
        "bpf@...r.kernel.org" <bpf@...r.kernel.org>
CC:     Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        David Miller <davem@...emloft.net>,
        Kernel Team <Kernel-team@...com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [Potential Spoof] [PATCH bpf-next 06/13] bpf: Introduce
 BPF_MAP_TYPE_STRUCT_OPS



On 12/13/19 4:47 PM, Martin KaFai Lau wrote:
> The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
> is a kernel struct with its func ptr implemented in bpf prog.
> This new map is the interface to register/unregister/introspect
> a bpf implemented kernel struct.
> 
> The kernel struct is actually embedded inside another new struct
> (or called the "value" struct in the code).  For example,
> "struct tcp_congestion_ops" is embbeded in:
> struct __bpf_tcp_congestion_ops {
> 	refcount_t refcnt;
> 	enum bpf_struct_ops_state state;
> 	struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
> }
> The map value is "struct __bpf_tcp_congestion_ops".  The "bpftool map dump"
> will then be able to show the state ("inuse"/"tobefree") and the number of
> subsystem's refcnt (e.g. number of tcp_sock in the tcp_congestion_ops case).
> This "value" struct is created automatically by a macro.  Having a separate
> "value" struct will also make extending "struct __bpf_XYZ" easier (e.g. adding
> "void (*init)(void)" to "struct __bpf_XYZ" to do some initialization
> works before registering the struct_ops to the kernel subsystem).
> The libbpf will take care of finding and populating the "struct __bpf_XYZ"
> from "struct XYZ".
> 
> Register a struct_ops to a kernel subsystem:
> 1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
> 2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
>     set to the btf id "struct __bpf_tcp_congestion_ops" of the running
>     kernel.
>     Instead of reusing the attr->btf_value_type_id, btf_vmlinux_value_type_id
>     is added such that attr->btf_fd can still be used as the "user" btf
>     which could store other useful sysadmin/debug info that may be
>     introduced in the furture,
>     e.g. creation-date/compiler-details/map-creator...etc.
> 3. Create a "struct __bpf_tcp_congestion_ops" object as described in
>     the running kernel btf.  Populate the value of this object.
>     The function ptr should be populated with the prog fds.
> 4. Call BPF_MAP_UPDATE with the object created in (3) as
>     the map value.  The key is always "0".

This is really a special one element map. The key "0" should work.
Not sure whether we should generalize this and maps for global variables
to a kind of key-less map. Just some thought.

> 
> During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
> args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
> the specific struct_ops to do some final checks in "st_ops->init_member()"
> (e.g. ensure all mandatory func ptrs are implemented).
> If everything looks good, it will register this kernel struct
> to the kernel subsystem.  The map will not allow further update
> from this point.
> 
> Unregister a struct_ops from the kernel subsystem:
> BPF_MAP_DELETE with key "0".
> 
> Introspect a struct_ops:
> BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
> have the prog _id_ populated as the func ptr.
> 
> The map value state (enum bpf_struct_ops_state) will transit from:
> INIT (map created) =>
> INUSE (map updated, i.e. reg) =>
> TOBEFREE (map value deleted, i.e. unreg)
> 
> Note that the above state is not exposed to the uapi/bpf.h.
> It will be obtained from the btf of the running kernel.

It is not really from btf, right? It is from kernel internal
data structure which will be copied to user space.

Since such information is available to bpftool dump and is common
to all st_ops maps. I am wondering whether we should expose
this through uapi.

> 
> The kernel subsystem needs to call bpf_struct_ops_get() and
> bpf_struct_ops_put() to manage the "refcnt" in the "struct __bpf_XYZ".
> This patch uses a separate refcnt for the purose of tracking the
> subsystem usage.  Another approach is to reuse the map->refcnt
> and then "show" (i.e. during map_lookup) the subsystem's usage
> by doing map->refcnt - map->usercnt to filter out the
> map-fd/pinned-map usage.  However, that will also tie down the
> future semantics of map->refcnt and map->usercnt.
> 
> The very first subsystem's refcnt (during reg()) holds one
> count to map->refcnt.  When the very last subsystem's refcnt
> is gone, it will also release the map->refcnt.  All bpf_prog will be
> freed when the map->refcnt reaches 0 (i.e. during map_free()).
> 
> Here is how the bpftool map command will look like:
> [root@...h-fb-vm1 bpf]# bpftool map show
> 6: struct_ops  name dctcp  flags 0x0
> 	key 4B  value 256B  max_entries 1  memlock 4096B
> 	btf_id 6
> [root@...h-fb-vm1 bpf]# bpftool map dump id 6
> [{
>          "value": {
>              "refcnt": {
>                  "refs": {
>                      "counter": 1
>                  }
>              },
>              "state": 1,
>              "data": {
>                  "list": {
>                      "next": 0,
>                      "prev": 0
>                  },
>                  "key": 0,
>                  "flags": 2,
>                  "init": 24,
>                  "release": 0,
>                  "ssthresh": 25,
>                  "cong_avoid": 30,
>                  "set_state": 27,
>                  "cwnd_event": 28,
>                  "in_ack_event": 26,
>                  "undo_cwnd": 29,
>                  "pkts_acked": 0,
>                  "min_tso_segs": 0,
>                  "sndbuf_expand": 0,
>                  "cong_control": 0,
>                  "get_info": 0,
>                  "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0

Not related to this patch. It will be good if we can
make "name" printing better.

>                  ],
>                  "owner": 0
>              }
>          }
>      }
> ]
> 
> Misc Notes:
> * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
>    It does an inplace update on "*value" instead returning a pointer
>    to syscall.c.  Otherwise, it needs a separate copy of "zero" value
>    for the BPF_STRUCT_OPS_STATE_INIT to avoid races.
> 
> * The bpf_struct_ops_map_delete_elem() is also called without
>    preempt_disable() from map_delete_elem().  It is because
>    the "->unreg()" may requires sleepable context, e.g.
>    the "tcp_unregister_congestion_control()".

This probably fine, we do not per cpu data structure here and lookup
will fail after init stage. Some comments in the code will be good.

> 
> * "const" is added to some of the existing "struct btf_func_model *"
>    function arg to avoid a compiler warning caused by this patch.
> 
> Signed-off-by: Martin KaFai Lau <kafai@...com>
> ---
>   arch/x86/net/bpf_jit_comp.c |  10 +-
>   include/linux/bpf.h         |  49 +++-
>   include/linux/bpf_types.h   |   3 +
>   include/linux/btf.h         |  11 +
>   include/uapi/linux/bpf.h    |   7 +-
>   kernel/bpf/bpf_struct_ops.c | 465 +++++++++++++++++++++++++++++++++++-
>   kernel/bpf/btf.c            |  20 +-
>   kernel/bpf/map_in_map.c     |   3 +-
>   kernel/bpf/syscall.c        |  47 ++--
>   kernel/bpf/trampoline.c     |   5 +-
>   kernel/bpf/verifier.c       |   5 +
>   11 files changed, 585 insertions(+), 40 deletions(-)
> 
[...]
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index fadd243ffa2d..9f326e6ef885 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -109,3 +109,6 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops)
>   #endif
>   BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
>   BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
> +#if defined(CONFIG_BPF_JIT)
> +BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
> +#endif
> diff --git a/include/linux/btf.h b/include/linux/btf.h
> index f74a09a7120b..49094564f1f1 100644
> --- a/include/linux/btf.h
> +++ b/include/linux/btf.h
> @@ -60,6 +60,10 @@ const struct btf_type *btf_type_resolve_ptr(const struct btf *btf,
>   					    u32 id, u32 *res_id);
>   const struct btf_type *btf_type_resolve_func_ptr(const struct btf *btf,
>   						 u32 id, u32 *res_id);
> +const struct btf_type *
> +btf_resolve_size(const struct btf *btf, const struct btf_type *type,
> +		 u32 *type_size, const struct btf_type **elem_type,
> +		 u32 *total_nelems);
>   
>   #define for_each_member(i, struct_type, member)			\
>   	for (i = 0, member = btf_type_member(struct_type);	\
> @@ -106,6 +110,13 @@ static inline bool btf_type_kflag(const struct btf_type *t)
>   	return BTF_INFO_KFLAG(t->info);
>   }
>   
> +static inline u32 btf_member_bit_offset(const struct btf_type *struct_type,
> +					const struct btf_member *member)
> +{
> +	return btf_type_kflag(struct_type) ? BTF_MEMBER_BIT_OFFSET(member->offset)
> +					   : member->offset;
> +}
> +
>   static inline u32 btf_member_bitfield_size(const struct btf_type *struct_type,
>   					   const struct btf_member *member)
>   {
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 12900dfa1461..8809212d9d6c 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -136,6 +136,7 @@ enum bpf_map_type {
>   	BPF_MAP_TYPE_STACK,
>   	BPF_MAP_TYPE_SK_STORAGE,
>   	BPF_MAP_TYPE_DEVMAP_HASH,
> +	BPF_MAP_TYPE_STRUCT_OPS,
>   };
>   
>   /* Note that tracing related programs such as
> @@ -392,6 +393,10 @@ union bpf_attr {
>   		__u32	btf_fd;		/* fd pointing to a BTF type data */
>   		__u32	btf_key_type_id;	/* BTF type_id of the key */
>   		__u32	btf_value_type_id;	/* BTF type_id of the value */
> +		__u32	btf_vmlinux_value_type_id;/* BTF type_id of a kernel-
> +						   * struct stored as the
> +						   * map value
> +						   */
>   	};
>   
>   	struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
> @@ -3340,7 +3345,7 @@ struct bpf_map_info {
>   	__u32 map_flags;
>   	char  name[BPF_OBJ_NAME_LEN];
>   	__u32 ifindex;
> -	__u32 :32;
> +	__u32 btf_vmlinux_value_type_id;
>   	__u64 netns_dev;
>   	__u64 netns_ino;
>   	__u32 btf_id;
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index 817d5aac42e5..00f49ac1342d 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -12,8 +12,68 @@
>   #include <linux/seq_file.h>
>   #include <linux/refcount.h>
>   
> +enum bpf_struct_ops_state {
> +	BPF_STRUCT_OPS_STATE_INIT,
> +	BPF_STRUCT_OPS_STATE_INUSE,
> +	BPF_STRUCT_OPS_STATE_TOBEFREE,
> +};
> +
> +#define BPF_STRUCT_OPS_COMMON_VALUE			\
> +	refcount_t refcnt;				\
> +	enum bpf_struct_ops_state state
> +
> +struct bpf_struct_ops_value {
> +	BPF_STRUCT_OPS_COMMON_VALUE;
> +	char data[0] ____cacheline_aligned_in_smp;
> +};
> +
> +struct bpf_struct_ops_map {
> +	struct bpf_map map;
> +	const struct bpf_struct_ops *st_ops;
> +	/* protect map_update */
> +	spinlock_t lock;
> +	/* progs has all the bpf_prog that is populated
> +	 * to the func ptr of the kernel's struct
> +	 * (in kvalue.data).
> +	 */
> +	struct bpf_prog **progs;
> +	/* image is a page that has all the trampolines
> +	 * that stores the func args before calling the bpf_prog.
> +	 * A PAGE_SIZE "image" is enough to store all trampoline for
> +	 * "progs[]".
> +	 */
> +	void *image;
> +	/* uvalue->data stores the kernel struct
> +	 * (e.g. tcp_congestion_ops) that is more useful
> +	 * to userspace than the kvalue.  For example,
> +	 * the bpf_prog's id is stored instead of the kernel
> +	 * address of a func ptr.
> +	 */
> +	struct bpf_struct_ops_value *uvalue;
> +	/* kvalue.data stores the actual kernel's struct
> +	 * (e.g. tcp_congestion_ops) that will be
> +	 * registered to the kernel subsystem.
> +	 */
> +	struct bpf_struct_ops_value kvalue;
> +};
> +
> +#define VALUE_PREFIX "__bpf_"
> +#define VALUE_PREFIX_LEN (sizeof(VALUE_PREFIX) - 1)
> +
> +/* __bpf_##_name (e.g. __bpf_tcp_congestion_ops) is the map's value
> + * exposed to the userspace and its btf-type-id is stored
> + * at the map->btf_vmlinux_value_type_id.
> + *
> + * The *_name##_dummy is to ensure the BTF type is emitted.
> + */
> +
>   #define BPF_STRUCT_OPS_TYPE(_name)				\
> -extern struct bpf_struct_ops bpf_##_name;
> +extern struct bpf_struct_ops bpf_##_name;			\
> +								\
> +static struct __bpf_##_name {					\
> +	BPF_STRUCT_OPS_COMMON_VALUE;				\
> +	struct _name data ____cacheline_aligned_in_smp;		\
> +} *_name##_dummy;

There are other ways to retain types in debug info without
creating new variables. For example, you can use it in a cast
like
     (void *)(struct __bpf_##_name *)v
Not sure whether we could easily find a place for such casting or not.

>   #include "bpf_struct_ops_types.h"
>   #undef BPF_STRUCT_OPS_TYPE
>   
> @@ -37,19 +97,46 @@ const struct bpf_verifier_ops bpf_struct_ops_verifier_ops = {
>   const struct bpf_prog_ops bpf_struct_ops_prog_ops = {
>   };
>   
> +static const struct btf_type *module_type;
> +
>   void bpf_struct_ops_init(struct btf *_btf_vmlinux)
>   {
> +	char value_name[128] = VALUE_PREFIX;
> +	s32 type_id, value_id, module_id;
>   	const struct btf_member *member;
>   	struct bpf_struct_ops *st_ops;
>   	struct bpf_verifier_log log = {};
>   	const struct btf_type *t;
>   	const char *mname;
> -	s32 type_id;
>   	u32 i, j;
>   
> +	/* Avoid unused var compiler warning */
> +#define BPF_STRUCT_OPS_TYPE(_name) (void)(_name##_dummy);
> +#include "bpf_struct_ops_types.h"
> +#undef BPF_STRUCT_OPS_TYPE
> +
> +	module_id = btf_find_by_name_kind(_btf_vmlinux, "module",
> +					  BTF_KIND_STRUCT);
> +	if (module_id < 0) {
> +		pr_warn("Cannot find struct module in btf_vmlinux\n");
> +		return;
> +	}
> +	module_type = btf_type_by_id(_btf_vmlinux, module_id);
> +
>   	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
>   		st_ops = bpf_struct_ops[i];
>   
> +		value_name[VALUE_PREFIX_LEN] = '\0';
> +		strncat(value_name + VALUE_PREFIX_LEN, st_ops->name,
> +			sizeof(value_name) - VALUE_PREFIX_LEN - 1);

Do we have restrictions on the length of st_ops->name?
We probably do not want truncation, right?

> +		value_id = btf_find_by_name_kind(_btf_vmlinux, value_name,
> +						 BTF_KIND_STRUCT);
> +		if (value_id < 0) {
> +			pr_warn("Cannot find struct %s in btf_vmlinux\n",
> +				value_name);
> +			continue;
> +		}
> +
>   		type_id = btf_find_by_name_kind(_btf_vmlinux, st_ops->name,
>   						BTF_KIND_STRUCT);
>   		if (type_id < 0) {
> @@ -101,6 +188,9 @@ void bpf_struct_ops_init(struct btf *_btf_vmlinux)
>   			} else {
>   				st_ops->type_id = type_id;
>   				st_ops->type = t;
> +				st_ops->value_id = value_id;
> +				st_ops->value_type =
> +					btf_type_by_id(_btf_vmlinux, value_id);
>   			}
>   		}
>   	}
> @@ -108,6 +198,22 @@ void bpf_struct_ops_init(struct btf *_btf_vmlinux)
>   
>   extern struct btf *btf_vmlinux;
>   
> +static const struct bpf_struct_ops *
> +bpf_struct_ops_find_value(u32 value_id)
> +{
> +	unsigned int i;
> +
> +	if (!value_id || !btf_vmlinux)
> +		return NULL;
> +
> +	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
> +		if (bpf_struct_ops[i]->value_id == value_id)
> +			return bpf_struct_ops[i];
> +	}
> +
> +	return NULL;
> +}
> +
>   const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
>   {
>   	unsigned int i;
> @@ -122,3 +228,358 @@ const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
>   
>   	return NULL;
>   }
> +
> +static int bpf_struct_ops_map_get_next_key(struct bpf_map *map, void *key,
> +					   void *next_key)
> +{
> +	u32 index = key ? *(u32 *)key : U32_MAX;
> +	u32 *next = (u32 *)next_key;
> +
> +	if (index >= map->max_entries) {
> +		*next = 0;
> +		return 0;
> +	}

We know the the max_entries must be 1. Maybe we can simplify the code
accordingly.

> +
> +	if (index == map->max_entries - 1)
> +		return -ENOENT;
> +
> +	*next = index + 1;
> +	return 0;
> +}
> +
> +int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key,
> +				       void *value)
> +{
> +	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
> +	struct bpf_struct_ops_value *uvalue, *kvalue;
> +	enum bpf_struct_ops_state state;
> +
> +	if (unlikely(*(u32 *)key != 0))
> +		return -ENOENT;
> +
> +	kvalue = &st_map->kvalue;
> +	state = smp_load_acquire(&kvalue->state);

Some one-line comment here? Also for below smp_store_release()?
A simple comment for important synchronization/barrier point will help
code review a lot.

> +	if (state == BPF_STRUCT_OPS_STATE_INIT) {
> +		memset(value, 0, map->value_size);
> +		return 0;
> +	}
> +
> +	/* No lock is needed.  state and refcnt do not need
> +	 * to be updated together under atomic context.
> +	 */
> +	uvalue = (struct bpf_struct_ops_value *)value;
> +	memcpy(uvalue, st_map->uvalue, map->value_size);
> +	uvalue->state = state;
> +	refcount_set(&uvalue->refcnt, refcount_read(&kvalue->refcnt));
> +
> +	return 0;
> +}
> +
> +static void *bpf_struct_ops_map_lookup_elem(struct bpf_map *map, void *key)
> +{
> +	return ERR_PTR(-EINVAL);
> +}
> +
> +static void bpf_struct_ops_map_put_progs(struct bpf_struct_ops_map *st_map)
> +{
> +	const struct btf_type *t = st_map->st_ops->type;
> +	u32 i;
> +
> +	for (i = 0; i < btf_type_vlen(t); i++) {
> +		if (st_map->progs[i]) {
> +			bpf_prog_put(st_map->progs[i]);
> +			st_map->progs[i] = NULL;
> +		}
> +	}
> +}
> +
> +static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
> +					  void *value, u64 flags)
> +{
> +	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
> +	const struct bpf_struct_ops *st_ops = st_map->st_ops;
> +	struct bpf_struct_ops_value *uvalue, *kvalue;
> +	const struct btf_member *member;
> +	const struct btf_type *t = st_ops->type;
> +	void *udata, *kdata;
> +	int prog_fd, err = 0;
> +	void *image;
> +	u32 i;
> +
> +	if (flags)
> +		return -EINVAL;
> +
> +	if (*(u32 *)key != 0)
> +		return -E2BIG;
> +
> +	uvalue = (struct bpf_struct_ops_value *)value;
> +	if (uvalue->state || refcount_read(&uvalue->refcnt))
> +		return -EINVAL;
> +
> +	uvalue = (struct bpf_struct_ops_value *)st_map->uvalue;
> +	kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue;
> +
> +	spin_lock(&st_map->lock);
> +
> +	if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) {
> +		err = -EBUSY;
> +		goto unlock;
> +	}
> +
> +	memcpy(uvalue, value, map->value_size);
> +
> +	udata = &uvalue->data;
> +	kdata = &kvalue->data;
> +	image = st_map->image;
> +
> +	for_each_member(i, t, member) {
> +		const struct btf_type *mtype, *ptype;
> +		struct bpf_prog *prog;
> +		u32 moff;
> +
> +		moff = btf_member_bit_offset(t, member) / 8;
> +		mtype = btf_type_by_id(btf_vmlinux, member->type);
> +		ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL);
> +		if (ptype == module_type) {
> +			*(void **)(kdata + moff) = BPF_MODULE_OWNER;
> +			continue;
> +		}
> +
> +		err = st_ops->init_member(t, member, kdata, udata);
> +		if (err < 0)
> +			goto reset_unlock;
> +
> +		/* The ->init_member() has handled this member */
> +		if (err > 0)
> +			continue;
> +
> +		/* If st_ops->init_member does not handle it,
> +		 * we will only handle func ptrs and zero-ed members
> +		 * here.  Reject everything else.
> +		 */
> +
> +		/* All non func ptr member must be 0 */
> +		if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type,
> +					       NULL)) {
> +			u32 msize;
> +
> +			mtype = btf_resolve_size(btf_vmlinux, mtype,
> +						 &msize, NULL, NULL);
> +			if (IS_ERR(mtype)) {
> +				err = PTR_ERR(mtype);
> +				goto reset_unlock;
> +			}
> +
> +			if (memchr_inv(udata + moff, 0, msize)) {
> +				err = -EINVAL;
> +				goto reset_unlock;
> +			}
> +
> +			continue;
> +		}
> +
> +		prog_fd = (int)(*(unsigned long *)(udata + moff));
> +		/* Similar check as the attr->attach_prog_fd */
> +		if (!prog_fd)
> +			continue;
> +
> +		prog = bpf_prog_get(prog_fd);
> +		if (IS_ERR(prog)) {
> +			err = PTR_ERR(prog);
> +			goto reset_unlock;
> +		}
> +		st_map->progs[i] = prog;
> +
> +		if (prog->type != BPF_PROG_TYPE_STRUCT_OPS ||
> +		    prog->aux->attach_btf_id != st_ops->type_id ||
> +		    prog->expected_attach_type != i) {
> +			err = -EINVAL;
> +			goto reset_unlock;
> +		}
> +
> +		err = arch_prepare_bpf_trampoline(image,
> +						  &st_ops->func_models[i], 0,
> +						  &prog, 1, NULL, 0, NULL);
> +		if (err < 0)
> +			goto reset_unlock;
> +
> +		*(void **)(kdata + moff) = image;
> +		image += err;

Do we still want to check whether image out of page boundary or not?

> +
> +		/* put prog_id to udata */
> +		*(unsigned long *)(udata + moff) = prog->aux->id;
> +	}
> +
> +	refcount_set(&kvalue->refcnt, 1);
> +	bpf_map_inc(map);
> +
> +	err = st_ops->reg(kdata);
> +	if (!err) {
> +		smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_INUSE);
> +		goto unlock;
> +	}
> +
> +	/* Error during st_ops->reg() */
> +	bpf_map_put(map);
> +
> +reset_unlock:
> +	bpf_struct_ops_map_put_progs(st_map);
> +	memset(uvalue, 0, map->value_size);
> +	memset(kvalue, 0, map->value_size);
> +
> +unlock:
> +	spin_unlock(&st_map->lock);
> +	return err;
> +}
> +
> +static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key)
> +{
> +	enum bpf_struct_ops_state prev_state;
> +	struct bpf_struct_ops_map *st_map;
> +
> +	st_map = (struct bpf_struct_ops_map *)map;
> +	prev_state = cmpxchg(&st_map->kvalue.state,
> +			     BPF_STRUCT_OPS_STATE_INUSE,
> +			     BPF_STRUCT_OPS_STATE_TOBEFREE);
> +	if (prev_state == BPF_STRUCT_OPS_STATE_INUSE) {
> +		st_map->st_ops->unreg(&st_map->kvalue.data);
> +		if (refcount_dec_and_test(&st_map->kvalue.refcnt))
> +			bpf_map_put(map);
> +	}
> +
> +	return 0;
> +}
> +
[...]