netdev - Re: [PATCH v2 bpf-next 1/6] bpf: Introduce bpf sk local storage

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20190426180411.vntu3jm2oyt5cghj@kafai-mbp.dhcp.thefacebook.com>
Date:   Fri, 26 Apr 2019 18:04:13 +0000
From:   Martin Lau <kafai@...com>
To:     Yonghong Song <yhs@...com>
CC:     "bpf@...r.kernel.org" <bpf@...r.kernel.org>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Alexei Starovoitov <ast@...com>,
        Andrii Nakryiko <andriin@...com>,
        Daniel Borkmann <daniel@...earbox.net>,
        John Fastabend <john.fastabend@...il.com>,
        Kernel Team <Kernel-team@...com>
Subject: Re: [PATCH v2 bpf-next 1/6] bpf: Introduce bpf sk local storage

On Fri, Apr 26, 2019 at 10:27:16AM -0700, Yonghong Song wrote:
> 
> 
> On 4/25/19 5:10 PM, Martin KaFai Lau wrote:
> > After allowing a bpf prog to
> > - directly read the skb->sk ptr
> > - get the fullsock bpf_sock by "bpf_sk_fullsock()"
> > - get the bpf_tcp_sock by "bpf_tcp_sock()"
> > - get the listener sock by "bpf_get_listener_sock()"
> > - avoid duplicating the fields of "(bpf_)sock" and "(bpf_)tcp_sock"
> >    into different bpf running context.
> > 
> > this patch is another effort to make bpf's network programming
> > more intuitive to do (together with memory and performance benefit).
> > 
> > When bpf prog needs to store data for a sk, the current practice is to
> > define a map with the usual 4-tuples (src/dst ip/port) as the key.
> > If multiple bpf progs require to store different sk data, multiple maps
> > have to be defined.  Hence, wasting memory to store the duplicated
> > keys (i.e. 4 tuples here) in each of the bpf map.
> > [ The smallest key could be the sk pointer itself which requires
> >    some enhancement in the verifier and it is a separate topic. ]
> > 
> > Also, the bpf prog needs to clean up the elem when sk is freed.
> > Otherwise, the bpf map will become full and un-usable quickly.
> > The sk-free tracking currently could be done during sk state
> > transition (e.g. BPF_SOCK_OPS_STATE_CB).
> > 
> > The size of the map needs to be predefined which then usually ended-up
> > with an over-provisioned map in production.  Even the map was re-sizable,
> > while the sk naturally come and go away already, this potential re-size
> > operation is arguably redundant if the data can be directly connected
> > to the sk itself instead of proxy-ing through a bpf map.
> > 
> > This patch introduces sk->sk_bpf_storage to provide local storage space
> > at sk for bpf prog to use.  The space will be allocated when the first bpf
> > prog has created data for this particular sk.
> > 
> > The design optimizes the bpf prog's lookup (and then optionally followed by
> > an inline update).  bpf_spin_lock should be used if the inline update needs
> > to be protected.
> > 
> > BPF_MAP_TYPE_SK_STORAGE:
> > -----------------------
> > To define a bpf "sk-local-storage", a BPF_MAP_TYPE_SK_STORAGE map (new in
> > this patch) needs to be created.  Multiple BPF_MAP_TYPE_SK_STORAGE maps can
> > be created to fit different bpf progs' needs.  The map enforces
> > BTF to allow printing the sk-local-storage during a system-wise
> > sk dump (e.g. "ss -ta") in the future.
> > 
> > The purpose of a BPF_MAP_TYPE_SK_STORAGE map is not for lookup/update/delete
> > a "sk-local-storage" data from a particular sk.
> > Think of the map as a meta-data (or "type") of a "sk-local-storage".  This
> > particular "type" of "sk-local-storage" data can then be stored in any sk.
> > 
> > The main purposes of this map are mostly:
> > 1. Define the size of a "sk-local-storage" type.
> > 2. Provide a similar syscall userspace API as the map (e.g. lookup/update,
> >     map-id, map-btf...etc.)
> > 3. Keep track of all sk's storages of this "type" and clean them up
> >     when the map is freed.
> > 
> > sk->sk_bpf_storage:
> > ------------------
> > The main lookup/update/delete is done on sk->sk_bpf_storage (which
> > is a "struct bpf_sk_storage").  When doing a lookup,
> > the "map" pointer is now used as the "key" to search on the
> > sk_storage->list.  The "map" pointer is actually serving
> > as the "type" of the "sk-local-storage" that is being
> > requested.
> > 
> > To allow very fast lookup, it should be as fast as looking up an
> > array at a stable-offset.  At the same time, it is not ideal to
> > set a hard limit on the number of sk-local-storage "type" that the
> > system can have.  Hence, this patch takes a cache approach.
> > The last search result from sk_storage->list is cached in
> > sk_storage->cache[] which is a stable sized array.  Each
> > "sk-local-storage" type has a stable offset to the cache[] array.
> > In the future, a map's flag could be introduced to do cache
> > opt-out/enforcement if it became necessary.
> > 
> > All data will be removed from sk->sk_bpf_storage during
> > sk destruction.
> > 
> > bpf_sk_storage_get() and bpf_sk_storage_delete():
> > ------------------------------------------------
> > Instead of using bpf_map_(lookup|update|delete)_elem(),
> > the bpf prog needs to use the new helper bpf_sk_storage_get() and
> > bpf_sk_storage_delete().  The verifier can then enforce the
> > ARG_PTR_TO_SOCKET argument.  The bpf_sk_storage_get() also allows to
> > "create" new elem if one does not exist in the sk.  It is done by
> > the new BPF_SK_STORAGE_GET_F_CREATE flag.  An optional value can also be
> > provided as the initial value during BPF_SK_STORAGE_GET_F_CREATE.
> > The BPF_MAP_TYPE_SK_STORAGE also supports bpf_spin_lock.  Together,
> > it has eliminated the potential use cases for an equivalent
> > bpf_map_update_elem() API (for bpf_prog) in this patch.
> > 
> > Misc notes:
> > ----------
> > 1. map_get_next_key is not supported.  From the userspace syscall
> >     perspective,  the map has the socket fd as the key while the map
> >     can be shared by pinned-file or map-id.
> > 
> >     Since btf is enforced, the existing "ss" could be enhanced to pretty
> >     print the local-storage.
> > 
> >     Supporting a kernel defined btf with 4 tuples as the return key could
> >     be explored later also.
> > 
> > 2. The sk->sk_lock cannot be acquired.  Atomic operations is used instead.
> >     e.g. cmpxchg is done on the sk->sk_bpf_storage ptr.
> >     Please refer to the source code comments for the details in
> >     synchronization cases and considerations.
> > 
> > 3. The mem is charged to the sk->sk_omem_alloc as the sk filter does.
> > 
> > Benchmark:
> > ---------
> > Here is the benchmark data collected by turning on
> > the "kernel.bpf_stats_enabled" sysctl.
> > Two bpf progs are tested:
> > 
> > One bpf prog with the usual bpf hashmap (max_entries = 8192) with the
> > sk ptr as the key. (verifier is modified to support sk ptr as the key
> > That should have shortened the key lookup time.)
> > 
> > Another bpf prog is with the new BPF_MAP_TYPE_SK_STORAGE.
> > 
> > Both are storing a "u32 cnt", do a lookup on "egress_skb/cgroup" for
> > each egress skb and then bump the cnt.  netperf is used to drive
> > data with 4096 connected UDP sockets.
> > 
> > BPF_MAP_TYPE_HASH with a modifier verifier (152ns per bpf run)
> > 27: cgroup_skb  name egress_sk_map  tag 74f56e832918070b run_time_ns 58280107540 run_cnt 381347633
> >      loaded_at 2019-04-15T13:46:39-0700  uid 0
> >      xlated 344B  jited 258B  memlock 4096B  map_ids 16
> >      btf_id 5
> > 
> > BPF_MAP_TYPE_SK_STORAGE in this patch (66ns per bpf run)
> > 30: cgroup_skb  name egress_sk_stora  tag d4aa70984cc7bbf6 run_time_ns 25617093319 run_cnt 390989739
> >      loaded_at 2019-04-15T13:47:54-0700  uid 0
> >      xlated 168B  jited 156B  memlock 4096B  map_ids 17
> >      btf_id 6
> > 
> > Here is a high-level picture on how are the objects organized:
> > 
> >         sk
> >      +------+
> >      |      |
> >      |      |
> >      |      |
> >      |*sk_bpf_storage-----> bpf_sk_storage
> >      +------+                 +-------+
> >                   +-----------+ list  |
> >                   |           |       |
> >                   |           |       |
> >                   |           |       |
> >                   |           +-------+
> >                   |
> >                   |     elem
> >                   |  +--------+
> >                   +->| snode  |
> >                   |  +--------+
> >                   |  |  data  |          bpf_map
> >                   |  +--------+        +---------+
> >                   |  |map_node|<-+-----+  list   |
> >                   |  +--------+  |     |         |
> >                   |              |     |         |
> >                   |     elem     |     |         |
> >                   |  +--------+  |     +---------+
> >                   +->| snode  |  |
> >                      +--------+  |
> >     bpf_map          |  data  |  |
> >   +---------+        +--------+  |
> >   |  list   +------->|map_node|  |
> >   |         |        +--------+  |
> >   |         |                    |
> >   |         |           elem     |
> >   +---------+        +--------+  |
> >                   +->| snode  |  |
> >                   |  +--------+  |
> >                   |  |  data  |  |
> >                   |  +--------+  |
> >                   |  |map_node|<-+
> >                   |  +--------+
> >                   |
> >                   |
> >                   |          +-------+
> >       sk          +----------| list  |
> >    +------+                  |       |
> >    |      |                  |       |
> >    |      |                  |       |
> >    |      |                  +-------+
> >    |*sk_bpf_storage------->bpf_sk_storage
> >    +------+
> > 
> > Signed-off-by: Martin KaFai Lau <kafai@...com>
> 
> Thanks. Generally look good. I have a few comments below.
> 
> > ---
> >   include/linux/bpf.h          |   2 +
> >   include/linux/bpf_types.h    |   1 +
> >   include/net/bpf_sk_storage.h |  13 +
> >   include/net/sock.h           |   5 +
> >   include/uapi/linux/bpf.h     |  44 +-
> >   kernel/bpf/syscall.c         |   3 +-
> >   kernel/bpf/verifier.c        |  27 +-
> >   net/bpf/test_run.c           |   2 +
> >   net/core/Makefile            |   1 +
> >   net/core/bpf_sk_storage.c    | 796 +++++++++++++++++++++++++++++++++++
> >   net/core/filter.c            |  12 +
> >   net/core/sock.c              |   5 +
> >   12 files changed, 906 insertions(+), 5 deletions(-)
> >   create mode 100644 include/net/bpf_sk_storage.h
> >   create mode 100644 net/core/bpf_sk_storage.c
> > 
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index f15432d90728..0d2788beacd2 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -184,6 +184,7 @@ enum bpf_arg_type {
> >   	ARG_PTR_TO_MAP_KEY,	/* pointer to stack used as map key */
> >   	ARG_PTR_TO_MAP_VALUE,	/* pointer to stack used as map value */
> >   	ARG_PTR_TO_UNINIT_MAP_VALUE,	/* pointer to valid memory used to store a map value */
> > +	ARG_PTR_TO_MAP_VALUE_OR_NULL,	/* pointer to stack used as map value or NULL */
> >   
> >   	/* the following constraints used to prototype bpf_memcmp() and other
> >   	 * functions that access data on eBPF program stack
> > @@ -204,6 +205,7 @@ enum bpf_arg_type {
> >   	ARG_PTR_TO_SOCK_COMMON,	/* pointer to sock_common */
> >   	ARG_PTR_TO_INT,		/* pointer to int */
> >   	ARG_PTR_TO_LONG,	/* pointer to long */
> > +	ARG_PTR_TO_SOCKET,	/* pointer to bpf_sock (fullsock) */
> >   };
> >   
> >   /* type of values returned from helper functions */
> > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > index d26991a16894..eb1e4f867a5e 100644
> > --- a/include/linux/bpf_types.h
> > +++ b/include/linux/bpf_types.h
> > @@ -60,6 +60,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY_OF_MAPS, array_of_maps_map_ops)
> >   BPF_MAP_TYPE(BPF_MAP_TYPE_HASH_OF_MAPS, htab_of_maps_map_ops)
> >   #ifdef CONFIG_NET
> >   BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
> > +BPF_MAP_TYPE(BPF_MAP_TYPE_SK_STORAGE, sk_storage_map_ops)
> >   #if defined(CONFIG_BPF_STREAM_PARSER)
> >   BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
> >   BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKHASH, sock_hash_ops)
> > diff --git a/include/net/bpf_sk_storage.h b/include/net/bpf_sk_storage.h
> > new file mode 100644
> > index 000000000000..b9dcb02e756b
> > --- /dev/null
> > +++ b/include/net/bpf_sk_storage.h
> > @@ -0,0 +1,13 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/* Copyright (c) 2019 Facebook */
> > +#ifndef _BPF_SK_STORAGE_H
> > +#define _BPF_SK_STORAGE_H
> > +
> > +struct sock;
> > +
> > +void bpf_sk_storage_free(struct sock *sk);
> > +
> > +extern const struct bpf_func_proto bpf_sk_storage_get_proto;
> > +extern const struct bpf_func_proto bpf_sk_storage_delete_proto;
> > +
> > +#endif /* _BPF_SK_STORAGE_H */
> [...]
> > +
> > +static struct bpf_sk_storage_data *
> > +__sk_storage_lookup(struct bpf_sk_storage *sk_storage,
> > +		    struct bpf_sk_storage_map *smap,
> > +		    bool cacheit_lockit)
> > +{
> > +	struct bpf_sk_storage_data *sdata;
> > +	struct bpf_sk_storage_elem *selem;
> > +
> > +	/* Fast path (cache hit) */
> > +	sdata = rcu_dereference(sk_storage->cache[smap->cache_idx]);
> > +	if (sdata && rcu_access_pointer(sdata->smap) == smap)
> > +		return sdata;
> > +
> > +	/* Slow path (cache miss) */
> > +	hlist_for_each_entry_rcu(selem, &sk_storage->list, snode)
> > +		if (rcu_access_pointer(SDATA(selem)->smap) == smap)
> > +			break;
> > +
> > +	if (!selem)
> > +		return NULL;
> > +
> > +	sdata = SDATA(selem);
> > +	if (cacheit_lockit) {
> > +		/* spinlock is needed to avoid racing with the
> > +		 * parallel delete.  Otherwise, publishing an already
> > +		 * deleted sdata to the cache will become a use-after-free
> > +		 * problem in the next __sk_storage_lookup().
> > +		 */
> > +		raw_spin_lock_bh(&sk_storage->lock);
> > +		if (selem_linked_to_sk(selem))
> > +			rcu_assign_pointer(sk_storage->cache[smap->cache_idx],
> > +					   sdata);
> > +		raw_spin_unlock_bh(&sk_storage->lock);
> > +	}
> > +
> > +	return sdata;
> > +}
> > +
> > +static struct bpf_sk_storage_data *
> > +sk_storage_lookup(struct sock *sk, struct bpf_map *map, bool cacheit_lockit)
> > +{
> > +	struct bpf_sk_storage *sk_storage;
> > +	struct bpf_sk_storage_map *smap;
> > +
> > +	sk_storage = rcu_dereference(sk->sk_bpf_storage);
> > +	if (!sk_storage)
> > +		return NULL;
> > +
> > +	smap = (struct bpf_sk_storage_map *)map;
> > +	return __sk_storage_lookup(sk_storage, smap, cacheit_lockit);
> > +}
> > +
> > +static int check_flags(const struct bpf_sk_storage_data *old_sdata,
> > +		       u64 map_flags)
> > +{
> > +	if (old_sdata && (map_flags & ~BPF_F_LOCK) == BPF_NOEXIST)
> > +		/* elem already exists */
> > +		return -EEXIST;
> > +
> > +	if (!old_sdata && (map_flags & ~BPF_F_LOCK) == BPF_EXIST)
> > +		/* elem doesn't exist, cannot update it */
> > +		return -ENOENT;
> > +
> > +	return 0;
> > +}
> > +
> > +static int sk_storage_alloc(struct sock *sk,
> > +			    struct bpf_sk_storage_map *smap,
> > +			    struct bpf_sk_storage_elem *first_selem)
> > +{
> > +	struct bpf_sk_storage *prev_sk_storage, *sk_storage;
> > +	int err;
> > +
> > +	err = omem_charge(sk, sizeof(*sk_storage));
> > +	if (err)
> > +		return err;
> > +
> > +	sk_storage = kzalloc(sizeof(*sk_storage), GFP_ATOMIC | __GFP_NOWARN);
> > +	if (!sk_storage) {
> > +		err = -ENOMEM;
> > +		goto uncharge;
> > +	}
> > +	INIT_HLIST_HEAD(&sk_storage->list);
> > +	raw_spin_lock_init(&sk_storage->lock);
> > +	sk_storage->sk = sk;
> > +
> > +	__selem_link_sk(sk_storage, first_selem);
> > +	selem_link_map(smap, first_selem);
> > +	/* Publish sk_storage to sk.  sk->sk_lock cannot be acquired.
> > +	 * Hence, atomic ops is used to set sk->sk_bpf_storage
> > +	 * from NULL to the newly allocated sk_storage ptr.
> > +	 *
> > +	 * From now on, the sk->sk_bpf_storage pointer is protected
> > +	 * by the sk_storage->lock.  Hence,  when freeing
> > +	 * the sk->sk_bpf_storage, the sk_storage->lock must
> > +	 * be held before setting sk->sk_bpf_storage to NULL.
> > +	 */
> > +	prev_sk_storage = cmpxchg((struct bpf_sk_storage **)&sk->sk_bpf_storage,
> > +				  NULL, sk_storage);
> > +	if (unlikely(prev_sk_storage)) {
> > +		selem_unlink_map(first_selem);
> > +		err = -EAGAIN;
> > +		goto uncharge;
> > +
> > +		/* Note that even first_selem was linked to smap's
> > +		 * bucket->list, first_selem can be freed immediately
> > +		 * (instead of kfree_rcu) because
> > +		 * bpf_sk_storage_map_free() does a
> > +		 * synchronize_rcu() before walking the bucket->list.
> > +		 * Hence, no one is accessing selem from the
> > +		 * bucket->list under rcu_read_lock().
> > +		 */
> > +	}
> > +
> > +	return 0;
> > +
> > +uncharge:
> > +	kfree(sk_storage);
> > +	atomic_sub(sizeof(*sk_storage), &sk->sk_omem_alloc);
> > +	return err;
> > +}
> > +
> > +/* sk cannot be going away because it is linking new elem
> > + * to sk->sk_bpf_storage. (i.e. sk->sk_refcnt cannot be 0).
> > + * Otherwise, it will become a leak (and other memory issues
> > + * during map destruction).
> > + */
> > +static struct bpf_sk_storage_data *sk_storage_update(struct sock *sk,
> > +						     struct bpf_map *map,
> > +						     void *value,
> > +						     u64 map_flags)
> > +{
> > +	struct bpf_sk_storage_data *old_sdata = NULL;
> > +	struct bpf_sk_storage_elem *selem;
> > +	struct bpf_sk_storage *sk_storage;
> > +	struct bpf_sk_storage_map *smap;
> > +	int err;
> > +
> > +	/* BPF_EXIST and BPF_NOEXIST cannot be both set */
> > +	if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST) ||
> > +	    /* BPF_F_LOCK can only be used in a value with spin_lock */
> > +	    unlikely((map_flags & BPF_F_LOCK) && !map_value_has_spin_lock(map)))
> > +		return ERR_PTR(-EINVAL);
> > +
> > +	smap = (struct bpf_sk_storage_map *)map;
> > +	sk_storage = rcu_dereference(sk->sk_bpf_storage);
> > +	if (!sk_storage || hlist_empty(&sk_storage->list)) {
> 
> Is it possible sk_storage is not NULL and hlist_empty(&sk_storage->list) 
> is true?
Very unlikely but possible.  Note that the sk->sk_bpf_storage ptr is protected
by atomic cmpxchg().

> Is this case, should we avoid allocation of new sk_storage at all?
It should still try which has a good chance that the cmpxchg(..., NULL, ...)
in sk_storage_alloc() will succeed.

> 
> > +		/* Very first elem for this sk */
> > +		err = check_flags(NULL, map_flags);
> > +		if (err)
> > +			return ERR_PTR(err);
> > +
> > +		selem = selem_alloc(smap, sk, value, true);
> > +		if (!selem)
> > +			return ERR_PTR(-ENOMEM);
> > +
> > +		err = sk_storage_alloc(sk, smap, selem);
> > +		if (err) {
> > +			kfree(selem);
> > +			atomic_sub(smap->elem_size, &sk->sk_omem_alloc);
> 
> discharding sk->sk_omem_alloc already happens in sk_storage_alloc?
They are discharging two different things.
sk_storage_alloc() discharges the sk_storage.
Here it dischages the selem.

> 
> > +			return ERR_PTR(err);
> > +		}
> > +
> > +		return SDATA(selem);
> > +	}
> > +
> > +	if ((map_flags & BPF_F_LOCK) && !(map_flags & BPF_NOEXIST)) {
> > +		/* Hoping to find an old_sdata to do inline update
> > +		 * such that it can avoid taking the sk_storage->lock
> > +		 * and changing the lists.
> > +		 */
> > +		old_sdata = __sk_storage_lookup(sk_storage, smap, false);
> > +		err = check_flags(old_sdata, map_flags);
> > +		if (err)
> > +			return ERR_PTR(err);
> > +		if (old_sdata && selem_linked_to_sk(SELEM(old_sdata))) {
> > +			copy_map_value_locked(map, old_sdata->data,
> > +					      value, false);
> > +			return old_sdata;
> > +		}
> > +	}
> > +
> > +	raw_spin_lock_bh(&sk_storage->lock);
> > +
> > +	/* Recheck sk_storage->list under sk_storage->lock */
> > +	if (unlikely(hlist_empty(&sk_storage->list))) {
> > +		/* A parallel del is happening and sk_storage is going
> > +		 * away.  It has just been checked before, so very
> > +		 * unlikely.  Return instead of retry to keep things
> > +		 * simple.
> > +		 */
> > +		err = -EAGAIN;
> > +		goto unlock_err;
> > +	}
> > +
> > +	old_sdata = __sk_storage_lookup(sk_storage, smap, false);
> > +	err = check_flags(old_sdata, map_flags);
> > +	if (err)
> > +		goto unlock_err;
> > +
> > +	if (old_sdata && (map_flags & BPF_F_LOCK)) {
> > +		copy_map_value_locked(map, old_sdata->data, value, false);
> > +		selem = SELEM(old_sdata);
> > +		goto unlock;
> > +	}
> > +
> > +	/* sk_storage->lock is held.  Hence, we are sure
> > +	 * we can unlink and uncharge the old_sdata successfully
> > +	 * later.  Hence, instead of charging the new selem now
> > +	 * and then uncharge the old selem later (which may cause
> > +	 * a potential but unnecessary charge failure),  avoid taking
> > +	 * a charge at all here (the "!old_sdata" check) and the
> > +	 * old_sdata will not be uncharged later during __selem_unlink_sk().
> > +	 */
> > +	selem = selem_alloc(smap, sk, value, !old_sdata);
> > +	if (!selem) {
> > +		err = -ENOMEM;
> > +		goto unlock_err;
> > +	}
> > +
> > +	/* First, link the new selem to the map */
> > +	selem_link_map(smap, selem);
> > +
> > +	/* Second, link (and publish) the new selem to sk_storage */
> > +	__selem_link_sk(sk_storage, selem);
> > +
> > +	/* Third, remove old selem, SELEM(old_sdata) */
> > +	if (old_sdata) {
> > +		selem_unlink_map(SELEM(old_sdata));
> > +		__selem_unlink_sk(sk_storage, SELEM(old_sdata), false);
> > +	}
> > +
> > +unlock:
> > +	raw_spin_unlock_bh(&sk_storage->lock);
> > +	return SDATA(selem);
> > +
> > +unlock_err:
> > +	raw_spin_unlock_bh(&sk_storage->lock);
> > +	return ERR_PTR(err);
> > +}
> > +
> > +static int sk_storage_delete(struct sock *sk, struct bpf_map *map)
> > +{
> > +	struct bpf_sk_storage_data *sdata;
> > +
> > +	sdata = sk_storage_lookup(sk, map, false);
> > +	if (!sdata)
> > +		return -ENOENT;
> > +
> > +	selem_unlink(SELEM(sdata));
> > +
> > +	return 0;
> > +}
> > +
> [...]
> > +static struct bpf_map *bpf_sk_storage_map_alloc(union bpf_attr *attr)
> > +{
> > +	struct bpf_sk_storage_map *smap;
> > +	unsigned int i;
> > +	u32 nbuckets;
> > +	u64 cost;
> > +
> > +	if (attr->map_flags != BPF_F_NO_PREALLOC || attr->max_entries ||
> > +	    attr->key_size != sizeof(int) || !attr->value_size ||
> > +	    /* Enforce BTF for userspace sk dumping */
> > +	    !attr->btf_key_type_id || !attr->btf_value_type_id)
> > +		return ERR_PTR(-EINVAL);
> > +
> > +	if (!capable(CAP_SYS_ADMIN))
> > +		return ERR_PTR(-EPERM);
> > +
> > +	if (attr->value_size >= KMALLOC_MAX_SIZE -
> > +	    MAX_BPF_STACK - sizeof(struct bpf_sk_storage_elem) ||
> > +	    /* selem->elem_size is a u16 */
> > +	    attr->value_size > U16_MAX - sizeof(struct bpf_sk_storage_elem))
> > +		return ERR_PTR(-E2BIG);
> > +
> > +	smap = kzalloc(sizeof(*smap), GFP_USER | __GFP_NOWARN);
> > +	if (!smap)
> > +		return ERR_PTR(-ENOMEM);
> > +	bpf_map_init_from_attr(&smap->map, attr);
> > +
> > +	smap->bucket_log = ilog2(roundup_pow_of_two(num_possible_cpus()));
> > +	nbuckets = 1U << smap->bucket_log;
> > +	smap->buckets = kvcalloc(sizeof(*smap->buckets), nbuckets,
> > +				 GFP_USER | __GFP_NOWARN);
> > +	if (!smap->buckets) {
> > +		kfree(smap);
> > +		return ERR_PTR(-ENOMEM);
> > +	}
> > +	cost = sizeof(*smap->buckets) * nbuckets + sizeof(*smap);
> > +
> > +	for (i = 0; i < nbuckets; i++) {
> > +		INIT_HLIST_HEAD(&smap->buckets[i].list);
> > +		raw_spin_lock_init(&smap->buckets[i].lock);
> > +	}
> > +
> > +	smap->elem_size = sizeof(struct bpf_sk_storage_elem) + attr->value_size;
> > +	smap->cache_idx = (unsigned int)atomic_inc_return(&cache_idx) %
> > +		BPF_SK_STORAGE_CACHE_SIZE;
> > +	smap->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
> 
> should we test bpf_map_precharge_memlock() here?
I did not do it here because the size of this map does not depend on
"map.max_entries".  Meaning,
this map merely needs the memory for the buckets which is less likely to
fail (and destroying in case of final charge failure is not that expensive
either) which then makes this precheck not as useful as other maps.

Thanks for reviewing!

> 
> > +
> > +	return &smap->map;
> > +}
> > +
> > +static int notsupp_get_next_key(struct bpf_map *map, void *key,
> > +				void *next_key)
> > +{
> > +	return -ENOTSUPP;
> > +}
> > +
> > +static int bpf_sk_storage_map_check_btf(const struct bpf_map *map,
> > +					const struct btf *btf,
> > +					const struct btf_type *key_type,
> > +					const struct btf_type *value_type)
> > +{
> > +	u32 int_data;
> > +
> > +	if (BTF_INFO_KIND(key_type->info) != BTF_KIND_INT)
> > +		return -EINVAL;
> > +
> > +	int_data = *(u32 *)(key_type + 1);
> > +	if (BTF_INT_BITS(int_data) != 32 || BTF_INT_OFFSET(int_data))
> > +		return -EINVAL;
> > +
> > +	return 0;
> > +}
> > +
> [...]> diff --git a/net/core/sock.c b/net/core/sock.c
> > index 443b98d05f1e..9773be724aa9 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -137,6 +137,7 @@
> >   
> >   #include <linux/filter.h>
> >   #include <net/sock_reuseport.h>
> > +#include <net/bpf_sk_storage.h>
> >   
> >   #include <trace/events/sock.h>
> >   
> > @@ -1709,6 +1710,10 @@ static void __sk_destruct(struct rcu_head *head)
> >   
> >   	sock_disable_timestamp(sk, SK_FLAGS_TIMESTAMP);
> >   
> > +#ifdef CONFIG_BPF_SYSCALL
> > +	bpf_sk_storage_free(sk);
> > +#endif
> > +
> >   	if (atomic_read(&sk->sk_omem_alloc))
> >   		pr_debug("%s: optmem leakage (%d bytes) detected\n",
> >   			 __func__, atomic_read(&sk->sk_omem_alloc));
> >