Message-ID: <7032bc3c-9375-876d-8d97-1ed21e94ae92@iogearbox.net>
Date: Sat, 16 Nov 2019 00:23:59 +0100
From: Daniel Borkmann <daniel@...earbox.net>
To: Andrii Nakryiko <andriin@...com>, bpf@...r.kernel.org,
netdev@...r.kernel.org, ast@...com
Cc: andrii.nakryiko@...il.com, kernel-team@...com
Subject: Re: [PATCH v4 bpf-next 1/4] bpf: switch bpf_map ref counter to 64bit
so bpf_map_inc never fails
On 11/15/19 5:02 AM, Andrii Nakryiko wrote:
> 92117d8443bc ("bpf: fix refcnt overflow") turned refcounting of bpf_map into
> a potentially failing operation once the refcount reaches the BPF_MAX_REFCNT
> limit (32k). With a 32-bit counter, it's possible in practice to overflow the
> refcounter and make it wrap around to 0, causing an erroneous map free while
> there are still references to it, leading to use-after-free problems.
You mention that it's 'possible in practice' to overflow the bpf map
refcount; do we need any fix for the bpf tree here?
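Purely as an illustration of what I'm asking about (a userspace-style sketch,
not kernel code): a plain 32-bit counter silently wrapping on the extra grab:

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint32_t refcnt = UINT32_MAX;	/* counter driven all the way up */

		refcnt++;			/* one more grab wraps it to 0 */
		printf("refcnt after wrap: %u\n", refcnt);	/* prints 0 */

		/* from here a single put looks like "dropping the last
		 * reference" and the object gets freed while real users
		 * still hold it */
		return 0;
	}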
> But having failing refcounting operations is problematic in some cases. One
> example is the mmap() interface. After establishing the initial memory
> mapping, the user is allowed to arbitrarily map/remap/unmap parts of the
> mapped memory, arbitrarily splitting it into multiple non-contiguous regions.
> All of this happens without any control from the users of the mmap subsystem.
> Instead, the mmap subsystem sends notifications to the original creator of
> the memory mapping through open/close callbacks, which are optionally
> specified when the memory mapping is created. These callbacks are used to
> maintain an accurate refcount for bpf_map (see the next patch in this
> series). The problem is that the open() callback is not supposed to fail,
> because the memory-mapped resource is already set up and properly referenced.
> This poses a problem for using memory-mapping with BPF maps.
>
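To make the open/close constraint concrete, here is a hypothetical sketch of
the pattern the next patch presumably relies on (the callback names and the
post-patch void bpf_map_inc() signature are my assumptions, not quotes from
the series):

	#include <linux/bpf.h>
	#include <linux/mm.h>

	/* Duplicating a VMA (fork, mremap, a partial munmap splitting it)
	 * invokes ->open(); there is no way to return an error from here,
	 * so the reference taken below must not be able to fail. */
	static void bpf_map_mmap_open(struct vm_area_struct *vma)
	{
		struct bpf_map *map = vma->vm_file->private_data;

		bpf_map_inc(map);
	}

	static void bpf_map_mmap_close(struct vm_area_struct *vma)
	{
		struct bpf_map *map = vma->vm_file->private_data;

		bpf_map_put(map);
	}

	static const struct vm_operations_struct bpf_map_default_vmops = {
		.open	= bpf_map_mmap_open,
		.close	= bpf_map_mmap_close,
	};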
> One solution is to maintain a separate refcount just for memory mappings and
> do a single bpf_map_inc/bpf_map_put when it goes from/to zero, respectively.
> There are similar use cases in the current work on tcp-bpf, necessitating an
> extra counter as well. This seems like a rather unfortunate and ugly solution
> that doesn't scale well to various new use cases.
>
> Another approach is to use the non-failing refcount_t type, which uses a
> 32-bit counter internally but, once it reaches the saturation point at
> UINT_MAX, stays there. This ultimately causes a memory leak, but prevents
> use-after-free.
>
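For comparison, a rough illustration of that refcount_t alternative (generic
kernel API usage, not taken from this patch): the increment never fails, and a
saturated counter never drops back to zero, so the object leaks instead of
being freed under its users:

	#include <linux/refcount.h>
	#include <linux/slab.h>

	struct foo {
		refcount_t refcnt;
	};

	static void foo_get(struct foo *f)
	{
		/* saturates instead of wrapping; never reports failure */
		refcount_inc(&f->refcnt);
	}

	static void foo_put(struct foo *f)
	{
		/* once saturated this never hits zero again -> memory leak,
		 * but no use-after-free */
		if (refcount_dec_and_test(&f->refcnt))
			kfree(f);
	}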
> But given that refcounting is not the most performance-critical operation
> with BPF maps (it's not used from running BPF program code), we can also just
> switch to a 64-bit counter that can't overflow in practice, potentially
> disadvantaging 32-bit platforms a tiny bit. This simplifies the semantics and
> means the scenarios described above no longer have to worry about a failing
> refcount increment operation.
>
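Based on that description, the non-failing increment presumably ends up
looking roughly like this (a sketch on my side; the corresponding hunks are
not fully quoted in this mail):

	void bpf_map_inc(struct bpf_map *map)
	{
		/* 64-bit counter: no BPF_MAX_REFCNT check, no error return */
		atomic64_inc(&map->refcnt);
	}

	void bpf_map_inc_with_uref(struct bpf_map *map)
	{
		atomic64_inc(&map->refcnt);
		atomic64_inc(&map->usercnt);
	}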
> In terms of struct bpf_map size, we are still good and use the same amount of
> space:
>
> BEFORE (3 cache lines, 8 bytes of padding at the end):
> struct bpf_map {
> const struct bpf_map_ops * ops __attribute__((__aligned__(64))); /* 0 8 */
> struct bpf_map * inner_map_meta; /* 8 8 */
> void * security; /* 16 8 */
> enum bpf_map_type map_type; /* 24 4 */
> u32 key_size; /* 28 4 */
> u32 value_size; /* 32 4 */
> u32 max_entries; /* 36 4 */
> u32 map_flags; /* 40 4 */
> int spin_lock_off; /* 44 4 */
> u32 id; /* 48 4 */
> int numa_node; /* 52 4 */
> u32 btf_key_type_id; /* 56 4 */
> u32 btf_value_type_id; /* 60 4 */
> /* --- cacheline 1 boundary (64 bytes) --- */
> struct btf * btf; /* 64 8 */
> struct bpf_map_memory memory; /* 72 16 */
> bool unpriv_array; /* 88 1 */
> bool frozen; /* 89 1 */
>
> /* XXX 38 bytes hole, try to pack */
>
> /* --- cacheline 2 boundary (128 bytes) --- */
> atomic_t refcnt __attribute__((__aligned__(64))); /* 128 4 */
> atomic_t usercnt; /* 132 4 */
> struct work_struct work; /* 136 32 */
> char name[16]; /* 168 16 */
>
> /* size: 192, cachelines: 3, members: 21 */
> /* sum members: 146, holes: 1, sum holes: 38 */
> /* padding: 8 */
> /* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
> } __attribute__((__aligned__(64)));
>
> AFTER (same 3 cache lines, no extra padding now):
> struct bpf_map {
> const struct bpf_map_ops * ops __attribute__((__aligned__(64))); /* 0 8 */
> struct bpf_map * inner_map_meta; /* 8 8 */
> void * security; /* 16 8 */
> enum bpf_map_type map_type; /* 24 4 */
> u32 key_size; /* 28 4 */
> u32 value_size; /* 32 4 */
> u32 max_entries; /* 36 4 */
> u32 map_flags; /* 40 4 */
> int spin_lock_off; /* 44 4 */
> u32 id; /* 48 4 */
> int numa_node; /* 52 4 */
> u32 btf_key_type_id; /* 56 4 */
> u32 btf_value_type_id; /* 60 4 */
> /* --- cacheline 1 boundary (64 bytes) --- */
> struct btf * btf; /* 64 8 */
> struct bpf_map_memory memory; /* 72 16 */
> bool unpriv_array; /* 88 1 */
> bool frozen; /* 89 1 */
>
> /* XXX 38 bytes hole, try to pack */
>
> /* --- cacheline 2 boundary (128 bytes) --- */
> atomic64_t refcnt __attribute__((__aligned__(64))); /* 128 8 */
> atomic64_t usercnt; /* 136 8 */
> struct work_struct work; /* 144 32 */
> char name[16]; /* 176 16 */
>
> /* size: 192, cachelines: 3, members: 21 */
> /* sum members: 154, holes: 1, sum holes: 38 */
> /* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
> } __attribute__((__aligned__(64)));
>
> This patch, while modifying all users of bpf_map_inc, also cleans up its
> interface to match bpf_map_put, with separate bpf_map_inc and
> bpf_map_inc_with_uref operations (matching bpf_map_put and
> bpf_map_put_with_uref, respectively). Also, given there are no users of
> bpf_map_inc_not_zero specifying uref=true, remove the uref flag and default
> to uref=false internally.
>
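If I read the cleanup correctly, the resulting interface pairs up like this
(assumed prototypes, just summarizing the description above):

	void bpf_map_inc(struct bpf_map *map);
	void bpf_map_inc_with_uref(struct bpf_map *map);
	struct bpf_map *bpf_map_inc_not_zero(struct bpf_map *map);

	void bpf_map_put(struct bpf_map *map);
	void bpf_map_put_with_uref(struct bpf_map *map);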
> Signed-off-by: Andrii Nakryiko <andriin@...com>
> ---
> .../net/ethernet/netronome/nfp/bpf/offload.c | 4 +-
> include/linux/bpf.h | 10 ++--
> kernel/bpf/inode.c | 2 +-
> kernel/bpf/map_in_map.c | 2 +-
> kernel/bpf/syscall.c | 51 ++++++++-----------
> kernel/bpf/verifier.c | 6 +--
> kernel/bpf/xskmap.c | 6 +--
> net/core/bpf_sk_storage.c | 2 +-
> 8 files changed, 34 insertions(+), 49 deletions(-)
>
[...]
> + refold = atomic64_fetch_add_unless(&map->refcnt, 1, 0);
> if (!refold)
> return ERR_PTR(-ENOENT);
> -
> if (uref)
> - atomic_inc(&map->usercnt);
> + atomic64_inc(&map->usercnt);
>
> return map;
> }
>
> -struct bpf_map *bpf_map_inc_not_zero(struct bpf_map *map, bool uref)
> +struct bpf_map *bpf_map_inc_not_zero(struct bpf_map *map)
> {
> spin_lock_bh(&map_idr_lock);
> - map = __bpf_map_inc_not_zero(map, uref);
> + map = __bpf_map_inc_not_zero(map, false);
> spin_unlock_bh(&map_idr_lock);
>
> return map;
> @@ -1454,6 +1444,9 @@ static struct bpf_prog *____bpf_prog_get(struct fd f)
> return f.file->private_data;
> }
>
> +/* prog's refcnt limit */
> +#define BPF_MAX_REFCNT 32768
> +
Would it make sense in this context to also convert the prog refcount over to
atomic64, so we get both converted in one go and they stay aligned? This would
likely simplify things even more wrt error paths (a rough sketch follows the
quoted hunk below).
> struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i)
> {
> if (atomic_add_return(i, &prog->aux->refcnt) > BPF_MAX_REFCNT) {
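A rough sketch of that suggestion (hypothetical, not part of this patch): with
an atomic64_t prog refcount the limit check and error path above could go away
completely, mirroring bpf_map_inc():

	void bpf_prog_add(struct bpf_prog *prog, int i)
	{
		atomic64_add(i, &prog->aux->refcnt);
	}

	void bpf_prog_inc(struct bpf_prog *prog)
	{
		atomic64_inc(&prog->aux->refcnt);
	}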