Message-ID: <CAMB2axNtrFEL0x+j6M3De-ezR38cv5s7LFtpuAeN7QCf_AyHaQ@mail.gmail.com>
Date: Mon, 12 Jan 2026 14:38:13 -0800
From: Amery Hung <ameryhung@...il.com>
To: Martin KaFai Lau <martin.lau@...ux.dev>
Cc: netdev@...r.kernel.org, alexei.starovoitov@...il.com, andrii@...nel.org,
daniel@...earbox.net, memxor@...il.com, martin.lau@...nel.org,
kpsingh@...nel.org, yonghong.song@...ux.dev, song@...nel.org,
haoluo@...gle.com, kernel-team@...a.com, bpf@...r.kernel.org
Subject: Re: [PATCH bpf-next v3 10/16] bpf: Support lockless unlink when
freeing map or local storage
On Fri, Jan 9, 2026 at 1:38 PM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
>
>
>
> On 1/9/26 12:47 PM, Amery Hung wrote:
> > On Fri, Jan 9, 2026 at 12:16 PM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
> >> On 12/18/25 9:56 AM, Amery Hung wrote:
> >>> Introduce bpf_selem_unlink_lockless() to properly handle errors returned
> >>> from rqspinlock in bpf_local_storage_map_free() and
> >>> bpf_local_storage_destroy() where the operation must succeed.
> >>>
> >>> The idea of bpf_selem_unlink_lockless() is to allow an selem to be
> >>> partially linked and use a refcount to determine when and by whom the
> >>> selem can be freed. An selem is initially fully linked to a map and a
> >>> local storage, and therefore selem->link_cnt is set to 2. Under normal
> >>> circumstances, bpf_selem_unlink_lockless() will be able to grab the
> >>> locks and unlink an selem from the map and the local storage in
> >>> sequence, just like bpf_selem_unlink(), and then add it to a local
> >>> to_free list provided by
> >>> the caller. However, if any of the lock attempts fails, it will
> >>> only clear SDATA(selem)->smap or selem->local_storage depending on the
> >>> caller and decrement link_cnt to signal that the corresponding data
> >>> structure holding a reference to the selem is gone. Then, only when both
> >>> map and local storage are gone, an selem can be freed by the last
> >>> caller that drops link_cnt to 0.
> >>>
> >>> To make sure bpf_obj_free_fields() is done only once and while the map is
> >>> still present, it is called when unlinking an selem from b->list under
> >>> b->lock.
> >>>
> >>> To make sure uncharging memory is only done once and while the owner is
> >>> still present, only unlink the selem from local_storage->list in
> >>> bpf_local_storage_destroy() and return the amount of memory to uncharge
> >>> to the caller (i.e., owner) since the map associated with an selem may
> >>> already be gone and map->ops->map_local_storage_uncharge can no longer
> >>> be referenced.
> >>>
> >>> Finally, accesses of selem, SDATA(selem)->smap and selem->local_storage
> >>> are racy. Callers will protect these fields with RCU.
> >>>
> >>> Co-developed-by: Martin KaFai Lau <martin.lau@...nel.org>
> >>> Signed-off-by: Martin KaFai Lau <martin.lau@...nel.org>
> >>> Signed-off-by: Amery Hung <ameryhung@...il.com>
> >>> ---
> >>> include/linux/bpf_local_storage.h | 2 +-
> >>> kernel/bpf/bpf_local_storage.c | 77 +++++++++++++++++++++++++++++--
> >>> 2 files changed, 74 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> >>> index 20918c31b7e5..1fd908c44fb6 100644
> >>> --- a/include/linux/bpf_local_storage.h
> >>> +++ b/include/linux/bpf_local_storage.h
> >>> @@ -80,9 +80,9 @@ struct bpf_local_storage_elem {
> >>> * after raw_spin_unlock
> >>> */
> >>> };
> >>> + atomic_t link_cnt;
> >>> u16 size;
> >>> bool use_kmalloc_nolock;
> >>> - /* 4 bytes hole */
> >>> /* The data is stored in another cacheline to minimize
> >>> * the number of cachelines access during a cache hit.
> >>> */
> >>> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> >>> index 62201552dca6..4c682d5aef7f 100644
> >>> --- a/kernel/bpf/bpf_local_storage.c
> >>> +++ b/kernel/bpf/bpf_local_storage.c
> >>> @@ -97,6 +97,7 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
> >>> if (swap_uptrs)
> >>> bpf_obj_swap_uptrs(smap->map.record, SDATA(selem)->data, value);
> >>> }
> >>> + atomic_set(&selem->link_cnt, 2);
> >>> selem->size = smap->elem_size;
> >>> selem->use_kmalloc_nolock = smap->use_kmalloc_nolock;
> >>> return selem;
> >>> @@ -200,9 +201,11 @@ static void bpf_selem_free_rcu(struct rcu_head *rcu)
> >>> /* The bpf_local_storage_map_free will wait for rcu_barrier */
> >>> smap = rcu_dereference_check(SDATA(selem)->smap, 1);
> >>>
> >>> - migrate_disable();
> >>> - bpf_obj_free_fields(smap->map.record, SDATA(selem)->data);
> >>> - migrate_enable();
> >>> + if (smap) {
> >>> + migrate_disable();
> >>> + bpf_obj_free_fields(smap->map.record, SDATA(selem)->data);
> >>> + migrate_enable();
> >>> + }
> >>> kfree_nolock(selem);
> >>> }
> >>>
> >>> @@ -227,7 +230,8 @@ void bpf_selem_free(struct bpf_local_storage_elem *selem,
> >>> * is only supported in task local storage, where
> >>> * smap->use_kmalloc_nolock == true.
> >>> */
> >>> - bpf_obj_free_fields(smap->map.record, SDATA(selem)->data);
> >>> + if (smap)
> >>> + bpf_obj_free_fields(smap->map.record, SDATA(selem)->data);
> >>> __bpf_selem_free(selem, reuse_now);
> >>> return;
> >>> }
> >>> @@ -419,6 +423,71 @@ int bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool reuse_now)
> >>> return err;
> >>> }
> >>>
> >>> +/* Callers of bpf_selem_unlink_lockless() */
> >>> +#define BPF_LOCAL_STORAGE_MAP_FREE 0
> >>> +#define BPF_LOCAL_STORAGE_DESTROY 1
> >>> +
> >>> +/*
> >>> + * Unlink an selem from map and local storage with lockless fallback if callers
> >>> + * are racing or rqspinlock returns an error. It should only be called by
> >>> + * bpf_local_storage_destroy() or bpf_local_storage_map_free().
> >>> + */
> >>> +static void bpf_selem_unlink_lockless(struct bpf_local_storage_elem *selem,
> >>> + struct hlist_head *to_free, int caller)
> >>> +{
> >>> + struct bpf_local_storage *local_storage;
> >>> + struct bpf_local_storage_map_bucket *b;
> >>> + struct bpf_local_storage_map *smap;
> >>> + unsigned long flags;
> >>> + int err, unlink = 0;
> >>> +
> >>> + local_storage = rcu_dereference_check(selem->local_storage, bpf_rcu_lock_held());
> >>> + smap = rcu_dereference_check(SDATA(selem)->smap, bpf_rcu_lock_held());
> >>> +
> >>> + /*
> >>> + * Free special fields immediately as SDATA(selem)->smap will be cleared.
> >>> + * No BPF program should be reading the selem.
> >>> + */
> >>> + if (smap) {
> >>> + b = select_bucket(smap, selem);
> >>> + err = raw_res_spin_lock_irqsave(&b->lock, flags);
> >>> + if (!err) {
> >>> + if (likely(selem_linked_to_map(selem))) {
> >>> + hlist_del_init_rcu(&selem->map_node);
> >>> + bpf_obj_free_fields(smap->map.record, SDATA(selem)->data);
> >>> + RCU_INIT_POINTER(SDATA(selem)->smap, NULL);
> >>> + unlink++;
> >>> + }
> >>> + raw_res_spin_unlock_irqrestore(&b->lock, flags);
> >>> + } else if (caller == BPF_LOCAL_STORAGE_MAP_FREE) {
> >>> + RCU_INIT_POINTER(SDATA(selem)->smap, NULL);
> >>
> >> I suspect I am missing something obvious, so it will be faster to ask here.
> >>
> >> I could see why the NULL init can work if we could assume the map_free
> >> caller always succeeds in taking the b->lock, i.e. the "if (!err)" path above.
> >>
> >> If the b->lock did fail here for the map_free caller, it resets
> >> SDATA(selem)->smap to NULL. Who can do the bpf_obj_free_fields() in
> >> the future?
> >
> > I think this can only mean destroy() is holding the lock, and
> > destroy() should do bpf_selem_unlink_map_nolock(). Though I am not
>
> hmm... instead of bpf_selem_unlink_map_nolock(), do you mean the "if
> (!err)" path in this bpf_selem_unlink_lockless() function? Or are we
> talking about different things?
Ah yes. Sorry for the confusion.
>
> [ btw, a nit, I think it can use a better function name instead of
> "lockless". This function still takes the lock if it can. ]
Does using a _nofail suffix make it clearer?
>
> > sure how destroy() can hold b->lock in a way that causes map_free() to
> > fail acquiring b->lock.
>
> I recall ETIMEDOUT was mentioned to be the likely (only?) case here.
> Assuming the b->lock did fail in map_free() here, there can be >1 selem
> using the same b->lock. Is it always true that the selem that failed at
> the b->lock in map_free() must race with the very same selem in
> destroy()?
>
> >
> >>
> >>> + }
> >>> + }
> >>> +
> >>> + /*
> >>> + * Only let destroy() unlink from local_storage->list and do mem_uncharge
> >>> + * as owner is guaranteed to be valid in destroy().
> >>> + */
> >>> + if (local_storage && caller == BPF_LOCAL_STORAGE_DESTROY) {
> >>
> >> If I read this correctly, only bpf_local_storage_destroy() can do the
> >> bpf_selem_free(). For example, if a bpf_sk_storage_map is going away,
> >> the selem (which is memcg charged) will stay in the sk until the sk is
> >> closed?
> >
> > You read it correctly, and yes, there will be stale elements in
> > local_storage->list.
> >
> > I would hope the unlink from local_storage part is doable from
> > map_free() and destroy(), but it is hard to guarantee in a lockless
> > way that (1) mem_uncharge is done only once and (2) the owner is
> > still valid.
>
> This needs to be addressed. It cannot leave the selem lingering. At
> least the selem should be freed for the common case (i.e., succeeds in
> both locks). Lingering selem is ok in case of lock failure. Many sk can
> be long-lived connections. The user space may want to recreate the map,
> and it will be limited by the memcg.
I think this is doable by maintaining a local memory charge count in
the local storage. That is, remove selem->size and keep a running
total of the charged selem memory in a new local_storage->selem_size.
Updates will be protected by local_storage->lock in the common paths
(not in bpf_selem_unlink_nofail). More specifically, charge in
bpf_selem_link_storage_nolock() when a selem is about to be
publicized, and uncharge in bpf_selem_unlink_storage_nolock() when a
selem is being deleted. Then, in destroy(), we simply read the total
amount to uncharge from the owner out of local_storage->selem_size.
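
Roughly, an untested sketch (only the relevant lines of each function
are shown; uncharge_size is a made-up out param for destroy() to
report the total back to its caller):

	struct bpf_local_storage {
		...
		u32 selem_size;	/* total charged bytes, under ->lock */
	};

	/* bpf_selem_link_storage_nolock(), local_storage->lock held */
	local_storage->selem_size += smap->elem_size;
	hlist_add_head_rcu(&selem->snode, &local_storage->list);

	/* bpf_selem_unlink_storage_nolock(), local_storage->lock held */
	local_storage->selem_size -= smap->elem_size;
	hlist_del_init_rcu(&selem->snode);

	/* bpf_local_storage_destroy(): the owner is still valid here,
	 * so hand the remaining total back for the owner to uncharge
	 * exactly once.
	 */
	*uncharge_size = local_storage->selem_size;

Since selem_size is only updated under local_storage->lock in the
common paths, the nofail path never has to touch it, and destroy()
just reads whatever is left.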
WDYT?
>
> >
> >>
> >>> + err = raw_res_spin_lock_irqsave(&local_storage->lock, flags);
> >>> + if (!err) {
> >>> + hlist_del_init_rcu(&selem->snode);
> >>> + unlink++;
> >>> + raw_res_spin_unlock_irqrestore(&local_storage->lock, flags);
> >>> + }
> >>> + RCU_INIT_POINTER(selem->local_storage, NULL);
> >>> + }
> >>> +
> >>> + /*
> >>> + * Normally, an selem can be unlinked under local_storage->lock and b->lock, and
> >>> + * then added to a local to_free list. However, if destroy() and map_free() are
> >>> + * racing or rqspinlock returns errors in unlikely situations (unlink != 2), free
> >>> + * the selem only after both map_free() and destroy() drop the refcnt.
> >>> + */
> >>> + if (unlink == 2 || atomic_dec_and_test(&selem->link_cnt))
> >>> + hlist_add_head(&selem->free_node, to_free);
> >>> +}
> >>> +
> >>> void __bpf_local_storage_insert_cache(struct bpf_local_storage *local_storage,
> >>> struct bpf_local_storage_map *smap,
> >>> struct bpf_local_storage_elem *selem)
> >>
>