Message-ID: <CAMB2axNtrFEL0x+j6M3De-ezR38cv5s7LFtpuAeN7QCf_AyHaQ@mail.gmail.com>
Date: Mon, 12 Jan 2026 14:38:13 -0800
From: Amery Hung <ameryhung@...il.com>
To: Martin KaFai Lau <martin.lau@...ux.dev>
Cc: netdev@...r.kernel.org, alexei.starovoitov@...il.com, andrii@...nel.org,
daniel@...earbox.net, memxor@...il.com, martin.lau@...nel.org,
kpsingh@...nel.org, yonghong.song@...ux.dev, song@...nel.org,
haoluo@...gle.com, kernel-team@...a.com, bpf@...r.kernel.org
Subject: Re: [PATCH bpf-next v3 10/16] bpf: Support lockless unlink when
freeing map or local storage
On Fri, Jan 9, 2026 at 1:38 PM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
>
>
>
> On 1/9/26 12:47 PM, Amery Hung wrote:
> > On Fri, Jan 9, 2026 at 12:16 PM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
> >> On 12/18/25 9:56 AM, Amery Hung wrote:
> >>> Introduce bpf_selem_unlink_lockless() to properly handle errors returned
> >>> from rqspinlock in bpf_local_storage_map_free() and
> >>> bpf_local_storage_destroy() where the operation must succeed.
> >>>
> >>> The idea of bpf_selem_unlink_lockless() is to allow an selem to be
> >>> partially linked and use a refcount to determine when and by whom the
> >>> selem can be freed. An selem is initially fully linked to a map and a
> >>> local storage, and therefore selem->link_cnt is set to 2. Under normal
> >>> circumstances, bpf_selem_unlink_lockless() will be able to grab the
> >>> locks and unlink an selem from the map and the local storage in
> >>> sequence, just like bpf_selem_unlink(), and then add it to a local
> >>> to_free list provided by
> >>> the caller. However, if any of the lock attempts fails, it will
> >>> only clear SDATA(selem)->smap or selem->local_storage depending on the
> >>> caller and decrement link_cnt to signal that the corresponding data
> >>> structure holding a reference to the selem is gone. Then, only when both
> >>> map and local storage are gone, an selem can be freed by the last
> >>> caller that drops link_cnt to 0.
> >>>
> >>> To make sure bpf_obj_free_fields() is done only once and while the map is
> >>> still present, it is called when unlinking an selem from b->list under
> >>> b->lock.
> >>>
> >>> To make sure uncharging memory is only done once and while the owner is
> >>> still present, only unlink the selem from local_storage->list in
> >>> bpf_local_storage_destroy() and return the amount of memory to uncharge
> >>> to the caller (i.e., owner) since the map associated with an selem may
> >>> already be gone and map->ops->map_local_storage_uncharge can no longer
> >>> be referenced.
> >>>
> >>> Finally, accesses of selem, SDATA(selem)->smap and selem->local_storage
> >>> are racy. Callers will protect these fields with RCU.
> >>>
> >>> Co-developed-by: Martin KaFai Lau <martin.lau@...nel.org>
> >>> Signed-off-by: Martin KaFai Lau <martin.lau@...nel.org>
> >>> Signed-off-by: Amery Hung <ameryhung@...il.com>
> >>> ---
> >>> include/linux/bpf_local_storage.h | 2 +-
> >>> kernel/bpf/bpf_local_storage.c | 77 +++++++++++++++++++++++++++++--
> >>> 2 files changed, 74 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> >>> index 20918c31b7e5..1fd908c44fb6 100644
> >>> --- a/include/linux/bpf_local_storage.h
> >>> +++ b/include/linux/bpf_local_storage.h
> >>> @@ -80,9 +80,9 @@ struct bpf_local_storage_elem {
> >>> * after raw_spin_unlock
> >>> */
> >>> };
> >>> + atomic_t link_cnt;
> >>> u16 size;
> >>> bool use_kmalloc_nolock;
> >>> - /* 4 bytes hole */
> >>> /* The data is stored in another cacheline to minimize
> >>> * the number of cachelines access during a cache hit.
> >>> */
> >>> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> >>> index 62201552dca6..4c682d5aef7f 100644
> >>> --- a/kernel/bpf/bpf_local_storage.c
> >>> +++ b/kernel/bpf/bpf_local_storage.c
> >>> @@ -97,6 +97,7 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
> >>> if (swap_uptrs)
> >>> bpf_obj_swap_uptrs(smap->map.record, SDATA(selem)->data, value);
> >>> }
> >>> + atomic_set(&selem->link_cnt, 2);
> >>> selem->size = smap->elem_size;
> >>> selem->use_kmalloc_nolock = smap->use_kmalloc_nolock;
> >>> return selem;
> >>> @@ -200,9 +201,11 @@ static void bpf_selem_free_rcu(struct rcu_head *rcu)
> >>> /* The bpf_local_storage_map_free will wait for rcu_barrier */
> >>> smap = rcu_dereference_check(SDATA(selem)->smap, 1);
> >>>
> >>> - migrate_disable();
> >>> - bpf_obj_free_fields(smap->map.record, SDATA(selem)->data);
> >>> - migrate_enable();
> >>> + if (smap) {
> >>> + migrate_disable();
> >>> + bpf_obj_free_fields(smap->map.record, SDATA(selem)->data);
> >>> + migrate_enable();
> >>> + }
> >>> kfree_nolock(selem);
> >>> }
> >>>
> >>> @@ -227,7 +230,8 @@ void bpf_selem_free(struct bpf_local_storage_elem *selem,
> >>> * is only supported in task local storage, where
> >>> * smap->use_kmalloc_nolock == true.
> >>> */
> >>> - bpf_obj_free_fields(smap->map.record, SDATA(selem)->data);
> >>> + if (smap)
> >>> + bpf_obj_free_fields(smap->map.record, SDATA(selem)->data);
> >>> __bpf_selem_free(selem, reuse_now);
> >>> return;
> >>> }
> >>> @@ -419,6 +423,71 @@ int bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool reuse_now)
> >>> return err;
> >>> }
> >>>
> >>> +/* Callers of bpf_selem_unlink_lockless() */
> >>> +#define BPF_LOCAL_STORAGE_MAP_FREE 0
> >>> +#define BPF_LOCAL_STORAGE_DESTROY 1
> >>> +
> >>> +/*
> >>> + * Unlink an selem from map and local storage with lockless fallback if callers
> >>> + * are racing or rqspinlock returns an error. It should only be called by
> >>> + * bpf_local_storage_destroy() or bpf_local_storage_map_free().
> >>> + */
> >>> +static void bpf_selem_unlink_lockless(struct bpf_local_storage_elem *selem,
> >>> + struct hlist_head *to_free, int caller)
> >>> +{
> >>> + struct bpf_local_storage *local_storage;
> >>> + struct bpf_local_storage_map_bucket *b;
> >>> + struct bpf_local_storage_map *smap;
> >>> + unsigned long flags;
> >>> + int err, unlink = 0;
> >>> +
> >>> + local_storage = rcu_dereference_check(selem->local_storage, bpf_rcu_lock_held());
> >>> + smap = rcu_dereference_check(SDATA(selem)->smap, bpf_rcu_lock_held());
> >>> +
> >>> + /*
> >>> + * Free special fields immediately as SDATA(selem)->smap will be cleared.
> >>> + * No BPF program should be reading the selem.
> >>> + */
> >>> + if (smap) {
> >>> + b = select_bucket(smap, selem);
> >>> + err = raw_res_spin_lock_irqsave(&b->lock, flags);
> >>> + if (!err) {
> >>> + if (likely(selem_linked_to_map(selem))) {
> >>> + hlist_del_init_rcu(&selem->map_node);
> >>> + bpf_obj_free_fields(smap->map.record, SDATA(selem)->data);
> >>> + RCU_INIT_POINTER(SDATA(selem)->smap, NULL);
> >>> + unlink++;
> >>> + }
> >>> + raw_res_spin_unlock_irqrestore(&b->lock, flags);
> >>> + } else if (caller == BPF_LOCAL_STORAGE_MAP_FREE) {
> >>> + RCU_INIT_POINTER(SDATA(selem)->smap, NULL);
> >>
> >> I suspect I am missing something obvious, so it will be faster to ask here.
> >>
> >> I could see why the NULL init can work if we could assume the map_free
> >> caller always succeeds in taking the b->lock, i.e. the "if (!err)" path above.
> >>
> >> If the b->lock did fail here for the map_free caller, it resets
> >> SDATA(selem)->smap to NULL. Who can do the bpf_obj_free_fields() in
> >> the future?
> >
> > I think this can only mean destroy() is holding the lock, and
> > destroy() should do bpf_selem_unlink_map_nolock(). Though I am not
>
> hmm... instead of bpf_selem_unlink_map_nolock(), do you mean the "if
> (!err)" path in this bpf_selem_unlink_lockless() function? Or are we
> talking about different things?
Ah yes. Sorry for the confusion.
>
> [ btw, a nit, I think it can use a better function name instead of
> "lockless". This function still takes the lock if it can. ]
Does using a _nofail suffix make it clearer?
>
> > sure how destroy() can hold b->lock in a way that causes map_free() to
> > fail acquiring b->lock.
>
> I recall ETIMEDOUT was mentioned to be the likely (only?) case here.
> Assuming the b->lock did fail in map_free() here, there can be >1 selem
> using the same b->lock. Is it always true that the selem that failed at
> the b->lock in map_free() must race with the very same selem in
> destroy()?
>
> >
> >>
> >>> + }
> >>> + }
> >>> +
> >>> + /*
> >>> + * Only let destroy() unlink from local_storage->list and do mem_uncharge
> >>> + * as owner is guaranteed to be valid in destroy().
> >>> + */
> >>> + if (local_storage && caller == BPF_LOCAL_STORAGE_DESTROY) {
> >>
> >> If I read this correctly, only bpf_local_storage_destroy() can do the
> >> bpf_selem_free(). For example, if a bpf_sk_storage_map is going away,
> >> the selem (which is memcg charged) will stay in the sk until the sk is
> >> closed?
> >
> > You read it correctly, and yes, there will be stale elements in
> > local_storage->list.
> >
> > I would hope the unlink from local_storage part is doable from
> > map_free() and destroy(), but it is hard to guarantee in a lockless
> > way that (1) mem_uncharge is done only once and (2) the owner is
> > still valid.
>
> This needs to be addressed. It cannot leave the selem lingering. At
> least the selem should be freed for the common case (i.e., succeeds in
> both locks). Lingering selem is ok in case of lock failure. Many sk can
> be long-lived connections. The user space may want to recreate the map,
> and it will be limited by the memcg.
I think this is doable by maintaining a local memory charge count in
the local storage. That is, remove selem->size and keep a running
total of the charged selem memory in a new local_storage->selem_size.
Updates will be protected by local_storage->lock in the common paths
(not in bpf_selem_unlink_nofail). More specifically, charge in
bpf_selem_link_storage_nolock() when a selem is about to be
publicized, and uncharge in bpf_selem_unlink_storage_nolock() when a
selem is being deleted. Then, in destroy(), we simply read the total
amount to uncharge from the owner out of local_storage->selem_size.
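
Roughly, an untested sketch (only the relevant lines of each function
are shown; uncharge_size is a made-up out param for destroy() to
report the total back to its caller):

	struct bpf_local_storage {
		...
		u32 selem_size;	/* total charged bytes, under ->lock */
	};

	/* bpf_selem_link_storage_nolock(), local_storage->lock held */
	local_storage->selem_size += smap->elem_size;
	hlist_add_head_rcu(&selem->snode, &local_storage->list);

	/* bpf_selem_unlink_storage_nolock(), local_storage->lock held */
	local_storage->selem_size -= smap->elem_size;
	hlist_del_init_rcu(&selem->snode);

	/* bpf_local_storage_destroy(): the owner is still valid here,
	 * so hand the remaining total back for the owner to uncharge
	 * exactly once.
	 */
	*uncharge_size = local_storage->selem_size;

Since selem_size is only updated under local_storage->lock in the
common paths, the nofail path never has to touch it, and destroy()
just reads whatever is left.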
WDYT?
>
> >
> >>
> >>> + err = raw_res_spin_lock_irqsave(&local_storage->lock, flags);
> >>> + if (!err) {
> >>> + hlist_del_init_rcu(&selem->snode);
> >>> + unlink++;
> >>> + raw_res_spin_unlock_irqrestore(&local_storage->lock, flags);
> >>> + }
> >>> + RCU_INIT_POINTER(selem->local_storage, NULL);
> >>> + }
> >>> +
> >>> + /*
> >>> + * Normally, an selem can be unlinked under local_storage->lock and b->lock, and
> >>> + * then added to a local to_free list. However, if destroy() and map_free() are
> >>> + * racing or rqspinlock returns errors in unlikely situations (unlink != 2), free
> >>> + * the selem only after both map_free() and destroy() drop the refcnt.
> >>> + */
> >>> + if (unlink == 2 || atomic_dec_and_test(&selem->link_cnt))
> >>> + hlist_add_head(&selem->free_node, to_free);
> >>> +}
> >>> +
> >>> void __bpf_local_storage_insert_cache(struct bpf_local_storage *local_storage,
> >>> struct bpf_local_storage_map *smap,
> >>> struct bpf_local_storage_elem *selem)
> >>
>