Message-ID: <20251218175628.1460321-1-ameryhung@gmail.com>
Date: Thu, 18 Dec 2025 09:56:10 -0800
From: Amery Hung <ameryhung@...il.com>
To: bpf@...r.kernel.org
Cc: netdev@...r.kernel.org,
alexei.starovoitov@...il.com,
andrii@...nel.org,
daniel@...earbox.net,
memxor@...il.com,
martin.lau@...nel.org,
kpsingh@...nel.org,
yonghong.song@...ux.dev,
song@...nel.org,
haoluo@...gle.com,
ameryhung@...il.com,
kernel-team@...a.com
Subject: [PATCH bpf-next v3 00/16] Remove task and cgroup local storage percpu counters

* Motivation *

The goal of this patchset is to make the bpf syscalls and helpers that
update task and cgroup local storage more robust by removing the percpu
counters in them. Task local storage and cgroup storage each employ a
percpu counter to prevent deadlock caused by recursion. Since the
underlying bpf local storage takes spinlocks in various operations, bpf
programs running recursively may try to take a spinlock that is already
held. For example, when a tracing bpf program invoked recursively during
bpf_task_storage_get(..., F_CREATE) calls
bpf_task_storage_get(..., F_CREATE) again, it would cause an AA deadlock
if the percpu counter were not in place.

However, the percpu counter can also make bpf syscalls or helpers
return errors spuriously whenever another thread happens to be updating
the local storage or the local storage map on the same CPU. Ideally,
the two threads could simply take turns acquiring the locks and
complete their jobs. Due to the percpu counter, though, the syscalls
and helpers return -EBUSY even when neither of them runs recursively
inside the other; all it takes is the two threads running on the same
CPU. This happened when BPF-CI ran the task local data selftest: since
CI runs the test on a VM with 2 CPUs, bpf_task_storage_get(...,
F_CREATE) can easily fail.
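
The old protection looks roughly like the following (a paraphrase of
the pattern, not the exact kernel code):

  /* Per-cpu busy counter guarding every task storage operation */
  static DEFINE_PER_CPU(int, bpf_task_storage_busy);

  static bool task_storage_trylock(void)
  {
          migrate_disable();
          if (unlikely(this_cpu_inc_return(bpf_task_storage_busy) != 1)) {
                  /* Some other context on this CPU is already in a task
                   * storage operation. It may be a recursive call (a
                   * real deadlock) or just an unrelated thread that
                   * happens to share the CPU (a spurious -EBUSY).
                   */
                  this_cpu_dec(bpf_task_storage_busy);
                  migrate_enable();
                  return false;
          }
          return true;
  }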

This failure mode is bad for users: they need to add retry logic in
user space or in bpf programs to work around it, and even with retries
there is no guaranteed upper bound on the number of attempts before a
call succeeds. Therefore, this patchset removes the percpu counter and
makes the related bpf syscalls and helpers more reliable, while still
making sure recursion deadlock cannot happen, with the help of the
resilient queued spinlock (rqspinlock).

* Implementation *

To remove the percpu counter without introducing deadlock,
bpf_local_storage is refactored by changing the locks from
raw_spin_lock to rqspinlock, which prevents deadlock through deadlock
detection and a timeout mechanism.

The refactor basically replaces the locks with rqspinlock and
propagates errors returned by the locking function to BPF helpers and
syscalls. bpf_selem_unlink_lockless() is introduced to handle
rqspinlock errors in two lock-taking paths that are not allowed to
fail, bpf_local_storage_destroy() and bpf_local_storage_map_free()
(i.e., when local storage is being freed by the subsystem or the map
is being freed).
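
In shape, the conversion looks like this (a simplified sketch, not the
exact diff):

  /* Before: acquiring the lock cannot fail, so recursion deadlocks. */
  raw_spin_lock_irqsave(&local_storage->lock, flags);
  /* ... update storage ... */
  raw_spin_unlock_irqrestore(&local_storage->lock, flags);

  /* After: the lock attempt itself can fail, and the error is
   * propagated to the BPF helper or syscall instead of deadlocking.
   */
  ret = raw_res_spin_lock_irqsave(&local_storage->lock, flags);
  if (ret)
          return ret;     /* e.g. -EDEADLK from deadlock detection */
  /* ... update storage ... */
  raw_res_spin_unlock_irqrestore(&local_storage->lock, flags);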

For readers not familiar with local storage, the appendix at the end
briefly describes the locks and the structure of local storage. It
also defines the abbreviations used in the rest of this letter.

* Test *

Task and cgroup local storage selftests already cover deadlocks caused
by recursion. Patch 10 updates the expected result of the task local
storage selftests, since the task local storage bpf helpers can now
run on the same CPU without causing deadlock.

* Patchset organization *

Patch 1-4 convert local storage internal helpers to failable.
Patch 5 changes the locks to rqspinlock and propagates the errors
returned from raw_res_spin_lock_irqsave() to BPF helpers and syscalls.
It temporarily WARN_ON()s in map_free and destroy.
Patch 6-8 remove the percpu counters in task and cgroup local storage.
Patch 9-11 address the unlikely rqspinlock errors by switching to
bpf_selem_unlink_lockless() in map_free and destroy.
Patch 12-16 update selftests.

* Appendix: local storage internals *

There are two locks in bpf_local_storage due to the ownership model
illustrated in the figure below. A map value, which consists of a
pointer to the map and the data, is a bpf_local_storage_data (sdata)
stored in a bpf_local_storage_elem (selem). A selem belongs to a
bpf_local_storage and a bpf_local_storage_map at the same time.
bpf_local_storage::lock (local_storage->lock for short) protects the
list in a bpf_local_storage, and bpf_local_storage_map_bucket::lock
(b->lock) protects the hash bucket in a bpf_local_storage_map.

 task_struct
┌ task1 ───────┐     bpf_local_storage
│ *bpf_storage │---->┌───────────┐
└──────────────┘<----│ *owner    │      bpf_local_storage_elem
                     │ *cache[16]│         (selem)             selem
                     │ *smap     │       ┌──────────┐        ┌──────────┐
                     │ list      │------>│ snode    │<------>│ snode    │
                     │ lock      │  ┌--->│ map_node │<--┐ ┌->│ map_node │
                     └───────────┘  │    │ sdata =  │   │ │  │ sdata =  │
 task_struct                        │    │ {&mapA,} │   │ │  │ {&mapB,} │
┌ task2 ───────┐     bpf_local_storage └──────────┘   │ │  └──────────┘
│ *bpf_storage │---->┌───────────┐  │                 │ │
└──────────────┘<----│ *owner    │  │                 │ │
                     │ *cache[16]│  │      selem      │ │      selem
                     │ *smap     │  │    ┌──────────┐   │ │  ┌──────────┐
                     │ list      │--│--->│ snode    │<--│-│->│ snode    │
                     │ lock      │  │ ┌->│ map_node │   └-│->│ map_node │
                     └───────────┘  │ │  │ sdata =  │     │  │ sdata =  │
 bpf_local_storage_map              │ │  │ {&mapB,} │     │  │ {&mapA,} │
 (smap)                             │ │  └──────────┘     │  └──────────┘
┌ mapA ───────┐                     │ │                   │
│ bpf_map map │  bpf_local_storage_map_bucket             │
│ *buckets    │---->┌ b[0] ┐        │ │                   │
└─────────────┘     │ list │--------┘ │                   │
                    │ lock │          │                   │
                    └──────┘          │                   │
 smap                 ...             │                   │
┌ mapB ───────┐                       │                   │
│ bpf_map map │  bpf_local_storage_map_bucket             │
│ *buckets    │---->┌ b[0] ┐          │                   │
└─────────────┘     │ list │----------┘                   │
                    │ lock │                              │
                    └──────┘                              │
                    ┌ b[1] ┐                              │
                    │ list │------------------------------┘
                    │ lock │
                    └──────┘
                     ...
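
For reference, the structures above look roughly like the following
(abridged from include/linux/bpf_local_storage.h before this series;
several members are omitted):

  struct bpf_local_storage_map_bucket {
          struct hlist_head list; /* selems of this map, via map_node */
          raw_spinlock_t lock;    /* b->lock, converted to rqspinlock */
  };

  struct bpf_local_storage_data {
          struct bpf_local_storage_map __rcu *smap;
          u8 data[] __aligned(8); /* the map value seen by bpf progs */
  };

  struct bpf_local_storage_elem {
          struct hlist_node map_node;     /* linked in a map bucket */
          struct hlist_node snode;        /* linked in an owner's storage */
          struct bpf_local_storage __rcu *local_storage;
          /* ... */
          struct bpf_local_storage_data sdata;
  };

  struct bpf_local_storage {
          struct bpf_local_storage_data __rcu *cache[16];
          struct hlist_head list;         /* selems of this owner */
          void *owner;                    /* e.g. task_struct or cgroup */
          /* ... */
          raw_spinlock_t lock;            /* local_storage->lock */
  };
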
Changelog:
v2 -> v3
- Rebase to bpf-next where BPF memory allocator is replaced with
kmalloc_nolock()
- Revert to selecting bucket based on selem
- Introduce bpf_selem_unlink_lockless() to allow unlinking and
freeing selem without taking locks
Link: https://lore.kernel.org/bpf/20251002225356.1505480-1-ameryhung@gmail.com/
v1 -> v2
- Rebase to bpf-next
- Select bucket based on local_storage instead of selem (Martin)
- Simplify bpf_selem_unlink (Martin)
- Change handling of rqspinlock errors in bpf_local_storage_destroy()
and bpf_local_storage_map_free(). Retry instead of WARN_ON.
Link: https://lore.kernel.org/bpf/20250729182550.185356-1-ameryhung@gmail.com/
---
Amery Hung (16):
bpf: Convert bpf_selem_unlink_map to failable
bpf: Convert bpf_selem_link_map to failable
bpf: Open code bpf_selem_unlink_storage in bpf_selem_unlink
bpf: Convert bpf_selem_unlink to failable
bpf: Change local_storage->lock and b->lock to rqspinlock
bpf: Remove task local storage percpu counter
bpf: Remove cgroup local storage percpu counter
bpf: Remove unused percpu counter from bpf_local_storage_map_free
bpf: Save memory allocation method and size in bpf_local_storage_elem
bpf: Support lockless unlink when freeing map or local storage
bpf: Switch to bpf_selem_unlink_lockless in
bpf_local_storage_{map_free, destroy}
selftests/bpf: Update sk_storage_omem_uncharge test
selftests/bpf: Update task_local_storage/recursion test
selftests/bpf: Update task_local_storage/task_storage_nodeadlock test
selftests/bpf: Remove test_task_storage_map_stress_lookup
selftests/bpf: Choose another percpu variable in bpf for btf_dump test
include/linux/bpf_local_storage.h | 20 +-
kernel/bpf/bpf_cgrp_storage.c | 63 +---
kernel/bpf/bpf_inode_storage.c | 7 +-
kernel/bpf/bpf_local_storage.c | 311 +++++++++++++-----
kernel/bpf/bpf_task_storage.c | 155 ++-------
kernel/bpf/helpers.c | 4 -
net/core/bpf_sk_storage.c | 17 +-
.../bpf/map_tests/task_storage_map.c | 128 -------
.../selftests/bpf/prog_tests/btf_dump.c | 4 +-
.../bpf/prog_tests/task_local_storage.c | 8 +-
.../bpf/progs/read_bpf_task_storage_busy.c | 38 ---
.../bpf/progs/sk_storage_omem_uncharge.c | 12 +-
.../bpf/progs/task_storage_nodeadlock.c | 7 +-
13 files changed, 294 insertions(+), 480 deletions(-)
delete mode 100644 tools/testing/selftests/bpf/map_tests/task_storage_map.c
delete mode 100644 tools/testing/selftests/bpf/progs/read_bpf_task_storage_busy.c
--
2.47.3