Message-ID: <CAMB2axM2qgsstBc+efRN4CU0V1-dMa7DemPkTHZ1Fhdku7o2HQ@mail.gmail.com>
Date: Tue, 29 Jul 2025 11:28:48 -0700
From: Amery Hung <ameryhung@...il.com>
To: bpf@...r.kernel.org
Cc: netdev@...r.kernel.org, alexei.starovoitov@...il.com, andrii@...nel.org,
daniel@...earbox.net, memxor@...il.com, kpsingh@...nel.org,
martin.lau@...nel.org, yonghong.song@...ux.dev, song@...nel.org,
haoluo@...gle.com, kernel-team@...a.com
Subject: Re: [RFC PATCH bpf-next v1 00/11] Remove task and cgroup local storage percpu counters
The title is "Remove task and cgroup local storage percpu counters"
On Tue, Jul 29, 2025 at 11:25 AM Amery Hung <ameryhung@...il.com> wrote:
>
> * Motivation *
>
> The goal of this patchset is to make bpf syscalls and helpers updating
> task and cgroup local storage more robust by removing percpu counters
> in them. Task local storage and cgroup storage each employs a percpu
> counter to prevent deadlock caused by recursion. Since the underlying
> bpf local storage takes spinlocks in various operations, bpf programs
> running recursively may try to take a spinlock which is already taken.
> For example, when a tracing bpf program called recursively during
> bpf_task_storage_get(..., F_CREATE) tries to call
> bpf_task_storage_get(..., F_CREATE) again, it will cause AA deadlock
> if the percpu variable is not in place.
>
> However, the percpu counter can sometimes cause bpf syscalls or helpers
> to return errors spuriously, as soon as another thread is also updating
> the local storage or the local storage map. Ideally, the two threads
> could have taken turns acquiring the locks and performed their jobs
> respectively. However, due to the percpu counter, the syscalls and
> helpers can return -EBUSY even when neither runs recursively inside
> the other. All it takes for this to happen is for the two threads to run
> on the same CPU. This happened when BPF-CI ran the selftest of task local
> data: since CI runs the test on a VM with 2 CPUs, bpf_task_storage_get(...,
> F_CREATE) can easily fail.
>
> This failure mode is not good for users, as they need to add retry logic
> in user space or in bpf programs to avoid it. Even with retry, there is
> no guaranteed upper bound on the number of attempts before a call
> succeeds. Therefore, this patchset seeks to remove the percpu counter
> and make the related bpf syscalls and helpers more reliable, while still
> making sure recursion deadlock cannot happen, with the help of resilient
> queued spinlock (rqspinlock).
>
>
> * Implementation *
>
> To remove the percpu counter without introducing deadlock,
> bpf_local_storage is refactored by changing the locks from raw_spin_lock
> to rqspinlock, which prevents deadlock through deadlock detection and a
> timeout mechanism.
>
> There are two locks in bpf_local_storage due to the ownership model
> illustrated in the figure below. A map value, which consists of a
> pointer to the map and the data, is a bpf_local_storage_data (sdata)
> stored in a bpf_local_storage_elem (selem). A selem belongs to a
> bpf_local_storage and a bpf_local_storage_map at the same time.
> bpf_local_storage::lock (local_storage->lock in short) protects the list
> in a bpf_local_storage, and bpf_local_storage_map_bucket::lock (b->lock)
> protects the hash bucket in a bpf_local_storage_map.
>
>
> task_struct
> ┌ task1 ───────┐ bpf_local_storage
> │ *bpf_storage │---->┌─────────┐
> └──────────────┘<----│ *owner │ bpf_local_storage_elem
> │ *cache[16] (selem) selem
> │ *smap │ ┌──────────┐ ┌──────────┐
> │ list │------->│ snode │<------->│ snode │
> │ lock │ ┌---->│ map_node │<--┐ ┌-->│ map_node │
> └─────────┘ │ │ sdata = │ │ │ │ sdata = │
> task_struct │ │ {&mapA,} │ │ │ │ {&mapB,} │
> ┌ task2 ───────┐ bpf_local_storage └──────────┘ │ │ └──────────┘
> │ *bpf_storage │---->┌─────────┐ │ │ │
> └──────────────┘<----│ *owner │ │ │ │
> │ *cache[16] │ selem │ │ selem
> │ *smap │ │ ┌──────────┐ │ │ ┌──────────┐
> │ list │--│---->│ snode │<--│-│-->│ snode │
> │ lock │ │ ┌-->│ map_node │ └-│-->│ map_node │
> └─────────┘ │ │ │ sdata = │ │ │ sdata = │
> bpf_local_storage_map │ │ │ {&mapB,} │ │ │ {&mapA,} │
> (smap) │ │ └──────────┘ │ └──────────┘
> ┌ mapA ───────┐ │ │ │
> │ bpf_map map │ bpf_local_storage_map_bucket │
> │ *buckets │---->┌ b[0] ┐ │ │ │
> └─────────────┘ │ list │------┘ │ │
> │ lock │ │ │
> └──────┘ │ │
> smap ... │ │
> ┌ mapB ───────┐ │ │
> │ bpf_map map │ bpf_local_storage_map_bucket │
> │ *buckets │---->┌ b[0] ┐ │ │
> └─────────────┘ │ list │--------┘ │
> │ lock │ │
> └──────┘ │
> ┌ b[1] ┐ │
> │ list │-----------------------------┘
> │ lock │
> └──────┘
> ...
>
>
> The refactoring is divided into three steps.
>
> First, in patch 1-4, local storage helpers that take locks are
> converted to failable. The functions are changed from returning void to
> returning an int error code, with the return value temporarily set to 0.
> In callers where the helpers cannot fail in the middle of an update,
> the helper is open coded. In callers that are not allowed to fail
> (i.e., bpf_local_storage_destroy() and bpf_local_storage_map_free()),
> we make sure the functions cannot be called recursively and cause
> deadlock, and assert that the return value is 0.
>
> Then, in patch 5, the locks are changed to rqspinlock, and the error
> returned from raw_res_spin_lock_irqsave() is propagated to the syscalls
> and helpers.
>
> Finally, in patch 6-8, the percpu counters in task and cgroup local
> storage are removed.
>
> Question:
>
> - In bpf_local_storage_destroy() and bpf_local_storage_map_free(), which
>   are not allowed to fail, I assert that the lock acquisition always
>   succeeds, based on the fact that 1) these paths cannot run recursively,
>   causing AA deadlock, and 2) local_storage->lock and b->lock are always
>   acquired in the same order. However, I also notice that rqspinlock has
>   a timeout fallback. Is this assertion an okay thing to do?
>
>
> * Test *
>
> Task and cgroup local storage selftests already cover deadlock
> caused by recursion. Patch 9 updates the expected results of the task
> local storage selftests, since task local storage bpf helpers can now
> run on the same CPU when they do not cause deadlock.
>
> * Patch organization *
>
> E(exposed) L(local storage lock) B(bucket lock)
> EL __bpf_local_storage_insert_cache (skip cache update)
> ELB bpf_local_storage_destroy (cannot recur)
> ELB bpf_local_storage_map_free (cannot recur)
> ELB bpf_selem_unlink --> Patch 4
> E B bpf_selem_link_map --> Patch 2
> B bpf_selem_unlink_map --> Patch 1
> L bpf_selem_unlink_storage --> Patch 3
>
> During the refactoring, to make sure all exposed functions
> handle the error returned by raw_res_spin_lock_irqsave(),
> __must_check is added locally to catch all callers.
>
> Patch 1-4
> Convert local storage helpers to failable, or open-code
> the helpers
>
> Patch 5
> Change local_storage->lock and b->lock from
> raw_spin_lock to rqspinlock
>
> Patch 6
> Remove percpu counter in task local storage and remove
> bpf_task_storage_{get,delete}_recur()
>
> Patch 7
> Remove percpu counter in cgroup local storage
>
> Patch 8
> Remove unused percpu counter in bpf_local_storage
>
> Patch 9
> Update task local storage recursion test
>
> Patch 10
> Remove task local storage stress test
>
> Patch 11
> Update btf_dump to use another percpu variable
>
> ----
>
> Amery Hung (11):
> bpf: Convert bpf_selem_unlink_map to failable
> bpf: Convert bpf_selem_link_map to failable
> bpf: Open code bpf_selem_unlink_storage in bpf_selem_unlink
> bpf: Convert bpf_selem_unlink to failable
> bpf: Change local_storage->lock and b->lock to rqspinlock
> bpf: Remove task local storage percpu counter
> bpf: Remove cgroup local storage percpu counter
> bpf: Remove unused percpu counter from bpf_local_storage_map_free
> selftests/bpf: Update task_local_storage/recursion test
> selftests/bpf: Remove test_task_storage_map_stress_lookup
> selftests/bpf: Choose another percpu variable in bpf for btf_dump test
>
> include/linux/bpf_local_storage.h | 14 +-
> kernel/bpf/bpf_cgrp_storage.c | 60 +-----
> kernel/bpf/bpf_inode_storage.c | 6 +-
> kernel/bpf/bpf_local_storage.c | 202 ++++++++++++------
> kernel/bpf/bpf_task_storage.c | 153 ++-----------
> kernel/bpf/helpers.c | 4 -
> net/core/bpf_sk_storage.c | 10 +-
> .../bpf/map_tests/task_storage_map.c | 128 -----------
> .../selftests/bpf/prog_tests/btf_dump.c | 4 +-
> .../bpf/prog_tests/task_local_storage.c | 8 +-
> .../bpf/progs/read_bpf_task_storage_busy.c | 38 ----
> 11 files changed, 184 insertions(+), 443 deletions(-)
> delete mode 100644 tools/testing/selftests/bpf/map_tests/task_storage_map.c
> delete mode 100644 tools/testing/selftests/bpf/progs/read_bpf_task_storage_busy.c
>
> --
> 2.47.3
>