[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250730185903.3574598-1-ameryhung@gmail.com>
Date: Wed, 30 Jul 2025 11:58:51 -0700
From: Amery Hung <ameryhung@...il.com>
To: bpf@...r.kernel.org
Cc: netdev@...r.kernel.org,
alexei.starovoitov@...il.com,
andrii@...nel.org,
daniel@...earbox.net,
tj@...nel.org,
memxor@...il.com,
martin.lau@...nel.org,
linux-lists@...alapatis.com,
ameryhung@...il.com,
kernel-team@...a.com
Subject: [PATCH bpf-next v7 0/4] Task local data
v6 -> v7
- Fix typos and improve the clarity of the cover letter (Emil)
- Some variable naming changes to make it less confusing (Emil)
- Add retry in tld_obj_init() as bpf_task_storage_get_recur() with
BPF_LOCAL_STORAGE_GET_F_CREATE can fail if the task local storage
is also being modified by other threads on the same CPU. This can
be especially easy to trigger in CI VM that only has two CPUs.
Adding bpf_preempt_{disable,enable} around bpf_task_storage_get()
does not solve the problem as other threads competing for the percpu
counter lock in task local storage may not limit to bpf helpers.
Some may be calling bpf_task_storage_free() when threads exit.
There is no good max retry under heavy contention. For the 1st
selftest in this set, the max retry to guarantee success grows
linearly with the thread count. A proposal is to remove the percpu
counter by changing locks in bpf_local_storage to rqspinlock.
An alternative is to reduce the probability of failure by allowing
bpf syscall and struct_ops programs to use bpf_task_storage_get()
instead of the _recur version by modifying bpf_prog_check_recur().
This does not solve the problem in tracing programs though.
v6: https://lore.kernel.org/bpf/20250717164842.1848817-1-ameryhung@gmail.com/
* Overview *
This patchset introduces task local data, an abstract storage type
shared between user space and bpf programs for storing data specific to
each task, and implements a library to access it. The goal is to provide
an abstraction + implementation that is efficient in data sharing and
easy to use. The motivating use case is user space hinting with
sched_ext.
* Motivating use case *
CPU schedulers can potentially make a better decision with hints from
user space process. To support experimenting user space hinting with
sched_ext, there needs a mechanism to pass a "per-task hint" from user
space to the bpf scheduler "efficiently".
A bpf feature, UPTR [0], supported by task local storage is introduced
to serve the needs. It allows pinning a user space page to the kernel
through a special field in task local storage map. This patchset further
builds an abstraction on top of it to allow user space and bpf programs
to easily share per-task data.
* Design *
Task local data is built on top of task local storage map and UPTR to
achieve fast per-task data sharing. UPTR is a type of special field
supported in task local storage map value. A user page assigned to a UPTR
will be pinned to the kernel when the map is updated. Therefore, user
space programs can update data that will be seen by bpf programs without
syscalls.
Additionally, unlike most bpf maps, task local data does not require a
static map value definition. This design is driven by sched_ext, which
would like to allow multiple developers to share a storage without the
need to explicitly agree on the layout of it. While a centralized layout
definition would have worked, the friction of synchronizing it across
different repos is not desirable. This simplify code base management and
makes experimenting easier.
* Introduction to task local data library *
Task local data library provides simple APIs for user space and bpf
through two header files, task_local_data.h and task_local_data.bpf.h,
respectively. The diagram below illustrates the basic usage.
First, an entry of data in task local data, we call it a TLD, needs to
be defined in the user space through TLD_DEFINE_KEY() with information
including the size and the name. The macro will define a global variable
key of opaque type tld_key_t associated with the TLD and initialize it.
Then, the user space program can locate the TLD by passing the key to
tld_get_data() in different thread, where fd is the file descriptor of
the underlying task local storage map. The API returns a pointer to the
TLD specific to the calling thread and will remain valid until the
thread exits. Finally, user space programs can directly read/write the
TLD without bpf syscalls.
To use task local storage on the bpf side, struct tld_keys, needs to
be defined first. The structure that will be used to create tld_key_map
should contain the keys to the TLDs used in a bpf program. In the bpf
program, tld_init_object() first needs to be called to initialize a
struct tld_object variable on the stack. Then, tld_get_data() can be
called to get a pointer to the TLD similar to the user space. The API
will use the name to lookup task local data and cache the key in task
local storage map, tld_key_map, to speed up subsequent access. The size
of the TLD is also needed to prevent out-of-bound access in bpf.
┌─ Application ───────────────────────────────────────────────────────┐
│ TLD_DEFINE_KEY(kx, "X", 4); ┌─ library A ─────────────────────┐│
│ │ void func(...) ││
│ int main(...) │ { ││
│ { │ tld_key_t ky; ││
│ int *x; │ bool *y; ││
│ │ ││
│ x = tld_get_data(fd, kx); │ ky = tld_create_key("Y", 1);││
│ if (x) *x = 123; │ y = tld_get_data(fd, ky); ││
│ ┌────────┤ if (y) *y = true; ││
│ │ └─────────────────────────────────┘│
└───────┬─────────────────│───────────────────────────────────────────┘
V V
+ ─ Task local data ─ ─ ─ ─ ─ + ┌─ BPF program ──────────────────────┐
| ┌─ tld_data_map ──────────┐ | │ struct tld_object obj; │
| │ BPF Task local storage │ | │ bool *y; │
| │ │ | │ int *x; │
| │ tld_data_u __uptr *data │ | │ │
| │ tld_meta_u __uptr *meta │ | │ if (tld_init_object(task, &obj)) │
| └─────────────────────────┘ | │ return 0; │
| ┌─ tld_key_map ───────────┐ | │ │
| │ BPF Task local storage │ | │ x = tld_get_data(&obj, kx, "X", 4);│
| │ │ |<─┤ if (x) /* do something */ │
| │ tld_key_t kx; │ | │ │
| │ tld_key_t ky; │ | │ y = tld_get_data(&obj, ky, "Y", 1);│
| └─────────────────────────┘ | │ if (y) /* do something */ │
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ + └────────────────────────────────────┘
* Implementation *
Task local data defines the storage to be a task local storage map with
two UPTRs pointing to tld_data_u and tld_meta_u. tld_data_u, individual
to each thread, contains TLD data and the starting offset of data in a
page. tld_meta_u, shared by threads in a process, consists of the TLD
counts, total size of TLDs and tld_metadata for each TLD. tld_metadata
contains the name and the size of a TLD.
struct tld_data_u {
u64 start;
char data[PAGE_SIZE - sizeof(u64)];
};
struct tld_meta_u {
u8 cnt;
u16 size;
struct metadata metadata[TLD_DATA_CNT];
};
Both user space and bpf API follow the same protocol when accessing
task local data. A pointer to a TLD is located using a key. The key is
effectively the offset of a TLD in tld_data_u->data. To define a new
TLD, the user space API TLD_DEFINE_KEY() iterates tld_meta_u->metadata
until an empty slot is found and then update it. It also adds up sizes
of prior TLDs to derive the offset (i.e., key). Then, to locate a TLD,
tld_get_data() can simply return tld_data_u->data + offset.
To locate a TLD in bpf programs, an API with the same name as the user
space, tld_get_data() can be called. It takes more effort in the first
invocation as it fetches the key by name. Internal helper,
__tld_fetch_key() will iterate tld_meta_u->metadata until the name is
found. Then, the offset is cached as key in another task local storage
map, tld_key_map. When the search fails, the current TLD count is
cached instead to skip searching metadata entries that has been searched
in future invocation. The detail of task local data operations can be
found in patch 1.
* Misc *
The metadata can potentially use run-length encoding for names to reduce
memory wastage and support save more TLDs. I have a version that works,
but the selftest takes a bit longer to finish. More investigation needed
to find the root cause. I will save this for the future when there is a
need to store more than 63 TLDs.
[0] https://lore.kernel.org/bpf/20241023234759.860539-1-martin.lau@linux.dev/
---
v5 -> v6
- Address Andrii's comment
- Fix verification failure in no_alu32
- Some cleanup
v5: https://lore.kernel.org/bpf/20250627233958.2602271-1-ameryhung@gmail.com/
v4 -> v5
- Add an option to free memory on thread exit to prevent memory leak
- Add an option to reduce memory waste if the allocator can
use just enough memory to fullfill aligned_alloc() (e.g., glibc)
- Tweak bpf API
- Remove tld_fetch_key() as it does not work in init_tasl
- tld_get_data() now tries to fetch key if it is not cached yet
- Optimize bpf side tld_get_data()
- Faster fast path
- Less code
- Use stdatomic.h in user space library with seq_cst order
- Introduce TLD_DEFINE_KEY() as the default TLD creation API for
easier memory management.
- TLD_DEFINE_KEY() can consume memory up to a page and no memory
is wasted since their size is known before per-thread data
allocation.
- tld_create_key() can only use up to TLD_DYN_DATA_SIZE. Since
tld_create_key can run any time even after per-thread data
allocation, it is impossible to predict the total size. A
configurable size of memory is allocated on top of the total
size of TLD_DEFINE_KEY() to accommodate dynamic key creation.
- Add tld prefix to all macros
- Replace map_update(NO_EXIST) in __tld_init_data() with cmpxchg()
- No more +1,-1 dance on the bpf side
- Reduce printf from ASSERT in race test
- Try implementing run-length encoding for name and decide to
save it for the future
v4: https://lore.kernel.org/bpf/20250515211606.2697271-1-ameryhung@gmail.com/
v3 -> v4
- API improvements
- Simplify API
- Drop string obfuscation
- Use opaque type for key
- Better documentation
- Implementation
- Switch to dynamic allocation for per-task data
- Now offer as header-only libraries
- No TLS map pinning; leave it to users
- Drop pthread dependency
- Add more invalid tld_create_key() test
- Add a race test for tld_create_key()
v3: https://lore.kernel.org/bpf/20250425214039.2919818-1-ameryhung@gmail.com/
---
Amery Hung (4):
bpf: Allow syscall bpf programs to call non-recur helpers
selftests/bpf: Introduce task local data
selftests/bpf: Test basic task local data operations
selftests/bpf: Test concurrent task local data key creation
include/linux/bpf_verifier.h | 1 +
.../bpf/prog_tests/task_local_data.h | 386 ++++++++++++++++++
.../bpf/prog_tests/test_task_local_data.c | 297 ++++++++++++++
.../selftests/bpf/progs/task_local_data.bpf.h | 237 +++++++++++
.../bpf/progs/test_task_local_data.c | 65 +++
5 files changed, 986 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_task_local_data.c
create mode 100644 tools/testing/selftests/bpf/progs/task_local_data.bpf.h
create mode 100644 tools/testing/selftests/bpf/progs/test_task_local_data.c
--
2.47.3
Powered by blists - more mailing lists