[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250627233958.2602271-1-ameryhung@gmail.com>
Date: Fri, 27 Jun 2025 16:39:54 -0700
From: Amery Hung <ameryhung@...il.com>
To: bpf@...r.kernel.org
Cc: netdev@...r.kernel.org,
alexei.starovoitov@...il.com,
andrii@...nel.org,
daniel@...earbox.net,
tj@...nel.org,
memxor@...il.com,
martin.lau@...nel.org,
ameryhung@...il.com,
kernel-team@...a.com
Subject: [PATCH bpf-next v5 0/3] Task local data
* Motivation *
CPU schedulers can potentially make a better decision with hints from
user space process. To support experimenting user space hinting with
sched_ext, there needs a mechanism to pass a "per-task hint" from user
space to the bpf scheduler "efficiently".
The proposed mechanism is task local data. Similar to pthread key or
__thread, it allows users to define thread-specific data. In addition,
the user pages that back task local data are pinned to the kernel to
share with bpf programs directly. As a result, user space programs
can directly update per-thread hints, and then bpf program can read
the hint with little overhead. The diagram in the design section gives
a sneak peek of how it works.
* Overview *
Task local data defines an abstract storage type for storing data specific
to each task and provides user space and bpf libraries to access it. The
result is a fast and easy way to share per-task data between user space
and bpf programs. The intended use case is sched_ext, where user space
programs will pass hints to sched_ext bpf programs to affect task
scheduling.
Task local data is built on top of task local storage map and UPTR[0]
to achieve fast per-task data sharing. UPTR is a type of special field
supported in task local storage map value. A user page assigned to a UPTR
will be pinned by the kernel when the map is updated. Therefore, user
space programs can update data seen by bpf programs without syscalls.
Additionally, unlike most bpf maps, task local data does not require a
static map value definition. This design is driven by sched_ext, which
would like to allow multiple developers to share a storage without the
need to explicitly agree on the layout of it. While a centralized layout
definition would have worked, the friction of synchronizing it across
different repos is not desirable. This simplify code base management and
makes experimenting easier.
In the rest of the cover letter, "task local data" is used to refer to
the abstract storage and TLD is used to denote a single data entry in
the storage.
* Design *
Task local data library provides simple APIs for user space and bpf
through two header files, task_local_data.h and task_loca_data.bpf.h,
respectively. The usage is illustrated in the following diagram.
An entry of data in the task local data, TLD, first needs to be defined
with TLD_DEFINE_KEY() with the size of the data and a name associated with
the data. The macro defines and initialize an opaque key object of
tld_key_t type, which can be used to locate a TLD. The same key may be
passed to tld_get_data() in different threads, and a pointer to data
specific to the calling thread will be returned. The pointer will
remain valid until the process terminates, so there is not need to call
tld_get_data() in subsequent accesses.
TLD_DEFINE_KEY() is allowed to define TLDs up to roughly a page. In the
case when a TLD can only be known and created on the fly,
tld_create_key() can be called. Since the total TLD size cannot be known
beforehand, a memory of size TLD_DYN_DATA_SIZE is allocated for each
thread to accommodate them.
On the bpf side, programs will use also use tld_get_data() to locate
TLDs. The arugments contain a name and a key to a TLD. The name is
used for the first tld_get_data() to a TLD, which will lookup the TLD
by name and save the corresponding key to a task local data map,
tld_key_map. The map value type, struct tld_keys, __must__ be defined by
developers. It should contain keys used in the compilation unit.
┌─ Application ───────────────────────────────────────────────────────┐
│ TLD_DEFINE_KEY(kx, "X", 4); ┌─ library A ─────────────────────┐│
│ │ void func(...) ││
│ int main(...) │ { ││
│ { │ tld_key_t ky; ││
│ int *x; │ bool *y; ││
│ │ ││
│ x = tld_get_data(fd, kx); │ ky = tld_create_key("Y", 1);││
│ if (x) *x = 123; │ y = tld_get_data(fd, ky); ││
│ ┌────────┤ if (y) *y = true; ││
│ │ └─────────────────────────────────┘│
└───────┬─────────────────│───────────────────────────────────────────┘
V V
+ ─ Task local data ─ ─ ─ ─ ─ + ┌─ BPF program ──────────────────────┐
| ┌─ tld_data_map ──────────┐ | │ struct tld_object obj; │
| │ BPF Task local storage │ | │ bool *y; │
| │ │ | │ int *x; │
| │ __uptr *data │ | │ │
| │ __uptr *metadata │ | │ if (tld_init_object(task, &obj)) │
| └─────────────────────────┘ | │ return 0; │
| ┌─ tld_key_map ───────────┐ | │ │
| │ BPF Task local storage │ | │ x = tld_get_data(&obj, kx, "X", 4);│
| │ │ |<─┤ if (x) /* do something */ │
| │ tld_key_t kx; │ | │ │
| │ tld_key_t ky; │ | │ y = tld_get_data(&obj, ky, "Y", 1);│
| └─────────────────────────┘ | │ if (y) /* do something */ │
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ + └────────────────────────────────────┘
* Implementation *
Task local data defines the storage to be a task local storage map with
two UPTRs, data and metadata. Data points to a blob of memory for storing
TLDs individual to every task with the offset of data in a page. Metadata,
individual to each process and shared by its threads, records the total
number and size of TLDs and the metadata of each TLD. Metadata for a
TLD contains the key name and the size of the TLD.
struct u_tld_data {
u64 start;
char data[PAGE_SIZE - 8];
};
struct u_tld_metadata {
u8 cnt;
u16 size;
struct metadata data[TLD_DATA_CNT];
};
Both user space and bpf API follow the same protocol when accessing
task local data. A pointer to a TLD is located by a key. tld_key_t
effectively is the offset of a TLD in data. To add a TLD, user space
API, loops through metadata->data until an empty slot is found and update
it. It also adds sizes of prior TLDs along the way to derive the offset.
To locate a TLD in bpf when the first time tld_get_data() is called,
__tld_fetch_key() also loops through metadata->data until the name is
found. The offset is also derived by adding sizes. When the TLD is not
found, the current TLD count is cached instead to skip name comparison
that has been done. The detail of task local data operations can be found
in patch 1.
* Misc *
The metadata can potentially use run-length encoding for names to reduce
memory wastage and support save more TLDs. I have a version that works,
but the selftest takes a bit longer to finish. More investigation needed
to find the root cause. I will save this for the future when there is a
need to store more than 63 TLDs.
[0] https://lore.kernel.org/bpf/20241023234759.860539-1-martin.lau@linux.dev/
---
v4 -> v5
- Add an option to free memory on thread exit to prevent memory leak
- Add an option to reduce memory waste if the allocator can
use just enough memory to fullfill aligned_alloc() (e.g., glibc)
- Tweak bpf API
- Remove tld_fetch_key() as it does not work in init_tasl
- tld_get_data() now tries to fetch key if it is not cached yet
- Optimize bpf side tld_get_data()
- Faster fast path
- Less code
- Use stdatomic.h in user space library with seq_cst order
- Introduce TLD_DEFINE_KEY() as the default TLD creation API for
easier memory management.
- TLD_DEFINE_KEY() can consume memory up to a page and no memory
is wasted since their size is known before per-thread data
allocation.
- tld_create_key() can only use up to TLD_DYN_DATA_SIZE. Since
tld_create_key can run any time even after per-thread data
allocation, it is impossible to predict the total size. A
configurable size of memory is allocated on top of the total
size of TLD_DEFINE_KEY() to accommodate dynamic key creation.
- Add tld prefix to all macros
- Replace map_update(NO_EXIST) in __tld_init_data() with cmpxchg()
- No more +1,-1 dance on the bpf side
- Reduce printf from ASSERT in race test
- Try implementing run-length encoding for name and decide to
save it for the future
V4: https://lore.kernel.org/bpf/20250515211606.2697271-1-ameryhung@gmail.com/
v3 -> v4
- API improvements
- Simplify API
- Drop string obfuscation
- Use opaque type for key
- Better documentation
- Implementation
- Switch to dynamic allocation for per-task data
- Now offer as header-only libraries
- No TLS map pinning; leave it to users
- Drop pthread dependency
- Add more invalid tld_create_key() test
- Add a race test for tld_create_key()
v3: https://lore.kernel.org/bpf/20250425214039.2919818-1-ameryhung@gmail.com/
Amery Hung (3):
selftests/bpf: Introduce task local data
selftests/bpf: Test basic task local data operations
selftests/bpf: Test concurrent task local data key creation
.../bpf/prog_tests/task_local_data.h | 397 ++++++++++++++++++
.../bpf/prog_tests/test_task_local_data.c | 294 +++++++++++++
.../selftests/bpf/progs/task_local_data.bpf.h | 232 ++++++++++
.../bpf/progs/test_task_local_data.c | 65 +++
4 files changed, 988 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_task_local_data.c
create mode 100644 tools/testing/selftests/bpf/progs/task_local_data.bpf.h
create mode 100644 tools/testing/selftests/bpf/progs/test_task_local_data.c
--
2.47.1
Powered by blists - more mailing lists