[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAMB2axP-iT=rR+Ks9UbExQ+sRJ_6kjF6+b1CMYKvHMhHR2_iOw@mail.gmail.com>
Date: Tue, 29 Jul 2025 14:54:38 -0700
From: Amery Hung <ameryhung@...il.com>
To: Emil Tsalapatis <linux-lists@...alapatis.com>
Cc: bpf@...r.kernel.org, netdev@...r.kernel.org, alexei.starovoitov@...il.com,
andrii@...nel.org, daniel@...earbox.net, tj@...nel.org, memxor@...il.com,
martin.lau@...nel.org, kernel-team@...a.com
Subject: Re: [PATCH bpf-next v6 0/3] Task local data
On Mon, Jul 28, 2025 at 6:54 AM Emil Tsalapatis
<linux-lists@...alapatis.com> wrote:
>
> Some feedback on the cover letter:
>
> On Thu, Jul 17, 2025 at 12:48 PM Amery Hung <ameryhung@...il.com> wrote:
> >
> > * Motivation *
> >
> > CPU schedulers can potentially make a better decision with hints from
> > user space process. To support experimenting user space hinting with
> > sched_ext, there needs a mechanism to pass a "per-task hint" from user
> > space to the bpf scheduler "efficiently".
> >
>
> I think this is underselling the feature a bit, isn't it useful for
> all BPF users?
> Otherwise why isn't it in the scx repo?
Task local data indeed is a generic mechanism to share per-task data
between user space and bpf.
The v6 cover letter referenced scx to give readers a more concrete
idea where this patchset comes from and how it will work end to end.
I will make the motivation more centered on TLD as a per-task data
sharing mechanism.
>
> > The proposed mechanism is task local data. Similar to pthread key or
> > __thread, it allows users to define thread-specific data. In addition,
> > the user pages that back task local data are pinned to the kernel to
> > share with bpf programs directly. As a result, user space programs
> > can directly update per-thread hints, and then bpf program can read
> > the hint with little overhead. The diagram in the design section gives
> > a sneak peek of how it works.
> >
> >
> > * Overview *
> >
> > Task local data defines an abstract storage type for storing data specific
> > to each task and provides user space and bpf libraries to access it. The
> > result is a fast and easy way to share per-task data between user space
> > and bpf programs. The intended use case is sched_ext, where user space
> > programs will pass hints to sched_ext bpf programs to affect task
> > scheduling.
> >
>
> Same as above, why is this a sched_ext feature? Maybe some of the design
> choices are informed by it, but there is nothing sched_ext specific in here.
>
> > Task local data is built on top of task local storage map and UPTR[0]
> > to achieve fast per-task data sharing. UPTR is a type of special field
> > supported in task local storage map value. A user page assigned to a UPTR
> > will be pinned by the kernel when the map is updated. Therefore, user
> > space programs can update data seen by bpf programs without syscalls.
> >
> > Additionally, unlike most bpf maps, task local data does not require a
> > static map value definition. This design is driven by sched_ext, which
> > would like to allow multiple developers to share a storage without the
> > need to explicitly agree on the layout of it. While a centralized layout
> > definition would have worked, the friction of synchronizing it across
> > different repos is not desirable. This simplify code base management and
> > makes experimenting easier.
> >
>
> This text is hard to follow because the point it makes belongs in the design
> section, this is a design point that does not really matter at this
> point in the text.
> At least make it more abstract maybe?
Thanks for reviewing. I will make it more abstract and move a part of
it to the design section.
>
> > In the rest of the cover letter, "task local data" is used to refer to
> > the abstract storage and TLD is used to denote a single data entry in
> > the storage.
> >
> >
> > * Design *
> >
> > Task local data library provides simple APIs for user space and bpf
> > through two header files, task_local_data.h and task_loca_data.bpf.h,
>
> Typo: task_loca_data ->task_local_data
>
> > respectively. The usage is illustrated in the following diagram.
> > An entry of data in the task local data, TLD, first needs to be defined
> > with TLD_DEFINE_KEY() with the size of the data and a name associated with
> > the data. The macro defines and initialize an opaque key object of
>
> Typo: initialize -> initializes
>
> > tld_key_t type, which can be used to locate a TLD. The same key may be
> > passed to tld_get_data() in different threads, and a pointer to data
> > specific to the calling thread will be returned. The pointer will
> > remain valid until the process terminates, so there is not need to call
>
> Typo: not need -> no need
>
> > tld_get_data() in subsequent accesses.
> >
> > TLD_DEFINE_KEY() is allowed to define TLDs up to roughly a page. In the
> > case when a TLD can only be known and created on the fly,
> > tld_create_key() can be called. Since the total TLD size cannot be known
> > beforehand, a memory of size TLD_DYN_DATA_SIZE is allocated for each
> > thread to accommodate them.
> >
> > On the bpf side, programs will use also use tld_get_data() to locate
> > TLDs. The arugments contain a name and a key to a TLD. The name is
> > used for the first tld_get_data() to a TLD, which will lookup the TLD
> > by name and save the corresponding key to a task local data map,
> > tld_key_map. The map value type, struct tld_keys, __must__ be defined by
> > developers. It should contain keys used in the compilation unit.
> >
> >
> > ┌─ Application ───────────────────────────────────────────────────────┐
> > │ TLD_DEFINE_KEY(kx, "X", 4); ┌─ library A ─────────────────────┐│
> > │ │ void func(...) ││
> > │ int main(...) │ { ││
> > │ { │ tld_key_t ky; ││
> > │ int *x; │ bool *y; ││
> > │ │ ││
> > │ x = tld_get_data(fd, kx); │ ky = tld_create_key("Y", 1);││
> > │ if (x) *x = 123; │ y = tld_get_data(fd, ky); ││
> > │ ┌────────┤ if (y) *y = true; ││
> > │ │ └─────────────────────────────────┘│
> > └───────┬─────────────────│───────────────────────────────────────────┘
> > V V
> > + ─ Task local data ─ ─ ─ ─ ─ + ┌─ BPF program ──────────────────────┐
> > | ┌─ tld_data_map ──────────┐ | │ struct tld_object obj; │
> > | │ BPF Task local storage │ | │ bool *y; │
> > | │ │ | │ int *x; │
> > | │ __uptr *data │ | │ │
> > | │ __uptr *metadata │ | │ if (tld_init_object(task, &obj)) │
> > | └─────────────────────────┘ | │ return 0; │
> > | ┌─ tld_key_map ───────────┐ | │ │
> > | │ BPF Task local storage │ | │ x = tld_get_data(&obj, kx, "X", 4);│
> > | │ │ |<─┤ if (x) /* do something */ │
> > | │ tld_key_t kx; │ | │ │
> > | │ tld_key_t ky; │ | │ y = tld_get_data(&obj, ky, "Y", 1);│
> > | └─────────────────────────┘ | │ if (y) /* do something */ │
> > + ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ + └────────────────────────────────────┘
> >
> >
> >
> > * Implementation *
> >
> > Task local data defines the storage to be a task local storage map with
> > two UPTRs, data and metadata. Data points to a blob of memory for storing
> > TLDs individual to every task with the offset of data in a page. Metadata,
>
> Nit: Maybe just use the name of the struct, u_tld_metadata?
Will change.
>
> > individual to each process and shared by its threads, records the total
> > number and size of TLDs and the metadata of each TLD. Metadata for a
> > TLD contains the key name and the size of the TLD.
> >
> > struct u_tld_data {
> > u64 start;
> > char data[PAGE_SIZE - 8];
> > };
> >
> > struct u_tld_metadata {
> > u8 cnt;
> > u16 size;
> > struct metadata data[TLD_DATA_CNT];
>
> Rename data -> metadata? That's how it is in the code.
>
> > };
> >
> > Both user space and bpf API follow the same protocol when accessing
> > task local data. A pointer to a TLD is located by a key. tld_key_t
> > effectively is the offset of a TLD in data. To add a TLD, user space
>
> typo: user space API -> the user space API
>
> > API, loops through metadata->data until an empty slot is found and update
> > it. It also adds sizes of prior TLDs along the way to derive the offset.
> > To locate a TLD in bpf when the first time tld_get_data() is called,
> > __tld_fetch_key() also loops through metadata->data until the name is
>
> Same here, isn't it ->metadata?
>
Thanks for catching all the typos!
> > found. The offset is also derived by adding sizes. When the TLD is not
> > found, the current TLD count is cached instead to skip name comparison
> > that has been done. The detail of task local data operations can be found
> > in patch 1.
> >
> >
> > * Misc *
> >
> > The metadata can potentially use run-length encoding for names to reduce
> > memory wastage and support save more TLDs. I have a version that works,
> > but the selftest takes a bit longer to finish. More investigation needed
> > to find the root cause. I will save this for the future when there is a
> > need to store more than 63 TLDs.
> >
> >
> > [0] https://lore.kernel.org/bpf/20241023234759.860539-1-martin.lau@linux.dev/
> >
>
> Reviewed-by: Emil Tsalapatis <emil@...alapatis.com>
>
>
>
> > ---
> >
> > v5 -> v6
> > - Address Andrii's comment
> > - Fix verification failure in no_alu32
> > - Some cleanup
> > v5: https://lore.kernel.org/bpf/20250627233958.2602271-1-ameryhung@gmail.com/
> >
> > v4 -> v5
> > - Add an option to free memory on thread exit to prevent memory leak
> > - Add an option to reduce memory waste if the allocator can
> > use just enough memory to fullfill aligned_alloc() (e.g., glibc)
> > - Tweak bpf API
> > - Remove tld_fetch_key() as it does not work in init_tasl
> > - tld_get_data() now tries to fetch key if it is not cached yet
> > - Optimize bpf side tld_get_data()
> > - Faster fast path
> > - Less code
> > - Use stdatomic.h in user space library with seq_cst order
> > - Introduce TLD_DEFINE_KEY() as the default TLD creation API for
> > easier memory management.
> > - TLD_DEFINE_KEY() can consume memory up to a page and no memory
> > is wasted since their size is known before per-thread data
> > allocation.
> > - tld_create_key() can only use up to TLD_DYN_DATA_SIZE. Since
> > tld_create_key can run any time even after per-thread data
> > allocation, it is impossible to predict the total size. A
> > configurable size of memory is allocated on top of the total
> > size of TLD_DEFINE_KEY() to accommodate dynamic key creation.
> > - Add tld prefix to all macros
> > - Replace map_update(NO_EXIST) in __tld_init_data() with cmpxchg()
> > - No more +1,-1 dance on the bpf side
> > - Reduce printf from ASSERT in race test
> > - Try implementing run-length encoding for name and decide to
> > save it for the future
> > v4: https://lore.kernel.org/bpf/20250515211606.2697271-1-ameryhung@gmail.com/
> >
> > v3 -> v4
> > - API improvements
> > - Simplify API
> > - Drop string obfuscation
> > - Use opaque type for key
> > - Better documentation
> > - Implementation
> > - Switch to dynamic allocation for per-task data
> > - Now offer as header-only libraries
> > - No TLS map pinning; leave it to users
> > - Drop pthread dependency
> > - Add more invalid tld_create_key() test
> > - Add a race test for tld_create_key()
> > v3: https://lore.kernel.org/bpf/20250425214039.2919818-1-ameryhung@gmail.com/
> >
> >
> > Amery Hung (3):
> > selftests/bpf: Introduce task local data
> > selftests/bpf: Test basic task local data operations
> > selftests/bpf: Test concurrent task local data key creation
> >
> > .../bpf/prog_tests/task_local_data.h | 388 ++++++++++++++++++
> > .../bpf/prog_tests/test_task_local_data.c | 297 ++++++++++++++
> > .../selftests/bpf/progs/task_local_data.bpf.h | 227 ++++++++++
> > .../bpf/progs/test_task_local_data.c | 65 +++
> > 4 files changed, 977 insertions(+)
> > create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.h
> > create mode 100644 tools/testing/selftests/bpf/prog_tests/test_task_local_data.c
> > create mode 100644 tools/testing/selftests/bpf/progs/task_local_data.bpf.h
> > create mode 100644 tools/testing/selftests/bpf/progs/test_task_local_data.c
> >
> > --
> > 2.47.1
> >
> >
Powered by blists - more mailing lists