Message-ID: <CAEf4BzYUNckc9pXcE7BawxWFVfY--p12c3ax8ySP1P+BEww91w@mail.gmail.com>
Date: Thu, 1 May 2025 13:37:23 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Amery Hung <ameryhung@...il.com>
Cc: bpf@...r.kernel.org, netdev@...r.kernel.org, alexei.starovoitov@...il.com,
andrii@...nel.org, daniel@...earbox.net, tj@...nel.org, martin.lau@...nel.org,
kernel-team@...a.com
Subject: Re: [PATCH RFC v3 0/2] Task local data API
On Fri, Apr 25, 2025 at 2:40 PM Amery Hung <ameryhung@...il.com> wrote:
>
> Hi,
>
> This is a respin of the uptr KV store. It is renamed to task local
> data (TLD) as the problem statement and the solution have changed, and
> it now bears more similarities to pthread thread-specific data.
>
> * Overview *
>
> This patchset is a continuation of the original UPTR work[0], which aims
> to provide a fast way for user space programs to pass per-task hints to
> sched_ext schedulers. UPTR built the foundation by supporting the
> sharing of user pages with bpf programs through task local storage maps.
>
> Additionally, sched_ext would like to allow multiple developers to
> share a storage without having to explicitly agree on its layout.
> This simplifies codebase management and makes experimenting easier.
> While a centralized storage layout definition would have worked, the
> friction of synchronizing it across different repos is not desirable.
>
> This patchset contains the user space plumbing so that user space and bpf
> program developers can exchange per-task hints easily through simple
> interfaces.
>
> * Design *
>
> BPF task local data is a simple API for sharing task-specific data
> between user space and bpf programs, where data is referred to using
> string keys. As shown in the following figure, user space programs can
> define task local data using bpf_tld_type_var(). The data is
> effectively a variable declared with __thread, of which every thread
> owns an independent copy that can be accessed directly. On the bpf
> side, task local data first needs to be initialized once for every new
> task (e.g., in sched_ext_ops::init_task) using bpf_tld_init_var().
> Then, other bpf programs can get a pointer to the data using
> bpf_tld_lookup(). The task local data APIs refer to data using string
> keys so developers do not need to deal with addresses of data in a
> shared storage.
>
> ┌─ Application ─────────────────────────────────────────┐
> │ ┌─ library A ──────────────┐ │
> │ bpf_tld_type_var(int, X) │ bpf_tld_type_var(int, Y) │ │
> │ └┬─────────────────────────┘ │
> └───────┬───────────────────│───────────────────────────┘
> │ X = 123; │ Y = true;
bpf_tld_type_var() is *defining* the variable (i.e., it allocates
storage for it), right? I think for completeness we also need a
*declare* macro to access that variable from other compilation units.
> V V
> + ─ Task local data ─ ─ ─ ─ ─ ─ +
> | ┌─ task_kvs_map ────────────┐ | ┌─ sched_ext_ops::init_task ──────┐
> | │ BPF Task local storage │ | │ bpf_tld_init_var(&kvs, X); │
> | │ ┌───────────────────┐ │ |<─┤ bpf_tld_init_var(&kvs, Y); │
> | │ │ __uptr *udata │ │ | └─────────────────────────────────┘
> | │ └───────────────────┘ │ |
> | │ ┌───────────────────┐ │ | ┌─ Other sched_ext_ops op ────────┐
> | │ │ __uptr *umetadata │ │ | │ int *y; ├┐
> | │ └───────────────────┘ │ |<─┤ y = bpf_tld_lookup(&kvs, Y, 1); ││
> | └───────────────────────────┘ | │ if (y) ││
> | ┌─ task_kvs_off_map ────────┐ | │ /* do something */ ││
> | │ BPF Task local storage │ | └┬────────────────────────────────┘│
> | └───────────────────────────┘ | └─────────────────────────────────┘
> + ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ +
>
> * Implementation *
>
> The task local data API hides the memory management from developers.
> Internally, it shares user data with bpf programs through udata UPTRs.
> Task local data from different compilation units are placed into a
> custom "udata" section by the declaration API, bpf_tld_type_var(), so
> that they end up next to each other in memory. User space will need to
> call bpf_tld_thread_init() for every new thread to pin the udata pages
> to the kernel.
>
> The metadata used to address udata is stored in the umetadata UPTR. It
> is generated by constructors inserted by bpf_tld_type_var() and
> bpf_tld_thread_init(). umetadata is an array of 64 metadata entries,
> one per data item, each containing the key and the offset of the data
> in udata. During initialization, bpf_tld_init_var() will search
> umetadata for a matching key and cache its offset in task_kvs_off_map.
> Later, bpf_tld_lookup() will use the cached offset to retrieve a
> pointer into udata.
>
> * Limitation *
>
> Currently, it is assumed that all key-value pairs are known when a
> program starts. All compilation units using task local data should be
> statically linked together so that values are all placed together in a
> udata section and therefore can be shared with bpf through two UPTRs.
> The next
Lack of support for shared libraries is a big limitation, IMO, so I
think we should design for that support from the very beginning.

FWIW, I think these compile-time static __thread variable definitions
are unnecessarily limiting and add a non-trivial amount of complexity.
I think it's more reasonable to go with a more dynamic approach, and
I'll try to outline what an API (and some implementation details)
might look like.

Note, all this is related to the user-space side of things. The BPF
side is effectively unchanged, except you get a guarantee that all the
data will definitely be page-aligned, so you won't need that
two-uptr-pages handling.
First, data structures. You'll have one per-process metadata structure
describing known keys and their assigned "offsets":

struct tld_metadata {
        int cnt;
        char keys[MAX_KEY_CNT][MAX_KEY_LEN]; /* fixed-size string keys */
        __u16 offs[MAX_KEY_CNT];
        __u16 szs[MAX_KEY_CNT];
};
Now, each thread will basically have just a blob of data, so,
technically, per-thread we will just have:

struct tld_data {
        __u8 data[PAGE_SIZE];
};

By pre-allocating the entire page we avoid lots of complications, so I
think it's worth doing.
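
For completeness, here is a minimal sketch of how the BPF-side task
local storage map could tie these two structures together; the map and
struct names and the exact __uptr layout below are illustrative, not
something from the patchset:

/* BPF-side sketch (e.g., in the scheduler's .bpf.c); illustrative only */
struct tld_map_value {
        struct tld_metadata __uptr *metadata;   /* shared per-process */
        struct tld_data __uptr *data;           /* per-thread page */
};

struct {
        __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __type(key, int);
        __type(value, struct tld_map_value);
} tld_map SEC(".maps");

A bpf program would then bpf_task_storage_get() this map value and
compute value pointers from the metadata offsets, as in the original
proposal.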
Now, we really need just two APIs on the user-space side (and I'll use
the "opaque offset" type as I suggested on another patch):

typedef struct { int off; } tld_off_t;

tld_off_t tld_add_key_def(const char *key_name, size_t key_size);

This API can be called just once per key that the process cares about.
And this can be done at any point, really, very dynamically. The
implementation will:
- (just once per process) open the pinned BPF map and remember its FD;
- (just once) allocate struct tld_metadata, unless we define it as a
  pre-allocated global variable;
- (locklessly) check if key_name is already in tld_metadata; if yes,
  return the already assigned offset;
- (locklessly) if not, add the key and assign it the offset
  offs[cnt - 1] + szs[cnt - 1], i.e., we just tightly pack all the
  values (though we should take care of alignment requirements, of
  course);
- return the newly assigned offset.
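
To make that concrete, a minimal single-threaded-flavored sketch of
tld_add_key_def() could look like the following. The lockless updates
and opening of the pinned map are omitted, MAX_KEY_LEN and the
power-of-two alignment shortcut are my own simplifications, and
MAX_KEY_CNT/PAGE_SIZE are assumed to be defined as above:

#include <string.h>

static struct tld_metadata tld_meta;    /* per-process metadata */
static int tld_map_fd = -1;             /* pinned map FD, opened once */

tld_off_t tld_add_key_def(const char *key_name, size_t key_size)
{
        tld_off_t ret = { .off = -1 };
        size_t off = 0;
        int i;

        /* if the key is already known, return its assigned offset */
        for (i = 0; i < tld_meta.cnt; i++) {
                if (strcmp(tld_meta.keys[i], key_name) == 0) {
                        ret.off = tld_meta.offs[i];
                        return ret;
                }
        }

        if (tld_meta.cnt >= MAX_KEY_CNT)
                return ret;

        /* tightly pack the new value right after the previous one;
         * the round-up below assumes power-of-two value sizes
         */
        if (tld_meta.cnt > 0)
                off = tld_meta.offs[tld_meta.cnt - 1] +
                      tld_meta.szs[tld_meta.cnt - 1];
        off = (off + key_size - 1) & ~(key_size - 1);

        if (off + key_size > PAGE_SIZE)
                return ret;     /* does not fit in the per-thread page */

        strncpy(tld_meta.keys[tld_meta.cnt], key_name, MAX_KEY_LEN - 1);
        tld_meta.offs[tld_meta.cnt] = off;
        tld_meta.szs[tld_meta.cnt] = key_size;
        tld_meta.cnt++;

        ret.off = off;
        return ret;
}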
Now, the second essential API is called for each participating thread
for each different key. And again, this is all very dynamic. It's
possible that some threads won't use any of this TLD stuff, in which
case there will be no overhead (memory or CPU), and not even an entry
in the task local storage map for that thread. So, API:

void *tld_resolve_key(tld_off_t key_off);

This API will:
- (just once per thread, which is done trivially by using a
  __thread-local global variable to keep track of this) allocate struct
  tld_data dynamically, with page alignment, i.e.,
  aligned_alloc(PAGE_SIZE, PAGE_SIZE);
- (just once per thread as well) bpf_map_update_elem() for the current
  thread, updating two uptrs: one pointing to the global tld_metadata,
  the other pointing to the thread-local tld_data;
- return tld_data->data + key_off.off.

That is, this API returns an absolute memory address of the value
resolved in the context of the current thread.
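
A matching sketch of tld_resolve_key(), reusing the illustrative names
from above (tld_meta, tld_map_fd, struct tld_map_value, here with plain
pointers on the user-space side). How exactly a pidfd for the calling
thread is obtained for the map update is hidden behind a placeholder
helper, current_thread_pidfd(), which is not a real API:

#include <stdlib.h>
#include <string.h>
#include <bpf/bpf.h>

/* user-space mirror of the map value: plain pointers, no __uptr tags */
struct tld_map_value {
        struct tld_metadata *metadata;
        struct tld_data *data;
};

static __thread struct tld_data *tld_thread_data;

void *tld_resolve_key(tld_off_t key_off)
{
        if (key_off.off < 0)
                return NULL;

        if (!tld_thread_data) {
                struct tld_map_value val = {};
                int task_fd;

                /* one zeroed, page-aligned page per participating thread */
                tld_thread_data = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
                if (!tld_thread_data)
                        return NULL;
                memset(tld_thread_data, 0, PAGE_SIZE);

                val.metadata = &tld_meta;
                val.data = tld_thread_data;

                /* task local storage maps are keyed by a pidfd from user
                 * space; current_thread_pidfd() is a placeholder for
                 * however that fd is obtained for the calling thread
                 */
                task_fd = current_thread_pidfd();
                if (bpf_map_update_elem(tld_map_fd, &task_fd, &val, BPF_ANY)) {
                        free(tld_thread_data);
                        tld_thread_data = NULL;
                        return NULL;
                }
        }

        return tld_thread_data->data + key_off.off;
}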
And let's look at how one can make use of this on the user-space side
to use this API optimally.

/* global variables */
tld_off_t my_key1_off; /* struct my_val */
tld_off_t my_key2_off; /* int */

__thread struct my_val *my_val1;
__thread int *my_val2;

... somewhere in a constructor ...

my_key1_off = tld_add_key_def("my_key1", sizeof(struct my_val));
my_key2_off = tld_add_key_def("my_key2", sizeof(int));

... and then somewhere in the code that makes use of TLD stuff ...

/* this can be initialized once per thread to avoid the check
 * (or hidden in a helper accessor function)
 */
if (!my_val1)
        my_val1 = tld_resolve_key(my_key1_off);
my_val1->blah_field = 123;

if (!my_val2)
        my_val2 = tld_resolve_key(my_key2_off);
*my_val2 = 567;
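
If the per-thread NULL check bothers anyone, it is trivially hidden
behind a tiny accessor, e.g. (illustrative):

static inline struct my_val *get_my_val1(void)
{
        if (!my_val1)
                my_val1 = tld_resolve_key(my_key1_off);
        return my_val1;
}

...

get_my_val1()->blah_field = 123;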
That's pretty much it, I think.
In a C++ code base, it should be possible to make this even more
convenient by using a templated wrapper with a thread-local inner
variable and its own constructor. By adding operator overloading (e.g.,
operator= and operator->) you get very natural-looking definition and
access patterns:

/* global data */
tld_variable<struct my_val> my_val1("my_key1");
tld_variable<int> my_val2("my_key2");

... now in the actual TLD-using code, we just do:

my_val1->blah_field = 123;
my_val2 = 567; /* I think C++ would allow this with operator overloading */
I hope the example explains why it's still fast despite everything
being dynamic. There is a pointer indirection and that page-sized
allocation (so that we can cache thread-specific resolved pointers),
but it seems absolutely fine from a performance perspective.
> iteration will explore how bpf task local data can work in dynamic
> libraries. Maybe more udata UPTRs will be added to pin the TLS pages
> of dynamically loaded modules. Or maybe it will allocate memory for
> data instead of relying on __thread, and change how user space
> interacts with task local data slightly. The latter approach can also
> save some trouble dealing with the restrictions of UPTR.
>
> Some other limitations:
> - Total task local data cannot exceed a page
> - Only 64 task local data entries are supported
> - Some memory is wasted for data whose size is not a power of two,
>   due to a UPTR limitation
>
> [0] https://lore.kernel.org/bpf/20241023234759.860539-1-martin.lau@linux.dev/
>
>
> Amery Hung (2):
> selftests/bpf: Introduce task local data
> selftests/bpf: Test basic workflow of task local data
>
> .../bpf/prog_tests/task_local_data.c | 159 +++++++++++++++
> .../bpf/prog_tests/task_local_data.h | 58 ++++++
> .../bpf/prog_tests/test_task_local_data.c | 156 +++++++++++++++
> .../selftests/bpf/progs/task_local_data.h | 181 ++++++++++++++++++
> .../bpf/progs/test_task_local_data_basic.c | 78 ++++++++
> .../selftests/bpf/task_local_data_common.h | 49 +++++
> 6 files changed, 681 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.c
> create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.h
> create mode 100644 tools/testing/selftests/bpf/prog_tests/test_task_local_data.c
> create mode 100644 tools/testing/selftests/bpf/progs/task_local_data.h
> create mode 100644 tools/testing/selftests/bpf/progs/test_task_local_data_basic.c
> create mode 100644 tools/testing/selftests/bpf/task_local_data_common.h
>
> --
> 2.47.1
>