lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAADnVQLPyHnqEbPowpN+JuMwH+iMX4=dZZu2chMaiexwAVxxJA@mail.gmail.com>
Date: Fri, 16 May 2025 11:57:25 -0700
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Amery Hung <ameryhung@...il.com>
Cc: bpf <bpf@...r.kernel.org>, Network Development <netdev@...r.kernel.org>, 
	Andrii Nakryiko <andrii@...nel.org>, Daniel Borkmann <daniel@...earbox.net>, Tejun Heo <tj@...nel.org>, 
	Kumar Kartikeya Dwivedi <memxor@...il.com>, Martin KaFai Lau <martin.lau@...nel.org>, 
	Kernel Team <kernel-team@...a.com>
Subject: Re: [PATCH bpf-next v4 1/3] selftests/bpf: Introduce task local data

On Thu, May 15, 2025 at 2:16 PM Amery Hung <ameryhung@...il.com> wrote:
>
> Task local data defines an abstract storage type for storing task-
> specific data (TLD). This patch provides user space and bpf
> implementation as header-only libraries for accessing task local data.
>
> Task local data is a bpf task local storage map with two UPTRs:
> 1) u_tld_metadata, shared by all tasks of the same process, consists of
> the total count of TLDs and an array of metadata of TLDs. A metadata of
> a TLD comprises the size and the name. The name is used to identify a
> specific TLD in bpf 2) data is memory for storing TLDs specific to the
> task.
>
> The following are the basic task local data API:
>
>                  User space               BPF
> Create key     tld_create_key()            -
> Fetch key            -               tld_fetch_key()
> Get data       tld_get_data()        tld_get_data()
>
> A TLD is first created by the user space with tld_create_key(). First,
> it goes through the metadata array to check if the TLD can be added.
> The total TLD size needs to fit into a page (limited by UPTR), and no
> two TLDs can have the same name. It also calculates the offset, the next
> available space in u_tld_data, by summing sizes of TLDs. If the TLD
> can be added, it increases the count using cmpxchg as there may be other
> concurrent tld_create_key(). After a successful cmpxchg, the last
> metadata slot now belongs to the calling thread and will be updated.
> tld_create_key() returns the offset encapsulated as a opaque object key
> to prevent user misuse.
>
> Then user space can pass the key to tld_get_data() to get a pointer
> to the TLD. The pointer will remain valid for the lifetime of the
> thread.
>
> BPF programs also locate TLDs with the keys. This is done by calling
> tld_fetch_key() with the name of the TLD. Similar to tld_create_key(),
> it scans through metadata array, compare the name of TLDs and compute
> the offset. Once found, the offset is also returned as a key, which
> can be passed to the bpf version of tld_get_data() to retrieve a
> pointer to the TLD.
>
> User space task local data library uses a light way approach to ensure
> thread safety (i.e., atomic operation + compiler and memory barriers).
> While a metadata is being updated, other threads may also try to read it.
> To prevent them from seeing incomplete data, metadata::size is used to
> signal the completion of the update, where 0 means the update is still
> ongoing. Threads will wait until seeing a non-zero size to read a
> metadata. Acquire/release order is required for metadata::size to
> prevent hardware reordering. For example, moving store to metadata::name
> after store to metadata::size or moving load from metadata::name before
> load from metadata::size.
>
> Signed-off-by: Amery Hung <ameryhung@...il.com>
> ---
>  .../bpf/prog_tests/task_local_data.h          | 263 ++++++++++++++++++
>  .../selftests/bpf/progs/task_local_data.bpf.h | 220 +++++++++++++++
>  2 files changed, 483 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.h
>  create mode 100644 tools/testing/selftests/bpf/progs/task_local_data.bpf.h
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/task_local_data.h b/tools/testing/selftests/bpf/prog_tests/task_local_data.h
> new file mode 100644
> index 000000000000..ec43ea59267c
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/task_local_data.h
> @@ -0,0 +1,263 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __TASK_LOCAL_DATA_H
> +#define __TASK_LOCAL_DATA_H
> +
> +#include <fcntl.h>
> +#include <errno.h>
> +#include <sched.h>
> +#include <stddef.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <sys/syscall.h>
> +#include <sys/types.h>
> +
> +#include <bpf/bpf.h>
> +
> +#ifndef PIDFD_THREAD
> +#define PIDFD_THREAD O_EXCL
> +#endif
> +
> +#define PAGE_SIZE 4096
> +
> +#ifndef __round_mask
> +#define __round_mask(x, y) ((__typeof__(x))((y)-1))
> +#endif
> +#ifndef round_up
> +#define round_up(x, y) ((((x)-1) | __round_mask(x, y))+1)
> +#endif
> +
> +#ifndef READ_ONCE
> +#define READ_ONCE(x) (*(volatile typeof(x) *)&(x))
> +#endif
> +
> +#ifndef WRITE_ONCE
> +#define WRITE_ONCE(x, val) ((*(volatile typeof(x) *)&(x)) = (val))
> +#endif
> +
> +#define TLD_DATA_SIZE PAGE_SIZE

wrap it with #ifndef ?

So that application can #define something smaller.

> +#define TLD_DATA_CNT 63
> +#define TLD_NAME_LEN 62
> +
> +typedef struct {
> +       __s16 off;
> +} tld_key_t;
> +
> +struct tld_metadata {
> +       char name[TLD_NAME_LEN];
> +       __u16 size;
> +};
> +
> +struct u_tld_metadata {
> +       __u8 cnt;
> +       __u8 padding[63];
> +       struct tld_metadata metadata[TLD_DATA_CNT];
> +};
> +
> +struct u_tld_data {
> +       char data[TLD_DATA_SIZE];
> +};
> +
> +struct tld_map_value {
> +       struct u_tld_data *data;
> +       struct u_tld_metadata *metadata;
> +};
> +
> +struct u_tld_metadata *tld_metadata_p __attribute__((weak));
> +__thread struct u_tld_data *tld_data_p __attribute__((weak));
> +
> +static int __tld_init_metadata(int map_fd)
> +{
> +       struct u_tld_metadata *new_metadata;
> +       struct tld_map_value map_val;
> +       int task_fd = 0, err;
> +
> +       task_fd = syscall(SYS_pidfd_open, getpid(), 0);

this task_fd and task_fd in __tld_init_data() are two different things.
Please name them differently.

> +       if (task_fd < 0) {
> +               err = -errno;
> +               goto out;
> +       }
> +
> +       new_metadata = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
> +       if (!new_metadata) {
> +               err = -ENOMEM;
> +               goto out;
> +       }
> +
> +       memset(new_metadata, 0, PAGE_SIZE);
> +
> +       map_val.data = NULL;
> +       map_val.metadata = new_metadata;
> +
> +       /*
> +        * bpf_map_update_elem(.., pid_fd,..., BPF_NOEXIST) guarantees that
> +        * only one tld_create_key() can update tld_metadata_p.
> +        */
> +       err = bpf_map_update_elem(map_fd, &task_fd, &map_val, BPF_NOEXIST);
> +       if (err) {
> +               free(new_metadata);
> +               if (err == -EEXIST || err == -EAGAIN)
> +                       err = 0;
> +               goto out;
> +       }
> +
> +       WRITE_ONCE(tld_metadata_p, new_metadata);
> +out:
> +       if (task_fd > 0)

>=

> +               close(task_fd);
> +       return err;
> +}
> +
> +static int __tld_init_data(int map_fd)
> +{
> +       struct u_tld_data *new_data = NULL;
> +       struct tld_map_value map_val;
> +       int err, task_fd = 0;
> +
> +       task_fd = syscall(SYS_pidfd_open, gettid(), PIDFD_THREAD);
> +       if (task_fd < 0) {
> +               err = -errno;
> +               goto out;
> +       }
> +
> +       new_data = aligned_alloc(PAGE_SIZE, TLD_DATA_SIZE);

probably roundup_power_of_two(TLD_DATA_SIZE) ?
Don't know about libc, but musl implementation of aligned_alloc()
is naive. It allocs align+size.
So it's a memory waste.

> +       if (!new_data) {
> +               err = -ENOMEM;
> +               goto out;
> +       }
> +
> +       map_val.data = new_data;
> +       map_val.metadata = READ_ONCE(tld_metadata_p);
> +
> +       err = bpf_map_update_elem(map_fd, &task_fd, &map_val, 0);
> +       if (err) {
> +               free(new_data);
> +               goto out;
> +       }
> +
> +       tld_data_p = new_data;
> +out:
> +       if (task_fd > 0)

>=

> +               close(task_fd);
> +       return err;
> +}
> +
> +/**
> + * tld_create_key() - Create a key associated with a TLD.
> + *
> + * @map_fd: A file descriptor of the underlying task local storage map,
> + * tld_data_map
> + * @name: The name the TLD will be associated with
> + * @size: Size of the TLD
> + *
> + * Returns an opaque object key. Use tld_key_is_err() or tld_key_err_or_zero() to
> + * check if the key creation succeed. Pass to tld_get_data() to get a pointer to
> + * the TLD. bpf programs can also fetch the same key by name.
> + */
> +__attribute__((unused))
> +static tld_key_t tld_create_key(int map_fd, const char *name, size_t size)
> +{
> +       int err, i, cnt, sz, off = 0;
> +
> +       if (!READ_ONCE(tld_metadata_p)) {
> +               err = __tld_init_metadata(map_fd);
> +               if (err)
> +                       return (tld_key_t) {.off = err};
> +       }
> +
> +       if (!tld_data_p) {
> +               err = __tld_init_data(map_fd);
> +               if (err)
> +                       return (tld_key_t) {.off = err};
> +       }
> +
> +       size = round_up(size, 8);

why roundup ? and to 8 in particular?
If user space wants byte size keys, why not let it?

> +
> +       for (i = 0; i < TLD_DATA_CNT; i++) {
> +retry:
> +               cnt = __atomic_load_n(&tld_metadata_p->cnt, __ATOMIC_RELAXED);

Instead of explicit __atomic builtins use _Atomic __u8 cnt;
?

> +               if (i < cnt) {
> +                       /*
> +                        * Pending tld_create_key() uses size to signal if the metadata has
> +                        * been fully updated.
> +                        */
> +                       while (!(sz = __atomic_load_n(&tld_metadata_p->metadata[i].size,
> +                                                     __ATOMIC_ACQUIRE)))
> +                               sched_yield();
> +
> +                       if (!strncmp(tld_metadata_p->metadata[i].name, name, TLD_NAME_LEN))
> +                               return (tld_key_t) {.off = -EEXIST};
> +
> +                       off += sz;
> +                       continue;
> +               }
> +
> +               if (off + size > TLD_DATA_SIZE)
> +                       return (tld_key_t) {.off = -E2BIG};
> +
> +               /*
> +                * Only one tld_create_key() can increase the current cnt by one and
> +                * takes the latest available slot. Other threads will check again if a new
> +                * TLD can still be added, and then compete for the new slot after the
> +                * succeeding thread update the size.
> +                */
> +               if (!__atomic_compare_exchange_n(&tld_metadata_p->cnt, &cnt, cnt + 1, true,
> +                                                __ATOMIC_RELAXED, __ATOMIC_RELAXED))

weak and relaxed/relaxed ?
That's unusual.
I don't know what it is supposed to do.
I think weak=false and __ATOMIC_ACQUIRE, __ATOMIC_RELAXED
would look as expected.

> +                       goto retry;
> +
> +               strncpy(tld_metadata_p->metadata[i].name, name, TLD_NAME_LEN);
> +               __atomic_store_n(&tld_metadata_p->metadata[i].size, size, __ATOMIC_RELEASE);
> +               return (tld_key_t) {.off = off};
> +       }
> +
> +       return (tld_key_t) {.off = -ENOSPC};
> +}
> +
> +__attribute__((unused))
> +static inline bool tld_key_is_err(tld_key_t key)
> +{
> +       return key.off < 0;
> +}
> +
> +__attribute__((unused))
> +static inline int tld_key_err_or_zero(tld_key_t key)
> +{
> +       return tld_key_is_err(key) ? key.off : 0;
> +}
> +
> +/**
> + * tld_get_data() - Gets a pointer to the TLD associated with the key.
> + *
> + * @map_fd: A file descriptor of the underlying task local storage map,
> + * tld_data_map
> + * @key: A key object returned by tld_create_key().
> + *
> + * Returns a pointer to the TLD if the key is valid; NULL if no key has been
> + * added, not enough memory for TLD for this thread, or the key is invalid.
> + *
> + * Threads that call tld_get_data() must call tld_free() on exit to prevent
> + * memory leak.
> + */
> +__attribute__((unused))
> +static void *tld_get_data(int map_fd, tld_key_t key)
> +{
> +       if (!READ_ONCE(tld_metadata_p))
> +               return NULL;
> +
> +       if (!tld_data_p && __tld_init_data(map_fd))
> +               return NULL;

Why call it again?
tld_create_key() should have done it, no?

> +
> +       return tld_data_p->data + key.off;
> +}
> +
> +/**
> + * tld_free() - Frees task local data memory of the calling thread
> + */
> +__attribute__((unused))
> +static void tld_free(void)
> +{
> +       if (tld_data_p)
> +               free(tld_data_p);
> +}

Since this .h allocates tld_metadata_p, it probably needs
a helper to free it too.

> +
> +#endif /* __TASK_LOCAL_DATA_H */
> diff --git a/tools/testing/selftests/bpf/progs/task_local_data.bpf.h b/tools/testing/selftests/bpf/progs/task_local_data.bpf.h
> new file mode 100644
> index 000000000000..5f48e408a5e5
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/task_local_data.bpf.h
> @@ -0,0 +1,220 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __TASK_LOCAL_DATA_BPF_H
> +#define __TASK_LOCAL_DATA_BPF_H
> +
> +/*
> + * Task local data is a library that facilitates sharing per-task data
> + * between user space and bpf programs.
> + *
> + *
> + * PREREQUISITE
> + *
> + * A TLD, an entry of data in task local data, first needs to be created by the
> + * user space. This is done by calling user space API, tld_create_key(), with
> + * the name of the TLD and the size.
> + *
> + *     tld_key_t prio, in_cs;
> + *
> + *     prio = tld_create_key("priority", sizeof(int));
> + *     in_cs = tld_create_key("in_critical_section", sizeof(bool));
> + *
> + * A key associated with the TLD, which has an opaque type tld_key_t, will be
> + * returned. It can be used to get a pointer to the TLD in the user space by
> + * calling tld_get_data().
> + *
> + *
> + * USAGE
> + *
> + * Similar to user space, bpf programs locate a TLD using the same key.
> + * tld_fetch_key() allows bpf programs to retrieve a key created in the user
> + * space by name, which is specified in the second argument of tld_create_key().
> + * tld_fetch_key() additionally will cache the key in a task local storage map,
> + * tld_key_map, to avoid performing costly string comparisons every time when
> + * accessing a TLD. This requires the developer to define the map value type of
> + * tld_key_map, struct tld_keys. It only needs to contain keys used by bpf
> + * programs in the compilation unit.
> + *
> + * struct tld_keys {
> + *     tld_key_t prio;
> + *     tld_key_t in_cs;
> + * };
> + *
> + * Then, for every new task, a bpf program will fetch and cache keys once and
> + * for all. This should be done ideally in a non-critical path (e.g., in
> + * sched_ext_ops::init_task).
> + *
> + *     struct tld_object tld_obj;
> + *
> + *     err = tld_object_init(task, &tld);
> + *     if (err)
> + *         return 0;
> + *
> + *     tld_fetch_key(&tld_obj, "priority", prio);
> + *     tld_fetch_key(&tld_obj, "in_critical_section", in_cs);
> + *
> + * Note that, the first argument of tld_fetch_key() is a pointer to tld_object.
> + * It should be declared as a stack variable and initialized via tld_object_init().
> + *
> + * Finally, just like user space programs, bpf programs can get a pointer to a
> + * TLD by calling tld_get_data(), with cached keys.
> + *
> + *     int *p;
> + *
> + *     p = tld_get_data(&tld_obj, prio, sizeof(int));
> + *     if (p)
> + *         // do something depending on *p
> + */
> +#include <errno.h>
> +#include <bpf/bpf_helpers.h>
> +
> +#define TLD_DATA_SIZE __PAGE_SIZE
> +#define TLD_DATA_CNT 63
> +#define TLD_NAME_LEN 62
> +
> +typedef struct {
> +       __s16 off;
> +} tld_key_t;
> +
> +struct u_tld_data *dummy_data;
> +struct u_tld_metadata *dummy_metadata;
> +
> +struct tld_metadata {
> +       char name[TLD_NAME_LEN];
> +       __u16 size;
> +};
> +
> +struct u_tld_metadata {
> +       __u8 cnt;
> +       __u8 padding[63];
> +       struct tld_metadata metadata[TLD_DATA_CNT];
> +};
> +
> +struct u_tld_data {
> +       char data[TLD_DATA_SIZE];
> +};
> +
> +struct tld_map_value {
> +       struct u_tld_data __uptr *data;
> +       struct u_tld_metadata __uptr *metadata;
> +};
> +
> +struct tld_object {
> +       struct tld_map_value *data_map;
> +       struct tld_keys *key_map;
> +};
> +
> +/*
> + * Map value of tld_key_map for caching keys. Must be defined by the developer.
> + * Members should be tld_key_t and passed to the 3rd argument of tld_fetch_key().
> + */
> +struct tld_keys;
> +
> +struct {
> +       __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> +       __uint(map_flags, BPF_F_NO_PREALLOC);
> +       __type(key, int);
> +       __type(value, struct tld_map_value);
> +} tld_data_map SEC(".maps");
> +
> +struct {
> +       __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> +       __uint(map_flags, BPF_F_NO_PREALLOC);
> +       __type(key, int);
> +       __type(value, struct tld_keys);
> +} tld_key_map SEC(".maps");
> +
> +/**
> + * tld_object_init() - Initializes a tld_object.
> + *
> + * @task: The task_struct of the target task
> + * @tld_obj: A pointer to a tld_object to be initialized
> + *
> + * Returns 0 on success; -ENODATA if the task has no TLD; -ENOMEM if the creation
> + * of tld_key_map fails
> + */
> +__attribute__((unused))
> +static int tld_object_init(struct task_struct *task, struct tld_object *tld_obj)
> +{
> +       tld_obj->data_map = bpf_task_storage_get(&tld_data_map, task, 0, 0);
> +       if (!tld_obj->data_map)
> +               return -ENODATA;
> +
> +       tld_obj->key_map = bpf_task_storage_get(&tld_key_map, task, 0,
> +                                               BPF_LOCAL_STORAGE_GET_F_CREATE);
> +       if (!tld_obj->key_map)
> +               return -ENOMEM;
> +
> +       return 0;
> +}
> +
> +/**
> + * tld_fetch_key() - Fetches the key to a TLD by name. The key has to be created
> + * by user space first with tld_create_key().
> + *
> + * @tld_obj: A pointer to a valid tld_object initialized by tld_object_init()
> + * @name: The name of the key associated with a TLD
> + * @key: The key in struct tld_keys to be saved to
> + *
> + * Returns a positive integer on success; 0 otherwise
> + */
> +#define tld_fetch_key(tld_obj, name, key)                                      \
> +       ({                                                                      \
> +               (tld_obj)->key_map->key.off = __tld_fetch_key(tld_obj, name);   \
> +       })
> +
> +__attribute__((unused))
> +static u16 __tld_fetch_key(struct tld_object *tld_obj, const char *name)
> +{
> +       int i, meta_off, cnt;
> +       void *metadata, *nm, *sz;
> +       u16 off = 0;
> +
> +       if (!tld_obj->data_map || !tld_obj->data_map->metadata)
> +               return 0;
> +
> +       cnt = tld_obj->data_map->metadata->cnt;
> +       metadata = tld_obj->data_map->metadata->metadata;
> +
> +       bpf_for(i, 0, cnt) {
> +               meta_off = i * sizeof(struct tld_metadata);
> +               if (meta_off > TLD_DATA_SIZE - offsetof(struct u_tld_metadata, metadata)
> +                                          - sizeof(struct tld_metadata))
> +                       break;
> +
> +               nm = metadata + meta_off + offsetof(struct tld_metadata, name);
> +               sz = metadata + meta_off + offsetof(struct tld_metadata, size);
> +
> +               /*
> +                * Reserve 0 for uninitialized keys. Increase the offset of a valid key
> +                * by one and subtract it later in tld_get_data().
> +                */
> +               if (!bpf_strncmp(nm, TLD_NAME_LEN, name))
> +                       return off + 1;

I think all this +1, -1 dance is doing is helping to catch an
error when tld_get_data() is called without tld_fetch_key().
I feel this is too defensive.

Let tld_fetch_key() return proper negative error code.

> +
> +               off += *(u16 *)sz;
> +       }
> +
> +       return 0;
> +}
> +
> +/**
> + * tld_get_data() - Retrieves a pointer to the TLD associated with the key.
> + *
> + * @tld_obj: A pointer to a valid tld_object initialized by tld_object_init()
> + * @key: The key of a TLD saved in tld_maps
> + * @size: The size of the TLD. Must be a known constant value
> + *
> + * Returns a pointer to the TLD data associated with the key; NULL if the key
> + * is not valid or the size is too big
> + */
> +#define tld_get_data(tld_obj, key, size) \
> +       __tld_get_data(tld_obj, (tld_obj)->key_map->key.off - 1, size)
> +
> +__attribute__((unused))
> +__always_inline void *__tld_get_data(struct tld_object *tld_obj, u32 off, u32 size)
> +{
> +       return (tld_obj->data_map->data && off >= 0 && off < TLD_DATA_SIZE - size) ?
> +               (void *)tld_obj->data_map->data + off : NULL;

If we require users to call tld_fetch_key() first and check for errors
tld_get_data() can be faster:
#define tld_get_data(tld_obj, key)
   (void *)tld_obj->data_map->data + tld_obj->key_map->key.off

I wouldn't bother with extra checks, especially for size.

Bigger question.. can we cache the whole pointer for each key ?
and then
#define tld_get_data(tld_obj, key) ld_obj->key_map->key

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ