Message-ID: <CAADnVQK30M9+eJz8OjFpteGXfpF6DoQqNxXJa3p5YGmxyG7xJw@mail.gmail.com>
Date: Fri, 16 May 2025 15:22:01 -0700
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Amery Hung <ameryhung@...il.com>
Cc: bpf <bpf@...r.kernel.org>, Network Development <netdev@...r.kernel.org>,
Andrii Nakryiko <andrii@...nel.org>, Daniel Borkmann <daniel@...earbox.net>, Tejun Heo <tj@...nel.org>,
Kumar Kartikeya Dwivedi <memxor@...il.com>, Martin KaFai Lau <martin.lau@...nel.org>,
Kernel Team <kernel-team@...a.com>
Subject: Re: [PATCH bpf-next v4 1/3] selftests/bpf: Introduce task local data
On Fri, May 16, 2025 at 1:41 PM Amery Hung <ameryhung@...il.com> wrote:
>
> > > + size = round_up(size, 8);
> >
> > why roundup ? and to 8 in particular?
> > If user space wants byte size keys, why not let it?
>
> I will remove it. This was to prevent breaking the use of TLDs in
> atomic operations, but that should be very unlikely as they are
> thread-specific.
You mean for a case where one part of the app (like a shared library)
is using u32, but the other is using u64 and doing atomic ops on it?
Makes sense to align the offset in tld_create_key(),
but it can be done without rounding all previous keys up to 8.
63 bytes in the header are wasted, so use 2 of them for an offset.
A single cmpxchg 4 byte can update cnt+offset.
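The packed-word idea above could be sketched in user space like this. The struct layout and the names (`tld_hdr`, `next_off`, `claim_slot`) are hypothetical, not from the patch; the point is that cnt and the running data offset fit in one 32-bit word, so one strong compare-exchange claims both a metadata slot and a data offset:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical header packing cnt and the next free data offset into
 * one 32-bit word, so a single cmpxchg can update both at once. */
struct tld_hdr {
	uint8_t  cnt;      /* number of keys created so far */
	uint8_t  pad;
	uint16_t next_off; /* offset of the next free byte in data[] */
};

static bool claim_slot(uint32_t *hdr_word, uint16_t size,
		       uint8_t *slot, uint16_t *off)
{
	uint32_t old = __atomic_load_n(hdr_word, __ATOMIC_SEQ_CST);

	for (;;) {
		struct tld_hdr h;
		uint32_t new;

		__builtin_memcpy(&h, &old, sizeof(h));
		*slot = h.cnt;
		*off = h.next_off;
		h.cnt++;
		h.next_off += size;
		__builtin_memcpy(&new, &h, sizeof(new));

		/* strong CAS: on failure, old is refreshed and we retry */
		if (__atomic_compare_exchange_n(hdr_word, &old, new, false,
						__ATOMIC_SEQ_CST,
						__ATOMIC_SEQ_CST))
			return true;
	}
}
```

A real version would also bound-check `next_off + size` against TLD_DATA_SIZE before publishing; the sketch only shows the atomic claim.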
Actually, why store the size in each tld_metadata?
Won't the logic be simpler if it's an offset?
The bpf-side tld_fetch_key() wouldn't need to sum the sizes.
> > > + if (i < cnt) {
> > > + /*
> > > + * Pending tld_create_key() uses size to signal if the metadata has
> > > + * been fully updated.
> > > + */
> > > + while (!(sz = __atomic_load_n(&tld_metadata_p->metadata[i].size,
> > > + __ATOMIC_ACQUIRE)))
> > > + sched_yield();
> > > +
> > > + if (!strncmp(tld_metadata_p->metadata[i].name, name, TLD_NAME_LEN))
> > > + return (tld_key_t) {.off = -EEXIST};
> > > +
> > > + off += sz;
> > > + continue;
> > > + }
> > > +
> > > + if (off + size > TLD_DATA_SIZE)
> > > + return (tld_key_t) {.off = -E2BIG};
> > > +
> > > + /*
> > > + * Only one tld_create_key() can increase the current cnt by one and
> > > + * take the latest available slot. Other threads will check again if a new
> > > + * TLD can still be added, and then compete for the new slot after the
> > > + * succeeding thread updates the size.
> > > + */
> > > + if (!__atomic_compare_exchange_n(&tld_metadata_p->cnt, &cnt, cnt + 1, true,
> > > + __ATOMIC_RELAXED, __ATOMIC_RELAXED))
> >
> > weak and relaxed/relaxed ?
>
> I can't see a reordering issue with cnt, so I chose to use relaxed. I
> can change to strong acq/rel just to be safe.
>
> > That's unusual.
> > I don't know what it is supposed to do.
> > I think weak=false and __ATOMIC_ACQUIRE, __ATOMIC_RELAXED
> > would look as expected.
> >
>
> Do you mean weak=false and __ATOMIC_RELAXED, __ATOMIC_ACQUIRE?
no idea. I just grepped the kernel and saw:
TEST_KERNEL_LOCKED(atomic_builtin_with_memorder,
__atomic_compare_exchange_n(flag, &v, 1, 0,
__ATOMIC_ACQUIRE, __ATOMIC_RELAXED),
__atomic_store_n(flag, 0, __ATOMIC_RELEASE));
TEST_KERNEL_LOCKED(atomic_builtin_wrong_memorder,
__atomic_compare_exchange_n(flag, &v, 1, 0,
__ATOMIC_RELAXED, __ATOMIC_RELAXED),
__atomic_store_n(flag, 0, __ATOMIC_RELAXED));
I'd just use __ATOMIC_SEQ_CST everywhere.
Speed is not important here.
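For illustration, a strong compare-exchange with SEQ_CST on both the success and failure paths, as suggested above, looks like this. The flag-based try-lock is a made-up example, not the patch's code; it just shows the builtin's argument order (weak flag, then success order, then failure order):

```c
#include <stdbool.h>

/* Minimal try-lock built on a strong __atomic_compare_exchange_n with
 * __ATOMIC_SEQ_CST everywhere, trading speed for simplicity. */
static bool try_lock(int *flag)
{
	int expected = 0;

	/* weak=false: no spurious failure; if *flag != 0 the CAS fails
	 * and expected is updated to the current value. */
	return __atomic_compare_exchange_n(flag, &expected, 1, false,
					   __ATOMIC_SEQ_CST,
					   __ATOMIC_SEQ_CST);
}

static void unlock(int *flag)
{
	__atomic_store_n(flag, 0, __ATOMIC_SEQ_CST);
}
```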
>
> > > + goto retry;
> > > +
> > > + strncpy(tld_metadata_p->metadata[i].name, name, TLD_NAME_LEN);
> > > + __atomic_store_n(&tld_metadata_p->metadata[i].size, size, __ATOMIC_RELEASE);
> > > + return (tld_key_t) {.off = off};
> > > + }
> > > +
> > > + return (tld_key_t) {.off = -ENOSPC};
> > > +}
> > > +
> > > +__attribute__((unused))
> > > +static inline bool tld_key_is_err(tld_key_t key)
> > > +{
> > > + return key.off < 0;
> > > +}
> > > +
> > > +__attribute__((unused))
> > > +static inline int tld_key_err_or_zero(tld_key_t key)
> > > +{
> > > + return tld_key_is_err(key) ? key.off : 0;
> > > +}
> > > +
> > > +/**
> > > + * tld_get_data() - Gets a pointer to the TLD associated with the key.
> > > + *
> > > + * @map_fd: A file descriptor of the underlying task local storage map,
> > > + * tld_data_map
> > > + * @key: A key object returned by tld_create_key().
> > > + *
> > > + * Returns a pointer to the TLD if the key is valid; NULL if no key has been
> > > + * added, not enough memory for TLD for this thread, or the key is invalid.
> > > + *
> > > + * Threads that call tld_get_data() must call tld_free() on exit to prevent
> > > + * memory leak.
> > > + */
> > > +__attribute__((unused))
> > > +static void *tld_get_data(int map_fd, tld_key_t key)
> > > +{
> > > + if (!READ_ONCE(tld_metadata_p))
> > > + return NULL;
> > > +
> > > + if (!tld_data_p && __tld_init_data(map_fd))
> > > + return NULL;
> >
> > Why call it again?
> > tld_create_key() should have done it, no?
> >
>
> A TLD is created by calling tld_create_key() once. Then, threads may
> call tld_get_data() to get their thread-specific TLD. So it is
> possible for a thread to call tld_get_data() with tld_data_p==NULL.
I see. Please add a comment.
> > > +
> > > + return tld_data_p->data + key.off;
> > > +}
> > > +
> > > +/**
> > > + * tld_free() - Frees task local data memory of the calling thread
> > > + */
> > > +__attribute__((unused))
> > > +static void tld_free(void)
> > > +{
> > > + if (tld_data_p)
> > > + free(tld_data_p);
> > > +}
> >
> > Since this .h allocates tld_metadata_p, it probably needs
> > a helper to free it too.
> >
> > > +
> > > +#endif /* __TASK_LOCAL_DATA_H */
> > > diff --git a/tools/testing/selftests/bpf/progs/task_local_data.bpf.h b/tools/testing/selftests/bpf/progs/task_local_data.bpf.h
> > > new file mode 100644
> > > index 000000000000..5f48e408a5e5
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/bpf/progs/task_local_data.bpf.h
> > > @@ -0,0 +1,220 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +#ifndef __TASK_LOCAL_DATA_BPF_H
> > > +#define __TASK_LOCAL_DATA_BPF_H
> > > +
> > > +/*
> > > + * Task local data is a library that facilitates sharing per-task data
> > > + * between user space and bpf programs.
> > > + *
> > > + *
> > > + * PREREQUISITE
> > > + *
> > > + * A TLD, an entry of data in task local data, first needs to be created by the
> > > + * user space. This is done by calling user space API, tld_create_key(), with
> > > + * the name of the TLD and the size.
> > > + *
> > > + * tld_key_t prio, in_cs;
> > > + *
> > > + * prio = tld_create_key("priority", sizeof(int));
> > > + * in_cs = tld_create_key("in_critical_section", sizeof(bool));
> > > + *
> > > + * A key associated with the TLD, which has an opaque type tld_key_t, will be
> > > + * returned. It can be used to get a pointer to the TLD in the user space by
> > > + * calling tld_get_data().
> > > + *
> > > + *
> > > + * USAGE
> > > + *
> > > + * Similar to user space, bpf programs locate a TLD using the same key.
> > > + * tld_fetch_key() allows bpf programs to retrieve a key created in the user
> > > + * space by name, which is specified in the second argument of tld_create_key().
> > > + * tld_fetch_key() additionally will cache the key in a task local storage map,
> > > + * tld_key_map, to avoid performing costly string comparisons every time when
> > > + * accessing a TLD. This requires the developer to define the map value type of
> > > + * tld_key_map, struct tld_keys. It only needs to contain keys used by bpf
> > > + * programs in the compilation unit.
> > > + *
> > > + * struct tld_keys {
> > > + * tld_key_t prio;
> > > + * tld_key_t in_cs;
> > > + * };
> > > + *
> > > + * Then, for every new task, a bpf program will fetch and cache keys once and
> > > + * for all. This should be done ideally in a non-critical path (e.g., in
> > > + * sched_ext_ops::init_task).
> > > + *
> > > + * struct tld_object tld_obj;
> > > + *
> > > + * err = tld_object_init(task, &tld);
> > > + * if (err)
> > > + * return 0;
> > > + *
> > > + * tld_fetch_key(&tld_obj, "priority", prio);
> > > + * tld_fetch_key(&tld_obj, "in_critical_section", in_cs);
> > > + *
> > > + * Note that, the first argument of tld_fetch_key() is a pointer to tld_object.
> > > + * It should be declared as a stack variable and initialized via tld_object_init().
> > > + *
> > > + * Finally, just like user space programs, bpf programs can get a pointer to a
> > > + * TLD by calling tld_get_data(), with cached keys.
> > > + *
> > > + * int *p;
> > > + *
> > > + * p = tld_get_data(&tld_obj, prio, sizeof(int));
> > > + * if (p)
> > > + * // do something depending on *p
> > > + */
> > > +#include <errno.h>
> > > +#include <bpf/bpf_helpers.h>
> > > +
> > > +#define TLD_DATA_SIZE __PAGE_SIZE
> > > +#define TLD_DATA_CNT 63
> > > +#define TLD_NAME_LEN 62
> > > +
> > > +typedef struct {
> > > + __s16 off;
> > > +} tld_key_t;
> > > +
> > > +struct u_tld_data *dummy_data;
> > > +struct u_tld_metadata *dummy_metadata;
> > > +
> > > +struct tld_metadata {
> > > + char name[TLD_NAME_LEN];
> > > + __u16 size;
> > > +};
> > > +
> > > +struct u_tld_metadata {
> > > + __u8 cnt;
> > > + __u8 padding[63];
> > > + struct tld_metadata metadata[TLD_DATA_CNT];
> > > +};
> > > +
> > > +struct u_tld_data {
> > > + char data[TLD_DATA_SIZE];
> > > +};
> > > +
> > > +struct tld_map_value {
> > > + struct u_tld_data __uptr *data;
> > > + struct u_tld_metadata __uptr *metadata;
> > > +};
> > > +
> > > +struct tld_object {
> > > + struct tld_map_value *data_map;
> > > + struct tld_keys *key_map;
> > > +};
> > > +
> > > +/*
> > > + * Map value of tld_key_map for caching keys. Must be defined by the developer.
> > > + * Members should be tld_key_t and passed to the 3rd argument of tld_fetch_key().
> > > + */
> > > +struct tld_keys;
> > > +
> > > +struct {
> > > + __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> > > + __uint(map_flags, BPF_F_NO_PREALLOC);
> > > + __type(key, int);
> > > + __type(value, struct tld_map_value);
> > > +} tld_data_map SEC(".maps");
> > > +
> > > +struct {
> > > + __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> > > + __uint(map_flags, BPF_F_NO_PREALLOC);
> > > + __type(key, int);
> > > + __type(value, struct tld_keys);
> > > +} tld_key_map SEC(".maps");
> > > +
> > > +/**
> > > + * tld_object_init() - Initializes a tld_object.
> > > + *
> > > + * @task: The task_struct of the target task
> > > + * @tld_obj: A pointer to a tld_object to be initialized
> > > + *
> > > + * Returns 0 on success; -ENODATA if the task has no TLD; -ENOMEM if the creation
> > > + * of tld_key_map fails
> > > + */
> > > +__attribute__((unused))
> > > +static int tld_object_init(struct task_struct *task, struct tld_object *tld_obj)
> > > +{
> > > + tld_obj->data_map = bpf_task_storage_get(&tld_data_map, task, 0, 0);
> > > + if (!tld_obj->data_map)
> > > + return -ENODATA;
> > > +
> > > + tld_obj->key_map = bpf_task_storage_get(&tld_key_map, task, 0,
> > > + BPF_LOCAL_STORAGE_GET_F_CREATE);
> > > + if (!tld_obj->key_map)
> > > + return -ENOMEM;
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/**
> > > + * tld_fetch_key() - Fetches the key to a TLD by name. The key has to be created
> > > + * by user space first with tld_create_key().
> > > + *
> > > + * @tld_obj: A pointer to a valid tld_object initialized by tld_object_init()
> > > + * @name: The name of the key associated with a TLD
> > > + * @key: The key in struct tld_keys to be saved to
> > > + *
> > > + * Returns a positive integer on success; 0 otherwise
> > > + */
> > > +#define tld_fetch_key(tld_obj, name, key) \
> > > + ({ \
> > > + (tld_obj)->key_map->key.off = __tld_fetch_key(tld_obj, name); \
> > > + })
> > > +
> > > +__attribute__((unused))
> > > +static u16 __tld_fetch_key(struct tld_object *tld_obj, const char *name)
> > > +{
> > > + int i, meta_off, cnt;
> > > + void *metadata, *nm, *sz;
> > > + u16 off = 0;
> > > +
> > > + if (!tld_obj->data_map || !tld_obj->data_map->metadata)
> > > + return 0;
> > > +
> > > + cnt = tld_obj->data_map->metadata->cnt;
> > > + metadata = tld_obj->data_map->metadata->metadata;
> > > +
> > > + bpf_for(i, 0, cnt) {
> > > + meta_off = i * sizeof(struct tld_metadata);
> > > + if (meta_off > TLD_DATA_SIZE - offsetof(struct u_tld_metadata, metadata)
> > > + - sizeof(struct tld_metadata))
> > > + break;
> > > +
> > > + nm = metadata + meta_off + offsetof(struct tld_metadata, name);
> > > + sz = metadata + meta_off + offsetof(struct tld_metadata, size);
> > > +
> > > + /*
> > > + * Reserve 0 for uninitialized keys. Increase the offset of a valid key
> > > + * by one and subtract it later in tld_get_data().
> > > + */
> > > + if (!bpf_strncmp(nm, TLD_NAME_LEN, name))
> > > + return off + 1;
> >
> > I think all this +1, -1 dance is doing is helping to catch an
> > error when tld_get_data() is called without tld_fetch_key().
> > I feel this is too defensive.
> >
> > Let tld_fetch_key() return proper negative error code.
> >
>
> I can certainly return negative error code.
>
> But for the +1, -1 logic, I think it is a simpler way to differentiate an
> uninitialized key in tld_key_map from the first TLD (both key.off ==
> 0). After all, bpf programs can call tld_get_data() without
> tld_fetch_key().
I'm saying we don't need to catch this case.
progs should not call tld_get_data() without tld_fetch_key().
If they do, it's a bug.
>
> > > +
> > > + off += *(u16 *)sz;
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/**
> > > + * tld_get_data() - Retrieves a pointer to the TLD associated with the key.
> > > + *
> > > + * @tld_obj: A pointer to a valid tld_object initialized by tld_object_init()
> > > + * @key: The key of a TLD saved in tld_maps
> > > + * @size: The size of the TLD. Must be a known constant value
> > > + *
> > > + * Returns a pointer to the TLD data associated with the key; NULL if the key
> > > + * is not valid or the size is too big
> > > + */
> > > +#define tld_get_data(tld_obj, key, size) \
> > > + __tld_get_data(tld_obj, (tld_obj)->key_map->key.off - 1, size)
> > > +
> > > +__attribute__((unused))
> > > +__always_inline void *__tld_get_data(struct tld_object *tld_obj, u32 off, u32 size)
> > > +{
> > > + return (tld_obj->data_map->data && off >= 0 && off < TLD_DATA_SIZE - size) ?
> > > + (void *)tld_obj->data_map->data + off : NULL;
> >
> > If we require users to call tld_fetch_key() first and check for errors
> > tld_get_data() can be faster:
> > #define tld_get_data(tld_obj, key)
> > (void *)tld_obj->data_map->data + tld_obj->key_map->key.off
> >
>
> tld_get_data() can be called in a bpf program without tld_fetch_key(),
> so checking tld_obj->data_map->data is needed as the first load from
> tld_obj->data_map->data yields a "mem_or_null".
>
> I did try to save this uptr "mem" after null check to stack (e.g., in
> a tld_object) so that we can save subsequent checks. However, the
> compiler sometime will do a fresh load from map_ptr when reading
> tld_obj->data_map->data. Might need some work or trick to make it
> happen.
likely because you do tld_obj->data_map->data twice.
> > I wouldn't bother with extra checks, especially for size.
> >
>
> It needs to be bound-checked. If tld_get_data() doesn't do it, bpf
> programs have to do it before access. Otherwise:
>
> ; return (tld_obj->data_map->data && off >= 0) ? @ task_local_data.bpf.h:218
> 25: (bf) r3 = r1 ; R1_w=mem(sz=4096) R3_w=mem(sz=4096)
> 26: (0f) r3 += r2 ;
> R2_w=scalar(smin=0,smax=umax=0xffffffff,smin32=0xffff7fff,smax32=32766,var_off=(0x0;
> 0xffffffff)) R3_w=mem(sz=4096,smin=0,smax=umax=0xffffffff,var_off=(0x0;
> 0xfffffff)
> ; test_value1 = *int_p; @ test_task_local_data.c:63
> 27: (61) r2 = *(u32 *)(r3 +0)
> R3 unbounded memory access, make sure to bounds check any such access
That's easy to fix.
Then something like:
#define tld_get_data(tld_obj, key) \
({ \
	void *data = tld_obj->data_map->data; \
	if (data) \
		data += tld_obj->key_map->key.off & (PAGE_SIZE - 1); \
	data; \
})
size is really not needed. The verifier sees it as one page.
Bad bpf prog can write into the wrong key and the verifier cannot stop it.
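A user-space analogue of that mask-based bounding (hypothetical helper, for illustration only): masking the offset to [0, PAGE_SIZE) gives the verifier a hard bound without a size check, at the cost of silently wrapping a bad offset into the page instead of rejecting it.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/* Illustrative: the mask bounds the pointer to the one-page data area,
 * so no explicit size check is needed, but a wrong offset still lands
 * somewhere inside the page rather than failing. */
static void *bounded_ptr(char *page, uint16_t off)
{
	if (!page)
		return NULL;
	return page + (off & (PAGE_SIZE - 1));
}
```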
> > Bigger question.. can we cache the whole pointer for each key ?
> > and then
> > #define tld_get_data(tld_obj, key) tld_obj->key_map->key
>
> Maybe define the member type of tld_key_map as __uptr and allow bpf
> programs to update a uptr field with a valid uptr?
yeah. That indeed gets complicated. Maybe it's possible with some
verifier changes, but let's not go there yet.
The tld_get_data() proposed above is speedy enough.