Message-ID: <CAMB2axNTyTaqBcFRLQ0VueztMdunv71jygeij00gDNY7i9=K6A@mail.gmail.com>
Date: Mon, 19 May 2025 23:44:24 -0700
From: Amery Hung <ameryhung@...il.com>
To: Tejun Heo <tj@...nel.org>
Cc: bpf@...r.kernel.org, netdev@...r.kernel.org, alexei.starovoitov@...il.com,
andrii@...nel.org, daniel@...earbox.net, memxor@...il.com,
martin.lau@...nel.org, kernel-team@...a.com
Subject: Re: [PATCH bpf-next v4 1/3] selftests/bpf: Introduce task local data
On Mon, May 19, 2025 at 1:04 PM Tejun Heo <tj@...nel.org> wrote:
>
> Hello,
>
> On Thu, May 15, 2025 at 02:16:00PM -0700, Amery Hung wrote:
> ...
> > +#define PAGE_SIZE 4096
>
> This might conflict with other definitions. Looks like non-4k page sizes are
> a lot more popular on arm. Would this be a problem?
>
> > +static int __tld_init_metadata(int map_fd)
> > +{
> > + struct u_tld_metadata *new_metadata;
> > + struct tld_map_value map_val;
> > + int task_fd = 0, err;
> > +
> > + task_fd = syscall(SYS_pidfd_open, getpid(), 0);
> > + if (task_fd < 0) {
> > + err = -errno;
> > + goto out;
> > + }
> > +
> > + new_metadata = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
>
> Is 4k size limit from UPTR? Is it still 4k on machines with >4k pages? If
> this isn't a hard limit from UPTR, would it make sense to encode the size in
> the header part of the metadata?
>
The UPTR size limit is one page. I will make PAGE_SIZE arch-dependent. For
the metadata, since all threads of a process share one metadata page, I
think it is okay to keep it a fixed size. For the data, I think it makes
sense to encode the size in the header of the metadata.
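A minimal sketch of making the page size arch-independent by querying it at
runtime instead of hard-coding 4096 (illustrative only; `tld_page_size` is a
hypothetical helper, and the final patch may use a different mechanism):

```c
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical replacement for the hard-coded "#define PAGE_SIZE 4096":
 * ask the kernel for the actual page size once at runtime, so the
 * page-aligned metadata/data allocations are also correct on arm64
 * configurations with 16k or 64k pages.
 */
static long tld_page_size(void)
{
	static long page_size;

	if (!page_size)
		page_size = sysconf(_SC_PAGESIZE);
	return page_size;
}
```

The allocation sites would then call, e.g.,
`aligned_alloc(tld_page_size(), tld_page_size())` instead of using the
compile-time constant.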
> > +static int __tld_init_data(int map_fd)
> > +{
> > + struct u_tld_data *new_data = NULL;
> > + struct tld_map_value map_val;
> > + int err, task_fd = 0;
> > +
> > + task_fd = syscall(SYS_pidfd_open, gettid(), PIDFD_THREAD);
> > + if (task_fd < 0) {
> > + err = -errno;
> > + goto out;
> > + }
> > +
> > + new_data = aligned_alloc(PAGE_SIZE, TLD_DATA_SIZE);
>
> Ditto.
>
> Noob question. Does this means that each thread will map a 4k page no matter
> how much data it actually uses?
Unfortunately, this is currently the case, but maybe we can make the
data size dynamic.
>
> > +__attribute__((unused))
> > +static tld_key_t tld_create_key(int map_fd, const char *name, size_t size)
> > +{
> > + int err, i, cnt, sz, off = 0;
> > +
> > + if (!READ_ONCE(tld_metadata_p)) {
> > + err = __tld_init_metadata(map_fd);
> > + if (err)
> > + return (tld_key_t) {.off = err};
> > + }
> > +
> > + if (!tld_data_p) {
> > + err = __tld_init_data(map_fd);
> > + if (err)
> > + return (tld_key_t) {.off = err};
> > + }
> > +
> > + size = round_up(size, 8);
> > +
> > + for (i = 0; i < TLD_DATA_CNT; i++) {
> > +retry:
> > + cnt = __atomic_load_n(&tld_metadata_p->cnt, __ATOMIC_RELAXED);
> > + if (i < cnt) {
> > + /*
> > + * Pending tld_create_key() uses size to signal if the metadata has
> > + * been fully updated.
> > + */
> > + while (!(sz = __atomic_load_n(&tld_metadata_p->metadata[i].size,
> > + __ATOMIC_ACQUIRE)))
> > + sched_yield();
> > +
> > + if (!strncmp(tld_metadata_p->metadata[i].name, name, TLD_NAME_LEN))
> > + return (tld_key_t) {.off = -EEXIST};
> > +
> > + off += sz;
> > + continue;
> > + }
> > +
> > + if (off + size > TLD_DATA_SIZE)
> > + return (tld_key_t) {.off = -E2BIG};
> > +
> > + /*
> > + * Only one tld_create_key() can increase the current cnt by one and
> > + * takes the latest available slot. Other threads will check again if a new
> > + * TLD can still be added, and then compete for the new slot after the
> > + * succeeding thread update the size.
> > + */
> > + if (!__atomic_compare_exchange_n(&tld_metadata_p->cnt, &cnt, cnt + 1, true,
> > + __ATOMIC_RELAXED, __ATOMIC_RELAXED))
> > + goto retry;
> > +
> > + strncpy(tld_metadata_p->metadata[i].name, name, TLD_NAME_LEN);
> > + __atomic_store_n(&tld_metadata_p->metadata[i].size, size, __ATOMIC_RELEASE);
> > + return (tld_key_t) {.off = off};
> > + }
> > +
> > + return (tld_key_t) {.off = -ENOSPC};
> > +}
>
> This looks fine to me but I wonder whether run-length encoding the key
> strings would be more efficient and less restrictive in terms of key length.
> e.g.:
>
> struct key {
> u32 data_len;
> u16 key_off;
> u16 key_len;
> };
>
> struct metadata {
> struct key keys[MAX_KEYS];
> char key_strs[SOME_SIZE];
> };
>
> The logic can be mostly the same. The only difference would be that key
> string is not inline. Determine winner in the creation path by cmpxchg'ing
> on data_len, but set key_off and key_len only after key string is updated.
> Losing on cmpxchg or seeing an entry where key_len is zero means that
> one lost and should relax and retry. It can still use the same 4k metadata
> page but will likely be able to allow more keys while also relaxing
> restrictions on key length.
>
> Hmm... maybe making the key string variably sized makes things difficult for
> the BPF code. If so (or for any other reasons), please feel free to ignore
> the above.
I think this is a great suggestion. The current implementation may
waste space in the metadata if a key does not use all 62 bytes. I don't
see an obvious obstacle on the BPF side. I will try to incorporate this
in the next respin.
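For reference, the suggested layout could look roughly like the sketch below
(user-space illustration with placeholder names, sizes, and C11 atomics;
`md_add_key` stands in for the creation path and omits bounds checking of the
string area for brevity):

```c
#include <stdatomic.h>
#include <string.h>

#define MAX_KEYS 64
#define STR_AREA 1024

/* Key strings live out-of-line in key_strs, so the key length is only
 * bounded by the shared string area, not a fixed per-slot field.
 */
struct key_desc {
	_Atomic unsigned int data_len;   /* cmpxchg'd to claim the slot */
	unsigned short key_off;          /* offset into key_strs */
	_Atomic unsigned short key_len;  /* published last, 0 == not ready */
};

struct metadata {
	_Atomic unsigned int str_used;
	struct key_desc keys[MAX_KEYS];
	char key_strs[STR_AREA];
};

/* Claim keys[i] by winning the cmpxchg on data_len, then publish the
 * string. Readers treat key_len == 0 as "entry not fully written yet"
 * and retry, matching the size-based handshake in tld_create_key().
 */
static int md_add_key(struct metadata *md, int i, const char *name,
		      unsigned int data_len)
{
	unsigned int expected = 0;
	unsigned int len = strlen(name);
	unsigned int off;

	if (!atomic_compare_exchange_strong(&md->keys[i].data_len,
					    &expected, data_len))
		return -1;	/* lost the race; caller retries next slot */

	off = atomic_fetch_add(&md->str_used, len);
	memcpy(&md->key_strs[off], name, len);
	md->keys[i].key_off = off;
	/* Release: readers who observe key_len != 0 also see the string. */
	atomic_store_explicit(&md->keys[i].key_len, (unsigned short)len,
			      memory_order_release);
	return 0;
}
```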
>
> > +#endif /* __TASK_LOCAL_DATA_H */
> > diff --git a/tools/testing/selftests/bpf/progs/task_local_data.bpf.h b/tools/testing/selftests/bpf/progs/task_local_data.bpf.h
> > new file mode 100644
> > index 000000000000..5f48e408a5e5
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/task_local_data.bpf.h
> ...
> > +/**
> > + * tld_get_data() - Retrieves a pointer to the TLD associated with the key.
> > + *
> > + * @tld_obj: A pointer to a valid tld_object initialized by tld_object_init()
> > + * @key: The key of a TLD saved in tld_maps
> > + * @size: The size of the TLD. Must be a known constant value
> > + *
> > + * Returns a pointer to the TLD data associated with the key; NULL if the key
> > + * is not valid or the size is too big
> > + */
> > +#define tld_get_data(tld_obj, key, size) \
> > + __tld_get_data(tld_obj, (tld_obj)->key_map->key.off - 1, size)
> > +
> > +__attribute__((unused))
> > +__always_inline void *__tld_get_data(struct tld_object *tld_obj, u32 off, u32 size)
> > +{
> > + return (tld_obj->data_map->data && off >= 0 && off < TLD_DATA_SIZE - size) ?
> > + (void *)tld_obj->data_map->data + off : NULL;
> > +}
>
> Neat.
>
> Generally looks great to me. The only thing I wonder is whether the data
> area sizing can be determined at init time rather than fixed to 4k.
>
I think we can achieve this by first limiting tld_create_key() to the
init phase (i.e., only calling it in a C/C++ constructor). Then,
tld_create_key() will not allocate memory for the data. Instead, on the
first call to tld_get_data(), we freeze the size of the data area and
allocate just enough memory, or round it up to the next power of two.
For C, we can define a new macro API (e.g., tld_define_key()) that
generates an __attribute__((constructor)) function that in turn calls
tld_create_key().
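A rough sketch of what such a macro could generate (`tld_define_key` and
`record_key` are hypothetical names; `record_key` stands in for the real
`tld_create_key(map_fd, ...)` so the registration side can be shown
self-contained):

```c
#include <stddef.h>

/* Totals accumulated before main() runs; on the first tld_get_data()
 * the data area would be sized from tld_total_size instead of a fixed
 * page.
 */
static size_t tld_total_size;
static int tld_nr_keys;

static void record_key(const char *name, size_t size)
{
	(void)name;
	/* Mirror the round_up(size, 8) done by tld_create_key(). */
	tld_total_size += (size + 7) & ~(size_t)7;
	tld_nr_keys++;
}

/* Emit a constructor that registers the key before main() runs, so all
 * keys are known by the time the data area is first sized.
 */
#define tld_define_key(name, size)				\
	__attribute__((constructor))				\
	static void __tld_ctor_##name(void)			\
	{							\
		record_key(#name, size);			\
	}

tld_define_key(prio, 4)
tld_define_key(stats, 24)
```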
> Thanks.
>
> --
> tejun