Message-ID: <CAEf4BzY3oYWkUshYD7ybiB5bcGoLnQxukYObmgRtRZoEi=ZMTw@mail.gmail.com>
Date: Thu, 1 May 2025 19:22:31 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Amery Hung <ameryhung@...il.com>, bpf <bpf@...r.kernel.org>, 
	Network Development <netdev@...r.kernel.org>, Andrii Nakryiko <andrii@...nel.org>, 
	Daniel Borkmann <daniel@...earbox.net>, Tejun Heo <tj@...nel.org>, 
	Martin KaFai Lau <martin.lau@...nel.org>, Kernel Team <kernel-team@...a.com>
Subject: Re: [PATCH RFC v3 0/2] Task local data API

On Thu, May 1, 2025 at 4:27 PM Alexei Starovoitov
<alexei.starovoitov@...il.com> wrote:
>
> On Thu, May 1, 2025 at 1:37 PM Andrii Nakryiko
> <andrii.nakryiko@...il.com> wrote:
> >
> > Lack of support for shared libraries is a big limitation, IMO, so I
> > think we should design for that support from the very beginning.
> >
> > FWIW, I think these compile-time static __thread variable definitions
> > are unnecessarily limiting and add a non-trivial amount of complexity.
> > I think it's more reasonable to go with a more dynamic approach, and
> > I'll try to outline what an API (and some implementation details)
> > might look like.
> >
> > Note, all this is related to the user-space side of things. The BPF
> > side is unchanged, effectively, except you get a guarantee that all
> > the data will definitely be page-aligned, so you won't need that
> > two-uptr-pages handling.
> >
> > First, data structures. You'll have one per-process metadata structure
> > describing known keys and their assigned "offsets":
> >
> > struct tld_metadata {
> >     int cnt;
> >     char keys[MAX_KEY_CNT];
> >     __u16 offs[MAX_KEY_CNT];
> >     __u16 szs[MAX_KEY_CNT];
> > };
>
> In Amery's proposal it's
> u16 key_cnt;
> struct {
>   char key[TASK_LOCAL_DATA_KEY_LEN];
>   u16 off;
> } keys[];
>
> If we start allocation from the end of the page then we
> don't need 'szs' array.

I wasn't trying to optimize those few bytes taken by szs, tbh.
Allocating from the end of the page bakes in the assumption that we
won't ever need more than one page. I don't know if I'd do that. But
we can just track "next available offset" instead, so it doesn't
really matter much.

>
> > typedef struct { int off; } tld_off_t;
>
> This is a good tweak.
>
> > tld_off_t tld_add_key_def(const char *key_name, size_t key_size);
>
> Dropping bpf_ prefix and sticking to tld_ is a good move too.
>
> > This API can be called just once for each key that the process cares
> > about. And this can be done at any point, really, very dynamically.
> > The implementation will:
> >   - (just once per process) open pinned BPF map, remember its FD;
> >   - (just once) allocate struct tld_metadata, unless we define it as
> > pre-allocated global variable;
>
> The main advantage of the new scheme is support for shared libraries.
> That's a big plus, but the requirement that everything (shared libs,
> static libs, application itself) has to go through this library
> (it would have to be shared ?) is quite inconvenient imo.

It's simple enough to be contained within a single .h file. Everything
else is just __weak symbols for per-process state, so I'm not sure
this is a limitation of any sort.

>
> Let's tweak this proposal and make the kernel the source of truth.
> Then shared libs, static and application would only need to agree
> on the protocol of accessing bpf task local storage instead of
> being forced to use this "tld" library because it's stateful.
> "tld" library will be there, but it will be stateless.

Yes, data structures and the protocol of accessing and synchronizing
the updates is what matters.

>
> We may need to tweak the kernel uptr api a bit, but at a high level
> let all user space things assume that two __uptr-s in that special
> task local storage map are the source of truth.
> The first thing the user space (.so, .a or a.out) will do is
> map_lookup_elem(tls_map_fd), and if the __uptr-s are pointing somewhere,
> use that memory regardless of which part of this process allocated it
> (assuming that .so, .a or a.out won't be freeing it until exit).
> If __uptr udata is still NULL, allocate page-aligned page and
> map_update_elem(). The kernel doesn't need to deal with races
> here since it's a task local value.

Sure. Keep in mind, though, that in practice, whatever different
libraries constitute your application, they all will need to agree on
how the BPF task local storage map FD is obtained.

> If __uptr umetadata is there after lookup, use that to find
> and add keys. Here we need some generic locking mechanism,
> since umetadata is per-process. Like test-and-set spinlock
> that sits within umetadata region and all .so-s, .a-s, a.out
> have the same algorithm to lock and access it.

Yep, see below.

>
> If not there, allocate a page of umetadata and map_update_elem().
> Here the kernel would need to deal with races, but I hope
> BPF_NOEXIST should already work correctly? I haven't checked.
> Worst case we'd need to add support for BPF_F_LOCK (if not already there).

yeah, BPF_NOEXIST should work, I wouldn't do BPF_F_LOCK.

>
> >   - (locklessly) check if key_name is already in tld_metadata, if yes
> > - return already assigned offset;
> >   - (locklessly) if not, add this key and assign it offset that is
> > offs[cnt - 1] + szs[cnt - 1] (i.e., we just tightly pack all the
> > values, though we should take care of alignment requirements, of
> > course);
>
> I'm not quite sure how different processes can do it locklessly.

There are no different processes, it's all one process, many
threads... Or is that what you meant? tld_metadata is *per process*,
tld_data is *per thread*. Processes don't need to coordinate anything
between themselves, only threads within the process.

As for how I'd do offset allocation and key addition locklessly: you
are right that it can't be done completely locklessly, but just
looping and yielding probably would be fine.

Then the sequence of adding the key would be something like below.
I've modified tld_metadata a bit to make this simpler and more
economical (and I fixed the definition of keys to be an array of
arrays of chars, oops):

struct tld_metadata {
    int cnt;
    int next_off;
    char keys[MAX_KEY_CNT][MAX_KEY_LEN];
    __s16 offs[MAX_KEY_CNT]; /* signed, so -1 can mark "in progress" */
};

struct tld_metadata *m = ...;
const char *new_key = ...;
int i = 0;

/* all m->offs[i] are set to -1 on creation */
again:
    int key_cnt = m->cnt;
    for (; i < key_cnt; i++) {
        while (m->offs[i] < 0) /* update in progress */
            sched_yield();

        if (strcmp(m->keys[i], new_key) == 0)
            return m->offs[i];
    }

    if (!cmpxchg(&m->cnt, key_cnt, key_cnt + 1))
        goto again; /* we raced, key might have been added already;
                       recheck, but keep i */

    /* slot key_cnt is ours, we need to calculate and assign offset */
    int new_off = m->next_off;
    m->next_off = new_off + key_sz;

    m->keys[key_cnt][0] = '\0';
    strncat(m->keys[key_cnt], new_key, MAX_KEY_LEN - 1);

    /* MEMORY BARRIERS SHOULD BE CAREFULLY CONSIDERED */

    m->offs[key_cnt] = new_off; /* this finalizes the key -> offset
                                   assignment */

    /* MEMORY BARRIERS SHOULD BE CAREFULLY CONSIDERED */

    return new_off; /* we are done */

Something like that. There is that looping and yield to not miss
someone else winning the race and adding a key, so that's the locking
part. But given that adding a key definition is supposed to be a
one-time operation (per key), I don't think we should be fancy with
locking.

> But if we allocate from the top of the page and there is only one
> 'offset' field then we can do it lockless. Like:
> allocate(size sz)
> {
>        size_t old = atomic_read(&obj->offset);
>
>        if ((ssize_t)(old - sz) >= 0 &&
>            atomic_cmpxchg(&obj->offset, old, old - sz) == old)
>                return ..; // success

does it really matter whether it's from the top or bottom of the page?


> }
>
> No suggestions for the rest of the proposal.
