Message-ID: <CAADnVQL92e=-Nzr0O5Geev4y7cWG2m1UR_D7izF+Rd2ccPMNKQ@mail.gmail.com>
Date: Thu, 1 May 2025 16:26:55 -0700
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Andrii Nakryiko <andrii.nakryiko@...il.com>
Cc: Amery Hung <ameryhung@...il.com>, bpf <bpf@...r.kernel.org>,
Network Development <netdev@...r.kernel.org>, Andrii Nakryiko <andrii@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, Tejun Heo <tj@...nel.org>,
Martin KaFai Lau <martin.lau@...nel.org>, Kernel Team <kernel-team@...a.com>
Subject: Re: [PATCH RFC v3 0/2] Task local data API
On Thu, May 1, 2025 at 1:37 PM Andrii Nakryiko
<andrii.nakryiko@...il.com> wrote:
>
> Lack of support for shared libraries is a big limitation, IMO, so I
> think we should design for that support from the very beginning.
>
> FWIW, I think these compile-time static __thread variable definitions
> are unnecessarily limiting and add a non-trivial amount of complexity.
> I think it's more reasonable to go with a more dynamic approach, and
> I'll try to outline what an API (and some implementation details)
> might look like.
>
> Note, all this is related to the user-space side of things. The BPF
> side is effectively unchanged, except you get a guarantee that all the
> data will definitely be page-aligned, so you won't need the
> two-uptr-pages handling.
>
> First, data structures. You'll have one per-process metadata structure
> describing known keys and their assigned "offsets":
>
> struct tld_metadata {
>         int cnt;
>         char keys[MAX_KEY_CNT];
>         __u16 offs[MAX_KEY_CNT];
>         __u16 szs[MAX_KEY_CNT];
> };
In Amery's proposal it's

	u16 key_cnt;
	struct {
		char key[TASK_LOCAL_DATA_KEY_LEN];
		u16 off;
	} keys[];

If we start allocation from the end of the page then we
don't need the 'szs' array.
> typedef struct { int off; } tld_off_t;
This is a good tweak.
> tld_off_t tld_add_key_def(const char *key_name, size_t key_size);
Dropping bpf_ prefix and sticking to tld_ is a good move too.
> This API can be called just once for each key that the process cares
> about. And this can be done at any point, really, very dynamically.
> The implementation will:
> - (just once per process) open pinned BPF map, remember its FD;
> - (just once) allocate struct tld_metadata, unless we define it as
> pre-allocated global variable;
The main advantage of the new scheme is support for shared libraries.
That's a big plus, but the requirement that everything (shared libs,
static libs, the application itself) has to go through this library
(which would have to be shared?) is quite inconvenient, IMO.
Let's tweak this proposal and make the kernel the source of truth.
Then shared libs, static libs, and the application would only need to
agree on the protocol for accessing bpf task local storage instead of
being forced to use this "tld" library because it's stateful.
The "tld" library will still be there, but it will be stateless.
We may need to tweak the kernel uptr API a bit, but at a high level
let all user space components assume that the two __uptr-s in that
special task local storage map are the source of truth.
The first thing user space (.so, .a, or a.out) will do is
map_lookup_elem(tls_map_fd), and if the __uptr-s are pointing somewhere,
use that memory regardless of which part of this process allocated it
(assuming that the .so, .a, or a.out won't be freeing it until exit).
If __uptr udata is still NULL, allocate a page-aligned page and
map_update_elem(). The kernel doesn't need to deal with races
here since it's a task-local value.
If __uptr umetadata is there after the lookup, use it to find
and add keys. Here we need some generic locking mechanism,
since umetadata is per-process. For example, a test-and-set spinlock
that sits within the umetadata region, with all .so-s, .a-s, and the
a.out using the same algorithm to lock and access it.
If it's not there, allocate a page of umetadata and map_update_elem().
Here the kernel would need to deal with races, but I hope
BPF_NOEXIST already works correctly? I haven't checked.
Worst case we'd need to add support for BPF_F_LOCK (if it's not
already there).
> - (locklessly) check if key_name is already in tld_metadata; if yes,
> return the already assigned offset;
> - (locklessly) if not, add this key and assign it the offset
> offs[cnt - 1] + szs[cnt - 1] (i.e., we just tightly pack all the
> values), though we should take care of alignment requirements, of
> course;
I'm not quite sure how different processes can do it locklessly.
But if we allocate from the top of the page and there is only one
'offset' field, then we can do it locklessly. Like:
allocate(size_t sz)
{
	size_t old = atomic_read(&obj->offset);

	if ((ssize_t)(old - sz) >= 0 &&
	    atomic_cmpxchg(&obj->offset, old, old - sz) == old)
		return ..; /* success: the new value's offset is old - sz */
	/* otherwise: raced with another allocator or out of space; retry */
}
No suggestions for the rest of the proposal.