Message-ID: <CAGudoHEKMM575UDHpVsbPfOKQQPh55+EKgTkJbbb_oV_kjbzuQ@mail.gmail.com>
Date: Mon, 24 Nov 2025 05:03:39 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Matthew Wilcox <willy@...radead.org>
Cc: oleg@...hat.com, brauner@...nel.org, linux-kernel@...r.kernel.org, 
	akpm@...ux-foundation.org, linux-mm@...ck.org
Subject: Re: [PATCH 0/3] further damage-control lack of clone scalability

On Sun, Nov 23, 2025 at 11:33 PM Mateusz Guzik <mjguzik@...il.com> wrote:
> Ultimately, in order to make this scale, CPUs need to stop sharing the
> locks (in the common case anyway). To that end PID space needs to get
> partitioned, with ranges allocated to small sets of CPUs (say 8 or 10
> or similar -- small enough for contention to not matter outside of a
> microbenchmark). The ID space on Linux is enormous, so this should not
> be a problem. The only potential issue is that PIDs no longer will be
> sequential, which is userspace-visible (I mean, suppose a wanker knows
> for a fact nobody else is forking and insists on taking advantage of
> it per Hyrum's law; then his two forks in a row have predictable IDs).
> Anyhow, most real workloads should be able to allocate IDs without
> depleting the range of whatever CPU set they happen to be running on
> (but deallocation may still need to look at other CPUs' ranges).
>
> With something like a bitmap + hash this is trivially achievable --
> just have a set of bitmaps and assign each a 'base' which you add to
> or subtract from the id obtained from the given bitmap. The hash is a
> non-problem.
>

So I had a bit of a think and came up with something; it boils down
to special-casing the last level of a (quasi-)radix tree. It provides
the scalability I was looking for, albeit with some uglification. What
follows is a sketch.

Note that as part of the allocation policy the kernel must postpone
pid reuse, as in you can't just free/alloc in a tiny range and call it
a day.

Part of the issue for me is that 32 levels of namespace nesting are
allowed. The stock code will relock pidmap_lock for every single
one(!) on alloc, which is just terrible. Suppose one introduces
per-namespace locks; that's still 32 lock trips to grab a pid, and it
still does not solve the scalability problem. For my taste that's
questionable at best, but at the same time this is what the kernel is
already doing, so let's pretend for a minute the relocks are not a
concern.
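
For reference, the per-level relock in the stock code looks roughly
like this (a heavily simplified paraphrase of alloc_pid() in
kernel/pid.c, not the literal code):

	for (i = ns->level; i >= 0; i--) {
		spin_lock_irq(&pidmap_lock);
		nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
				      pid_max, GFP_ATOMIC);
		spin_unlock_irq(&pidmap_lock);
		tmp = tmp->parent;	/* ... and relock for the parent */
	}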

The solution is based on a (quasi-)radix tree where the datum is a
pointer to a struct containing a spinlock, a bitmap and an array of
pids. It will likely end up being an xarray, but I'm going to keep
typing radix for the time being as this is the concept which matters.

The struct contains a carved-out id range and can fit -- say -- 250
entries or so, or whatever else fits in a size considered fine(tm).
The actual pid is obtained by adding the radix id (which serves as a
prefix) to the offset into the array.
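
Concretely, something like this (all names made up for illustration):

	#define PID_CHUNK_SIZE	250

	struct pid_chunk {
		spinlock_t lock;
		int base;	/* first pid of the carved-out range */
		bool exhausted;	/* the last slot was handed out at some point */
		unsigned long bitmap[BITS_TO_LONGS(PID_CHUNK_SIZE)];
		struct pid *pids[PID_CHUNK_SIZE];
	};

	/* the actual pid: radix prefix plus the offset into the array */
	static inline int chunk_pid(struct pid_chunk *c, int slot)
	{
		return c->base + slot;
	}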

In order to alloc a pid for a given namespace, the calling CPU checks
whether it has a range carved out. If so, it locks the thing and looks
for a free pid. Absent a free pid, or a range in the first place, it
goes to the xarray to get space. This avoids synchronisation with
other CPUs for the next 250 or so forks (modulo a thread with an id
from this range exiting on a different CPU), which sorts out the
scalability problem in practical settings.
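
In pseudo-C the fast path would be something along these lines (the
pcpu_chunk field and carve_new_chunk() are invented for the sketch):

	/* ns->pcpu_chunk: per-CPU pointer to the range carved out
	 * for this CPU, NULL if there is none yet */
	struct pid_chunk *c = *this_cpu_ptr(ns->pcpu_chunk);
	int slot, nr = -1;

	if (c) {
		spin_lock(&c->lock);
		/* a real version would scan cyclically from a cursor
		 * to postpone reuse of freed slots */
		slot = find_first_zero_bit(c->bitmap, PID_CHUNK_SIZE);
		if (slot < PID_CHUNK_SIZE) {
			__set_bit(slot, c->bitmap);
			nr = chunk_pid(c, slot);
		}
		spin_unlock(&c->lock);
	}
	if (nr < 0)
		nr = carve_new_chunk(ns);	/* slow path: hit the xarray */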

Of course, once someone notices that no ids from the range are in use
anymore *and* the last id was handed out at some point, the range gets
freed.
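
The free side would look something like this (release_chunk() is made
up and the chunk lifetime handling is hand-waved):

	spin_lock(&c->lock);
	__clear_bit(slot, c->bitmap);
	if (c->exhausted && bitmap_empty(c->bitmap, PID_CHUNK_SIZE))
		release_chunk(c);	/* return the range to the xarray */
	spin_unlock(&c->lock);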

But you have to do the locking for every ns.

So let's say it is in fact a problem, and that it would be preferable
if the CPU could take *one* lock and stick with it for all namespaces,
all while retaining scalability.

Instead, every CPU could have its own pidmap_lock. The struct with the
bitmap + pid array would hold a pointer to a spinlock, referring to the
pidmap lock of whichever CPU allocated the range.

Et voila, allocs still get away with one lock acquire in the common
case where there are free ids in all namespaces which need to be
touched. Contention is only present on xarray locks if you ran out of
space *or* on the pidmap lock if someone is freeing an id. Hell, this
can probably be made lockless to further speed it up if need be.
However, lockless or not, the key point is that most allocs will *not*
be bouncing any cachelines.
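
A sketch of that variant (names invented again; preemption handling
hand-waved):

	static DEFINE_PER_CPU(spinlock_t, pcpu_pidmap_lock);

	struct pid_chunk {
		spinlock_t *lock;	/* pcpu_pidmap_lock of the CPU
					 * which carved out the range */
		...
	};

	/* alloc: chunks cached on this CPU all share its lock, so one
	 * acquire covers every namespace level */
	spinlock_t *lock = this_cpu_ptr(&pcpu_pidmap_lock);

	spin_lock(lock);
	for (i = ns->level; i >= 0; i--)
		nr[i] = chunk_alloc_locked(chunk_of(ns, i));
	spin_unlock(lock);

	/* free: go through the chunk's lock pointer, which may belong
	 * to another CPU */
	spin_lock(c->lock);
	chunk_free_locked(c, slot);
	spin_unlock(c->lock);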
