Message-ID: <CAGudoHFKo8mgxPYVitf9gGWAtQwJfHos_k-zB-hk=4FiCiYVPA@mail.gmail.com>
Date: Sun, 23 Nov 2025 23:33:47 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Matthew Wilcox <willy@...radead.org>
Cc: oleg@...hat.com, brauner@...nel.org, linux-kernel@...r.kernel.org,
akpm@...ux-foundation.org, linux-mm@...ck.org
Subject: Re: [PATCH 0/3] further damage-control lack of clone scalability
On Sun, Nov 23, 2025 at 10:45 PM Matthew Wilcox <willy@...radead.org> wrote:
>
> On Sun, Nov 23, 2025 at 05:39:16PM +0100, Mateusz Guzik wrote:
> > I have some recollection we talked about this on irc long time ago.
> >
> > It is my *suspicion* this would be best served with a sparse bitmap +
> > a hash table.
>
> Maybe! I've heard other people speculate that would be a better data
> structure. I know we switched away from a hash table for the page
> cache, but that has a different usage pattern where it's common to go
> from page N to page N+1, N+2, ... Other than ps, I don't think we often
> have that pattern for PIDs.
>
Well, it does seem like the obvious choice to try, and like I said it
was even tried in Linux. It also happens to be what FreeBSD is doing
and FreeBSD gets better scalability, despite other global
serialization points (that is to say both kernels *suck* on this
front, it just so happens FreeBSD sucks less).
> > Such a solution was already present, but it got replaced by
> > 95846ecf9dac5089 ("pid: replace pid bitmap implementation with IDR
> > API").
> >
> > Commit message cites the following bench results:
> > The following are the stats for ps, pstree and calling readdir on /proc
> > for 10,000 processes.
> >
> > ps:
> > With IDR API With bitmap
> > real 0m1.479s 0m2.319s
> > user 0m0.070s 0m0.060s
> > sys 0m0.289s 0m0.516s
> >
> > pstree:
> > With IDR API With bitmap
> > real 0m1.024s 0m1.794s
> > user 0m0.348s 0m0.612s
> > sys 0m0.184s 0m0.264s
> >
> > proc:
> > With IDR API With bitmap
> > real 0m0.059s 0m0.074s
> > user 0m0.000s 0m0.004s
> > sys 0m0.016s 0m0.016s
> >
> > Impact on clone was not benchmarked afaics.
>
> It shouldn't be too much effort for you to check out 95846ecf9dac5089
> and 95846ecf9dac5089^ to run your benchmark on both? That would seem
> like the cheapest way of assessing the performance of hash+bitmap
> vs IDR.
>
The commit is 2017 vintage and that kernel has some problems in the
area which I have since fixed. The commit is not easily revertible
either. Even then, the hash as implemented at the time looks suspicious.
The best way forward is to slap together a PoC consisting of the
obvious vmalloc'ed bitmap for the entire PID space + a hash table and
see if that has legs. If so, then one can spend time making it
memory-efficient.
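To make it concrete, I mean something of this shape (all names made
up, locking, error handling and memory efficiency deliberately
ignored, untested):

/*
 * Rough shape of the PoC -- one flat vmalloc'ed bitmap covers the
 * entire PID space, a hash table handles pid nr -> struct pid lookup.
 */
#include <linux/bitmap.h>
#include <linux/hashtable.h>
#include <linux/pid.h>
#include <linux/vmalloc.h>

#define POC_PID_MAX	(4 * 1024 * 1024)	/* PID_MAX_LIMIT on 64-bit */

static unsigned long *poc_pidmap;		/* one bit per pid nr */
static DEFINE_HASHTABLE(poc_pidhash, 12);	/* 4096 buckets */

struct poc_pid_entry {
	int nr;
	struct pid *pid;
	struct hlist_node node;
};

static int __init poc_pidmap_init(void)
{
	poc_pidmap = vzalloc(BITS_TO_LONGS(POC_PID_MAX) * sizeof(long));
	return poc_pidmap ? 0 : -ENOMEM;
}

static int poc_alloc_pidnr(void)
{
	/* the real thing would remember where it left off, like alloc_pid */
	int nr = find_next_zero_bit(poc_pidmap, POC_PID_MAX, 1);

	if (nr >= POC_PID_MAX)
		return -EAGAIN;
	__set_bit(nr, poc_pidmap);
	return nr;
}

static struct pid *poc_find_pid(int nr)
{
	struct poc_pid_entry *e;

	hash_for_each_possible(poc_pidhash, e, node, nr)
		if (e->nr == nr)
			return e->pid;
	return NULL;
}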
> > Regardless, in order to give whatever replacement a fair perf eval
> > against idr, at least the following 2 bits need to get sorted out:
> > - the self-induced repeat locking of pidmap_lock
> > - high cost of kmalloc (to my understanding waiting for sheaves4all)
>
> The nice thing about XArray (compared to IDR) is that there's no
> requirement to preallocate. Only 1.6% of xa_alloc() calls result in
> calling slab.
Technically there is no requirement to preallocate with the IDR
either, as it can also kmalloc with GFP_ATOMIC inside.
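For reference, the two usage patterns side by side (illustrative only,
not lifted from the actual pid code):

#include <linux/idr.h>
#include <linux/xarray.h>

static DEFINE_IDR(poc_idr);
static DEFINE_SPINLOCK(poc_idr_lock);
static DEFINE_XARRAY_ALLOC(poc_xa);

static int poc_idr_get(void *ptr)
{
	int id;

	/*
	 * The preload is optional -- skip it and idr_alloc will try to
	 * kmalloc nodes itself with the gfp flags given below, hence
	 * GFP_ATOMIC under the spinlock.
	 */
	idr_preload(GFP_KERNEL);
	spin_lock(&poc_idr_lock);
	id = idr_alloc(&poc_idr, ptr, 1, 0, GFP_ATOMIC);
	spin_unlock(&poc_idr_lock);
	idr_preload_end();
	return id;
}

static int poc_xa_get(void *ptr)
{
	u32 id;
	int err;

	/* no preload step; the xarray takes its own lock and allocates
	 * nodes internally as needed */
	err = xa_alloc(&poc_xa, &id, ptr, xa_limit_32b, GFP_KERNEL);
	return err ? err : (int)id;
}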
> At one point I kind of had a plan to create a multi-xarray where you had
> multiple xarrays that shared a single lock. Or maybe this sharding is
> exactly what's needed; I haven't really analysed the pid locking to see
> what's needed.
So ignoring bottlenecks outside of pid management, there is the global
pidmap_lock. The xarray patchset I linked above technically speaking
provides per-namespace locking, but that does not cut it as you have
to lock all namespaces anyway, including the global one.
Ultimately, in order to make this scale, CPUs need to stop sharing the
locks (in the common case anyway). To that end the PID space needs to
get partitioned, with ranges allocated to small sets of CPUs (say 8 or
10 or similar -- small enough for contention to not matter outside of
a microbenchmark). The ID space on Linux is enormous, so this should
not be a problem. The only potential issue is that PIDs will no longer
be sequential, which is userspace-visible (I mean, suppose a wanker
knows for a fact nobody else is forking and insists on taking
advantage of it (by Hyrum's law), so that his two forks in a row have
predictable IDs).
Anyhow, with this scheme most real workloads should be able to
allocate IDs without depleting the range assigned to whatever CPUs
they happen to be running on (but deallocating may still need to look
at other CPUs' ranges).
With something like a bitmap + hash this is trivially achievable --
just have a set of bitmaps, each assigned a 'base' which you add to
the ID obtained from the given bitmap (and subtract when freeing). The
hash is a non-problem.
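A sketch of the shape (made-up names, the fallback for a depleted
shard and the sizing handwaved):

#include <linux/bitmap.h>
#include <linux/spinlock.h>

#define SHARD_RANGE	(1 << 20)	/* pids per shard */

struct pid_shard {
	spinlock_t lock;		/* contended only within the shard */
	int base;			/* first pid owned by this shard */
	unsigned long *map;		/* SHARD_RANGE bits */
};

static int shard_alloc_pidnr(struct pid_shard *s)
{
	int nr;

	spin_lock(&s->lock);
	nr = find_first_zero_bit(s->map, SHARD_RANGE);
	if (nr >= SHARD_RANGE) {
		spin_unlock(&s->lock);
		return -EAGAIN;		/* caller falls back to another shard */
	}
	__set_bit(nr, s->map);
	spin_unlock(&s->lock);
	return s->base + nr;		/* translate to the global pid space */
}

static void shard_free_pidnr(struct pid_shard *s, int pidnr)
{
	spin_lock(&s->lock);
	__clear_bit(pidnr - s->base, s->map);	/* subtract base going back */
	spin_unlock(&s->lock);
}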
I don't know about xarray. Bottom line though, even if one were to
pass a lock around to the xarray primitives without sorting out the
range partitioning problem, that would merely be a stopgap.
All that said, I don't know if/when I'll get around to any of this.
I think my addition to the idr API is trivial, and absent someone
volunteering to do the needful(tm) sooner rather than later, it should
not be considered a blocker.
My worry is, should this patchset get stalled, someone is going to
swoop in and submit changes which make it infeasible to only take the
lock once.