Message-ID: <CAGudoHETe9=un510HBh6-rwwyoE+qHYNLoxDK8-suTgbTfN++w@mail.gmail.com>
Date: Sun, 18 Jan 2026 13:51:48 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: oleg@...hat.com
Cc: brauner@...nel.org, linux-kernel@...r.kernel.org,
akpm@...ux-foundation.org, linux-mm@...ck.org, willy@...radead.org
Subject: Re: [PATCH v3 0/2] further damage-control lack of clone scalability
On Sat, Dec 6, 2025 at 2:20 PM Mateusz Guzik <mjguzik@...il.com> wrote:
>
> When spawning and killing threads in separate processes in parallel, the
> primary bottleneck on the stock kernel is pidmap_lock, largely because
> of a back-to-back acquire in the common case.
>
> Benchmark code at the end.
>
> With this patchset alloc_pid() only takes the lock once and consequently
> alleviates the problem. While scalability improves, the lock remains the
> primary bottleneck by a large margin.
>
> I believe idr is a poor choice for the task at hand to begin with, but
> sorting that out is beyond the scope of this patchset. At the same time
> any replacement would be best evaluated against a state where the
> above relock problem is fixed.
>
> Performance improvement varies between reboots. When benchmarking with
> 20 processes creating and killing threads in a loop, the unpatched
> baseline hovers around 465k ops/s, while patched lands anywhere between
> ~510k and ~560k ops/s depending on false-sharing (which I only minimally
> sanitized). So the improvement is at least ~10% even if you are unlucky.
>
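To recap what the fix amounts to, in rough strokes (a simplified sketch,
not the actual kernel/pid.c -- namespace levels, RESERVED_PIDS handling
and the ENOMEM retry are omitted):

/* Stock: allocate the pid number under the lock... */
idr_preload(GFP_KERNEL);
spin_lock_irq(&pidmap_lock);
nr = idr_alloc_cyclic(&ns->idr, NULL, pid_min, pid_max, GFP_ATOMIC);
spin_unlock_irq(&pidmap_lock);
idr_preload_end();

/* ...then immediately re-acquire it to make the pid visible. */
spin_lock_irq(&pidmap_lock);
idr_replace(&ns->idr, pid, nr);
spin_unlock_irq(&pidmap_lock);

/* Patched: both steps happen under a single acquire. */
idr_preload(GFP_KERNEL);
spin_lock_irq(&pidmap_lock);
nr = idr_alloc_cyclic(&ns->idr, NULL, pid_min, pid_max, GFP_ATOMIC);
idr_replace(&ns->idr, pid, nr);
spin_unlock_irq(&pidmap_lock);
idr_preload_end();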
I had another look at what remains to be done after this patchset.
The primary problem turns out to be pidfs support -- commenting it out
gives me about a 40% boost. Afterwards the top of the profile is
cgroups, with its 3 lock acquires per thread life cycle:
@[
__pv_queued_spin_lock_slowpath+1
_raw_spin_lock_irqsave+45
cgroup_task_dead+33
finish_task_switch.isra.0+555
schedule_tail+11
ret_from_fork+27
ret_from_fork_asm+26
]: 2550200
@[
__pv_queued_spin_lock_slowpath+1
_raw_spin_lock_irq+38
cgroup_post_fork+57
copy_process+5993
kernel_clone+148
__do_sys_clone3+188
do_syscall_64+78
entry_SYSCALL_64_after_hwframe+118
]: 3486368
@[
__pv_queued_spin_lock_slowpath+1
_raw_spin_lock_irq+38
cgroup_can_fork+110
copy_process+4940
kernel_clone+148
__do_sys_clone3+188
do_syscall_64+78
entry_SYSCALL_64_after_hwframe+118
]: 3487665
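(The stacks above are aggregated lock-contention counts; the format
matches a bpftrace kernel-stack histogram, collected with something
along the lines of the one-liner below, assuming bpftrace is available:

bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack] = count(); }'
)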
Currently the pidfs side is implemented with a red-black tree.
Whatever the replacement, it should be faster and have its own
non-global locking. I don't know what's available in the kernel
to use instead -- is it rhashtable? I would not mind whatsoever
if someone else dealt with it. :-) A rough sketch of that flavor
is below.
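For illustration only -- every name here is made up for the sketch,
and this is not the actual pidfs code. rhashtable does per-bucket
locking for inserts/removals and RCU-protected lookups, so there is
no single global lock to contend on:

#include <linux/rhashtable.h>

struct pidfs_entry {
	u64 ino;			/* lookup key */
	struct rhash_head node;		/* table linkage */
};

static const struct rhashtable_params pidfs_ht_params = {
	.key_len	= sizeof_field(struct pidfs_entry, ino),
	.key_offset	= offsetof(struct pidfs_entry, ino),
	.head_offset	= offsetof(struct pidfs_entry, node),
	.automatic_shrinking = true,
};

static struct rhashtable pidfs_ht;

static int __init pidfs_ht_init(void)
{
	return rhashtable_init(&pidfs_ht, &pidfs_ht_params);
}

static int pidfs_ht_add(struct pidfs_entry *e)
{
	return rhashtable_insert_fast(&pidfs_ht, &e->node, pidfs_ht_params);
}

/* Caller must ensure the entry cannot be freed out from under it. */
static struct pidfs_entry *pidfs_ht_find(u64 ino)
{
	return rhashtable_lookup_fast(&pidfs_ht, &ino, pidfs_ht_params);
}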
> bench from will-it-scale:
>
> #include <assert.h>
> #include <pthread.h>
>
> char *testcase_description = "Thread creation and teardown";
>
> static void *worker(void *arg)
> {
> return (NULL);
> }
>
> void testcase(unsigned long long *iterations, unsigned long nr)
> {
> pthread_t thread[1];
> int error;
>
> while (1) {
> for (int i = 0; i < 1; i++) {
> error = pthread_create(&thread[i], NULL, worker, NULL);
> assert(error == 0);
> }
> for (int i = 0; i < 1; i++) {
> error = pthread_join(thread[i], NULL);
> assert(error == 0);
> }
> (*iterations)++;
> }
> }
>
> v3:
> - fix some whitespace and one typo
> - slightly reword the ENOMEM comment
> - move i-- in the first loop towards the end for consistency with the
> other loop
> - 2 extra unlikely for initial error conditions
>
> I retained Oleg's r-b as the changes don't affect behavior
>
> v2:
> - cosmetic fixes from Oleg
> - drop idr_preload_many, relock pidmap + call idr_preload again instead
> - write a commit message
>
> Mateusz Guzik (2):
> ns: pad refcount
> pid: only take pidmap_lock once on alloc
>
> include/linux/ns/ns_common_types.h | 4 +-
> kernel/pid.c | 134 ++++++++++++++++++-----------
> 2 files changed, 89 insertions(+), 49 deletions(-)
>
> --
> 2.48.1
>