Message-ID: <CAGudoHETe9=un510HBh6-rwwyoE+qHYNLoxDK8-suTgbTfN++w@mail.gmail.com>
Date: Sun, 18 Jan 2026 13:51:48 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: oleg@...hat.com
Cc: brauner@...nel.org, linux-kernel@...r.kernel.org, 
	akpm@...ux-foundation.org, linux-mm@...ck.org, willy@...radead.org
Subject: Re: [PATCH v3 0/2] further damage-control lack of clone scalability

On Sat, Dec 6, 2025 at 2:20 PM Mateusz Guzik <mjguzik@...il.com> wrote:
>
> When spawning and killing threads in separate processes in parallel,
> the primary bottleneck on the stock kernel is pidmap_lock, largely
> because of a back-to-back acquire of the lock in the common case.
>
> Benchmark code at the end.
>
> With this patchset alloc_pid() takes the lock only once, which
> alleviates the problem. While scalability improves, the lock remains
> the primary bottleneck by a large margin.
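>
> Roughly, the before/after shape is as follows (a simplified sketch,
> not the actual diff):
>
>         /* before: two back-to-back critical sections on the lock */
>         spin_lock_irq(&pidmap_lock);
>         nr = idr_alloc(&ns->idr, NULL, ...);  /* reserve the number */
>         spin_unlock_irq(&pidmap_lock);
>         ...
>         spin_lock_irq(&pidmap_lock);
>         idr_replace(&ns->idr, pid, nr);       /* make the pid visible */
>         spin_unlock_irq(&pidmap_lock);
>
>         /* after: one critical section covering both steps */
>         spin_lock_irq(&pidmap_lock);
>         nr = idr_alloc(&ns->idr, pid, ...);
>         spin_unlock_irq(&pidmap_lock);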
>
> I believe idr is a poor choice for the task at hand to begin with,
> but sorting that out is beyond the scope of this patchset. At the
> same time, any replacement would be best evaluated against a state
> where the above relock problem is already fixed.
>
> Performance improvement varies between reboots. When benchmarking
> with 20 processes creating and killing threads in a loop, the
> unpatched baseline hovers around 465k ops/s, while patched is
> anything between ~510k ops/s and ~560k ops/s depending on false
> sharing (which I only minimally sanitized). So this is at least a
> 10% improvement if you are unlucky.
>

I had another look at the problem, with an eye toward next steps
after this patchset.

I found the primary problem is pidfs support -- commenting it out
gives me about a 40% boost. After that, the top of the profile is
cgroups, with its 3 lock acquisitions per thread life cycle:

@[
    __pv_queued_spin_lock_slowpath+1
    _raw_spin_lock_irqsave+45
    cgroup_task_dead+33
    finish_task_switch.isra.0+555
    schedule_tail+11
    ret_from_fork+27
    ret_from_fork_asm+26
]: 2550200
@[
    __pv_queued_spin_lock_slowpath+1
    _raw_spin_lock_irq+38
    cgroup_post_fork+57
    copy_process+5993
    kernel_clone+148
    __do_sys_clone3+188
    do_syscall_64+78
    entry_SYSCALL_64_after_hwframe+118
]: 3486368
@[
    __pv_queued_spin_lock_slowpath+1
    _raw_spin_lock_irq+38
    cgroup_can_fork+110
    copy_process+4940
    kernel_clone+148
    __do_sys_clone3+188
    do_syscall_64+78
    entry_SYSCALL_64_after_hwframe+118
]: 3487665
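
For reference, profiles of this shape can be collected with a
bpftrace one-liner along these lines (a sketch; assumes bpftrace and
kprobes are available):

    bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack] = count(); }'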

Currently the pidfs side of this is implemented with a red-black
tree.

Whatever the replacement, it should be faster and have its own
non-global locking.

I don't know what's available in the kernel to deal with it instead.
Is it rhashtable?
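
For the sake of discussion, a minimal sketch of what an
rhashtable-backed lookup might look like (all names here are made up,
lifetime and error handling elided):

#include <linux/rhashtable.h>

/* hypothetical per-pid entry, keyed by the pidfs inode number */
struct pidfs_entry {
        u64 ino;
        struct rhash_head node;
};

static const struct rhashtable_params pidfs_params = {
        .key_len        = sizeof(u64),
        .key_offset     = offsetof(struct pidfs_entry, ino),
        .head_offset    = offsetof(struct pidfs_entry, node),
        .automatic_shrinking = true,
};

static struct rhashtable pidfs_table;   /* rhashtable_init() at boot */

static int pidfs_add(struct pidfs_entry *e)
{
        return rhashtable_insert_fast(&pidfs_table, &e->node,
                                      pidfs_params);
}

static struct pidfs_entry *pidfs_find(u64 ino)
{
        return rhashtable_lookup_fast(&pidfs_table, &ino, pidfs_params);
}

Lookups there are RCU-based and inserts only take per-bucket locks,
which would at least address the global-lock aspect.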

I would not mind whatsoever if someone else dealt with it. :-)

> bench from will-it-scale:
>
> #include <assert.h>
> #include <pthread.h>
>
> char *testcase_description = "Thread creation and teardown";
>
> /* worker exits immediately; we only measure create/join cost */
> static void *worker(void *arg)
> {
>         return NULL;
> }
>
> /* will-it-scale entry point; nr is the task index, unused here */
> void testcase(unsigned long long *iterations, unsigned long nr)
> {
>         pthread_t thread[1];
>         int error;
>
>         while (1) {
>                 for (int i = 0; i < 1; i++) {
>                         error = pthread_create(&thread[i], NULL, worker, NULL);
>                         assert(error == 0);
>                 }
>                 for (int i = 0; i < 1; i++) {
>                         error = pthread_join(thread[i], NULL);
>                         assert(error == 0);
>                 }
>                 (*iterations)++;
>         }
> }
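>
> This runs under the will-it-scale harness, which supplies main() and
> the per-task plumbing. For a quick standalone run, a driver along
> these lines would do (hypothetical, single task):
>
> #include <pthread.h>
> #include <stdio.h>
> #include <unistd.h>
>
> extern void testcase(unsigned long long *iterations, unsigned long nr);
>
> static unsigned long long iterations;
>
> static void *runner(void *arg)
> {
>         testcase(&iterations, 0);       /* never returns */
>         return NULL;
> }
>
> int main(void)
> {
>         pthread_t t;
>         unsigned long long prev = 0;
>
>         pthread_create(&t, NULL, runner, NULL);
>         for (;;) {
>                 sleep(1);
>                 /* racy read, good enough for a rough ops/s figure */
>                 unsigned long long cur = iterations;
>                 printf("%llu ops/s\n", cur - prev);
>                 prev = cur;
>         }
> }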
>
> v3:
> - fix some whitespace and one typo
> - slightly reword the ENOMEM comment
> - move i-- in the first loop towards the end for consistency with the
>   other loop
> - 2 extra unlikely for initial error conditions
>
> I retained Oleg's r-b as the changes don't affect behavior.
>
> v2:
> - cosmetic fixes from Oleg
> - drop idr_preload_many, relock pidmap + call idr_preload again instead
> - write a commit message
>
> Mateusz Guzik (2):
>   ns: pad refcount
>   pid: only take pidmap_lock once on alloc
>
>  include/linux/ns/ns_common_types.h |   4 +-
>  kernel/pid.c                       | 134 ++++++++++++++++++-----------
>  2 files changed, 89 insertions(+), 49 deletions(-)
>
> --
> 2.48.1
>
