linux-kernel - Re: [PATCH v3 0/2] execve scalability issues, part 1

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ZOjFSwGNxMDT6bqm@snowbird>
Date:   Fri, 25 Aug 2023 08:14:19 -0700
From:   Dennis Zhou <dennis@...nel.org>
To:     Mateusz Guzik <mjguzik@...il.com>
Cc:     linux-kernel@...r.kernel.org, tj@...nel.org, cl@...ux.com,
        akpm@...ux-foundation.org, shakeelb@...gle.com,
        vegard.nossum@...cle.com, linux-mm@...ck.org
Subject: Re: [PATCH v3 0/2] execve scalability issues, part 1

Hello,

On Wed, Aug 23, 2023 at 07:06:07AM +0200, Mateusz Guzik wrote:
> To start I figured I'm going to bench about as friendly case as it gets
> -- statically linked *separate* binaries all doing execve in a loop.
> 
> I borrowed the bench from here:
> http://apollo.backplane.com/DFlyMisc/doexec.c
> 
> $ cc -static -O2 -o static-doexec doexec.c
> $ ./static-doexec $(nproc)
> 
> It prints a result every second.
> 
> My test box is temporarily only 26 cores and even at this scale I run
> into massive lock contention stemming from back-to-back calls to
> percpu_counter_init (and _destroy later).
> 
> While not a panacea, one simple thing to do here is to batch these ops.
> Since the term "batching" is already used in the file, I decided to
> refer to it as "grouping" instead.
> 
> Even if this code could be patched to dodge these counters,  I would
> argue a high-traffic alloc/free consumer is only a matter of time so it
> makes sense to facilitate it.
> 
> With the fix I get an ok win, to quote from the commit:
> > Even at a very modest scale of 26 cores (ops/s):
> > before: 133543.63
> > after:  186061.81 (+39%)
> 
> While with the patch these allocations remain a significant problem,
> the primary bottleneck shifts to:
> 
>     __pv_queued_spin_lock_slowpath+1
>     _raw_spin_lock_irqsave+57
>     folio_lruvec_lock_irqsave+91
>     release_pages+590
>     tlb_batch_pages_flush+61
>     tlb_finish_mmu+101
>     exit_mmap+327
>     __mmput+61
>     begin_new_exec+1245
>     load_elf_binary+712
>     bprm_execve+644
>     do_execveat_common.isra.0+429
>     __x64_sys_execve+50
>     do_syscall_64+46
>     entry_SYSCALL_64_after_hwframe+110
> 
> I intend to do more work on the area to mostly sort it out, but I would
> not mind if someone else took the hammer to folio. :)
> 
> With this out of the way I'll be looking at some form of caching to
> eliminate these allocs as a problem.
> 
> v3:
> - fix !CONFIG_SMP build
> - drop the backtrace from fork commit message
> 
> v2:
> - force bigger alignment on alloc
> - rename "counters" to "nr_counters" and pass prior to lock key
> - drop {}'s for single-statement loops
> 
> 
> Mateusz Guzik (2):
>   pcpcntr: add group allocation/free
>   fork: group allocation of per-cpu counters for mm struct
> 
>  include/linux/percpu_counter.h | 39 ++++++++++++++++++----
>  kernel/fork.c                  | 14 ++------
>  lib/percpu_counter.c           | 61 +++++++++++++++++++++++-----------
>  3 files changed, 77 insertions(+), 37 deletions(-)
> 
> -- 
> 2.41.0
> 

I've applied both to percpu#for-6.6.

Thanks,
Dennis