[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20230821202829.2163744-1-mjguzik@gmail.com>
Date: Mon, 21 Aug 2023 22:28:27 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: linux-kernel@...r.kernel.org
Cc: dennis@...nel.org, tj@...nel.org, cl@...ux.com,
akpm@...ux-foundation.org, shakeelb@...gle.com, linux-mm@...ck.org,
Mateusz Guzik <mjguzik@...il.com>
Subject: [PATCH 0/2] execve scalability issues, part 1
To start I figured I'm going to bench about as friendly case as it gets
-- statically linked *separate* binaries all doing execve in a loop.
I borrowed the bench from found here:
http://apollo.backplane.com/DFlyMisc/doexec.c
$ cc -static -O2 -o static-doexec doexec.c
$ ./static-doexec $(nproc)
It prints a result every second (warning: first line is garbage).
My test box is temporarily only 26 cores and even at this scale I run
into massive lock contention stemming from back-to-back calls to
percpu_counter_init (and _destroy later).
While not a panacea, one simple thing to do here is to batch these ops.
Since the term "batching" is already used in the file, I decided to
refer to it as "grouping" instead.
Even if this code could be patched to dodge these counters, I would
argue a high-traffic alloc/free consumer is only a matter of time so it
makes sense to facilitate it.
With the fix I get an ok win, to quote from the commit:
> Even at a very modest scale of 26 cores (ops/s):
> before: 133543.63
> after: 186061.81 (+39%)
> While with the patch these allocations remain a significant problem,
> the primary bottleneck shifts to:
>
> __pv_queued_spin_lock_slowpath+1
> _raw_spin_lock_irqsave+57
> folio_lruvec_lock_irqsave+91
> release_pages+590
> tlb_batch_pages_flush+61
> tlb_finish_mmu+101
> exit_mmap+327
> __mmput+61
> begin_new_exec+1245
> load_elf_binary+712
> bprm_execve+644
> do_execveat_common.isra.0+429
> __x64_sys_execve+50
> do_syscall_64+46
> entry_SYSCALL_64_after_hwframe+110
I intend to do more work on the area to mostly sort it out, but I would
not mind if someone else took the hammer to folio. :)
With this out of the way I'll be looking at some form of caching to
eliminate these allocs as a problem.
Thoughts?
Mateusz Guzik (2):
pcpcntr: add group allocation/free
fork: group allocation of per-cpu counters for mm struct
include/linux/percpu_counter.h | 19 ++++++++---
kernel/fork.c | 13 ++------
lib/percpu_counter.c | 61 ++++++++++++++++++++++++----------
3 files changed, 60 insertions(+), 33 deletions(-)
--
2.39.2
Powered by blists - more mailing lists