Message-ID: <20251127233635.4170047-1-krisman@suse.de>
Date: Thu, 27 Nov 2025 18:36:27 -0500
From: Gabriel Krisman Bertazi <krisman@...e.de>
To: linux-mm@...ck.org
Cc: Gabriel Krisman Bertazi <krisman@...e.de>,
linux-kernel@...r.kernel.org,
jack@...e.cz,
Mateusz Guzik <mjguzik@...il.com>,
Shakeel Butt <shakeel.butt@...ux.dev>,
Michal Hocko <mhocko@...nel.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Dennis Zhou <dennis@...nel.org>,
Tejun Heo <tj@...nel.org>,
Christoph Lameter <cl@...two.org>,
Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>,
Mike Rapoport <rppt@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>
Subject: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks

The cost of the pcpu memory allocation is non-negligible for systems
with many cpus, and it is quite visible when forking a new task, as
reported on a few occasions. In particular, Jan Kara reported that the
commit introducing per-cpu counters for rss_stat caused a 10% regression
of system time for gitsource on his system [1]. On that same occasion,
Jan suggested we special-case the single-threaded case: since we know
there won't be frequent remote updates of rss_stats for single-threaded
applications, we could handle it with a local counter for most
updates, and an atomic counter for the infrequent remote updates. This
patchset implements this idea.

It exposes a dual-mode counter that starts as a simple counter, cheap to
initialize on single-threaded tasks, that can be upgraded in flight to a
fully-fledged per-cpu counter later. Patch 3 then modifies the rss_stat
counters to use that structure, forcing the upgrade as soon as a second
task sharing the mm_struct is spawned. By delaying the initialization
cost until the MM is shared, we cover single-threaded applications
fairly cheaply, while not penalizing applications that spawn multiple
threads. On a 256c system, where the pcpu allocation of the rss_stats
is quite noticeable, this has reduced the wall-clock time of an
artificial fork-intensive microbenchmark (calling /bin/true in a loop)
by 6% to 15%, depending on the number of cores. In a more realistic
benchmark, it showed an improvement of 1.5% on kernbench elapsed time.
More performance data, including profiles, is available in the patch
modifying the rss_stat counters.
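
To make the dual-mode idea concrete, here is a minimal userspace model
of it. This is only a sketch under stated assumptions: every name in it
(struct dual_counter, dual_counter_upgrade(), NR_SLOTS, the "owner"
flag) is hypothetical and is not the interface added by
include/linux/lazy_percpu_counter.h. An array of slots stands in for
the per-cpu storage, and the model ignores the synchronization a real
upgrade would need against concurrent updaters.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace model; names are illustrative and are NOT
 * the interface added by include/linux/lazy_percpu_counter.h. */
#define NR_SLOTS 4 /* stand-in for the number of possible CPUs */

struct dual_counter {
	bool upgraded;      /* false: simple mode, true: per-slot mode */
	long local;         /* simple mode: updates from the owning task */
	atomic_long remote; /* simple mode: infrequent remote updates */
	long *slots;        /* per-slot mode: allocated only on upgrade */
};

/* Cheap init: no per-slot allocation happens here. */
void dual_counter_init(struct dual_counter *c)
{
	c->upgraded = false;
	c->local = 0;
	atomic_init(&c->remote, 0);
	c->slots = NULL;
}

/* Pay the allocation cost only once the counter becomes shared. */
int dual_counter_upgrade(struct dual_counter *c)
{
	if (c->upgraded)
		return 0;
	c->slots = calloc(NR_SLOTS, sizeof(*c->slots));
	if (!c->slots)
		return -1;
	/* Fold the simple-mode value into the first slot. */
	c->slots[0] = c->local + atomic_load(&c->remote);
	c->upgraded = true;
	return 0;
}

/* 'owner' marks updates made by the task that owns the counter. */
void dual_counter_add(struct dual_counter *c, int slot, long v, bool owner)
{
	assert(slot >= 0 && slot < NR_SLOTS);
	if (c->upgraded)
		c->slots[slot] += v;
	else if (owner)
		c->local += v;
	else
		atomic_fetch_add(&c->remote, v);
}

/* Read the total in whichever mode the counter is currently in. */
long dual_counter_sum(struct dual_counter *c)
{
	long sum = 0;

	if (!c->upgraded)
		return c->local + atomic_load(&c->remote);
	for (int i = 0; i < NR_SLOTS; i++)
		sum += c->slots[i];
	return sum;
}

void dual_counter_destroy(struct dual_counter *c)
{
	free(c->slots);
	c->slots = NULL;
	c->upgraded = false;
}
```

In this model a single-threaded user only ever touches "local" and
"remote", so init and destroy allocate and free nothing; the call to
dual_counter_upgrade() corresponds to the point where a second task
starts sharing the mm_struct.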

While this patchset exposes a single user of this API, it should be
useful in more cases, which is why I made it into a proper API. In
addition, considering the recent efforts in this area, such as
hierarchical per-cpu counters, which are orthogonal to this work because
they improve multi-threaded workloads, abstracting this with a new API
could help the merging of both works.

Finally, this is an RFC because it is early work. In particular, I'd
be interested in more benchmark suggestions, and I'd like feedback on
whether this new interface should be implemented inside percpu_counters
as lazy counters or as a completely separate interface.

Thanks,
[1] https://lore.kernel.org/all/20230608111408.s2minsenlcjow7q3@quack3
---
Cc: linux-kernel@...r.kernel.org
Cc: jack@...e.cz
Cc: Mateusz Guzik <mjguzik@...il.com>
Cc: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Michal Hocko <mhocko@...nel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc: Dennis Zhou <dennis@...nel.org>
Cc: Tejun Heo <tj@...nel.org>
Cc: Christoph Lameter <cl@...two.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>
Cc: David Hildenbrand <david@...hat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@...cle.com>
Cc: Vlastimil Babka <vbabka@...e.cz>
Cc: Mike Rapoport <rppt@...nel.org>
Cc: Suren Baghdasaryan <surenb@...gle.com>

Gabriel Krisman Bertazi (4):
  lib/percpu_counter: Split out a helper to insert into hotplug list
  lib: Support lazy initialization of per-cpu counters
  mm: Avoid percpu MM counters on single-threaded tasks
  mm: Split a slow path for updating mm counters

 arch/s390/mm/gmap_helpers.c         |   4 +-
 arch/s390/mm/pgtable.c              |   4 +-
 fs/exec.c                           |   2 +-
 include/linux/lazy_percpu_counter.h | 145 ++++++++++++++++++++++++++++
 include/linux/mm.h                  |  26 ++---
 include/linux/mm_types.h            |   4 +-
 include/linux/percpu_counter.h      |   5 +-
 include/trace/events/kmem.h         |   4 +-
 kernel/events/uprobes.c             |   2 +-
 kernel/fork.c                       |  14 ++-
 lib/percpu_counter.c                |  68 ++++++++++---
 mm/filemap.c                        |   2 +-
 mm/huge_memory.c                    |  22 ++---
 mm/khugepaged.c                     |   6 +-
 mm/ksm.c                            |   2 +-
 mm/madvise.c                        |   2 +-
 mm/memory.c                         |  20 ++--
 mm/migrate.c                        |   2 +-
 mm/migrate_device.c                 |   2 +-
 mm/rmap.c                           |  16 +--
 mm/swapfile.c                       |   6 +-
 mm/userfaultfd.c                    |   2 +-
 22 files changed, 276 insertions(+), 84 deletions(-)
 create mode 100644 include/linux/lazy_percpu_counter.h

--
2.51.0