[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250424080755.272925-1-harry.yoo@oracle.com>
Date: Thu, 24 Apr 2025 17:07:48 +0900
From: Harry Yoo <harry.yoo@...cle.com>
To: Vlastimil Babka <vbabka@...e.cz>, Christoph Lameter <cl@...two.org>,
David Rientjes <rientjes@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>,
Mateusz Guzik <mjguzik@...il.com>
Cc: Jamal Hadi Salim <jhs@...atatu.com>, Cong Wang <xiyou.wangcong@...il.com>,
Jiri Pirko <jiri@...nulli.us>, Vlad Buslov <vladbu@...dia.com>,
Yevgeny Kliteynik <kliteyn@...dia.com>, Jan Kara <jack@...e.cz>,
Byungchul Park <byungchul@...com>, linux-mm@...ck.org,
netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
Harry Yoo <harry.yoo@...cle.com>
Subject: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
Overview
========
The slab destructor feature existed in early days of slab allocator(s).
It was removed by the commit c59def9f222d ("Slab allocators: Drop support
for destructors") in 2007 due to lack of serious use cases at that time.
Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
constructor/destructor pair to mitigate the global serialization point
(pcpu_alloc_mutex) that occurs when each slab object allocates and frees
percpu memory during its lifetime.
Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
so each allocate–free cycle requires two expensive acquire/release on
that mutex.
We can mitigate this contention by retaining the percpu regions after
the object is freed and releasing them only when the backing slab pages
are freed.
How to do this with slab constructors and destructors: the constructor
allocates percpu memory, and the destructor frees it when the slab pages
are reclaimed; this slightly alters the constructor’s semantics,
as it can now fail.
This series is functional (although not compatible with MM debug
features yet), but still far from perfect. I’m actively refining it and
would appreciate early feedback before I improve it further. :)
This series is based on slab/for-next [2].
Performance Improvement
=======================
I measured the benefit of this series for two different users:
exec() and tc filter insertion/removal.
exec() throughput
-----------------
The performance of exec() is important when short-lived processes are
frequently created. For example: shell-heavy workloads and running many
test cases [3].
I measured exec() throughput with a microbenchmark:
- 33% of exec() throughput gain on 2-socket machine with 192 CPUs,
- 4.56% gain on a desktop with 24 hardware threads, and
- Even 4% gain on a single-threaded exec() throughput.
Further investigation showed that this was due to the overhead of
acquiring/releasing pcpu_alloc_mutex and its contention.
See patch 7 for more detail on the experiment.
Traffic Filter Insertion and Removal
------------------------------------
Each tc filter allocates three percpu memory regions per tc_action object,
so frequently inserting and removing filters contend heavily on the same
mutex.
In the Linux-kernel tools/testing tc-filter benchmark (see patch 4 for
more detail), I observed a 26% reduction in system time and observed
much less contention on pcpu_alloc_mutex with this series.
I saw in old mailing list threads Mellanox (now NVIDIA) engineers cared
about tc filter insertion rate; these changes may still benefit
workloads they run today.
Feedback Needed from Percpu Allocator Folks
===========================================
As percpu allocator is directly affected by this series, this work
will need support from the percpu allocator maintainers, and we need to
address their concerns.
They will probably say "This is a percpu memory allocator scalability
issue and we need to make it scalable"? I don't know.
What do you say?
Some hanging thoughts:
- Tackling the problem on the slab side is much simpler, because the slab
allocator already caches objects per CPU. Re-creating similar logic
inside the percpu allocator would be redundant.
Also, since this is opt-in per slab cache, other percpu allocator
users remain unaffected.
- If fragmentation is a concern, we could probably allocate larger percpu
chunks and partition them for slab objects.
- If memory overhead becomes an issue, we could introduce a shrinker
to free empty slabs (and thus releasing underlying percpu memory chunks).
Patch Sequence
==============
Patch #1 refactors freelist_shuffle() to allow the slab constructor to
fail in the next patch.
Patch #2 allows the slab constructor fail.
Patch #3 implements the slab destructor feature.
Patch #4 converts net/sched/act_api to use the slab ctor/dtor pair.
Patch #5, #6 implements APIs to charge and uncharge percpu memory and
percpu counter.
Patch #7 converts mm_struct to use the slab ctor/dtor pair.
Known issues with MM debug features
===================================
The slab destructor feature is not yet compatible with KASAN, KMEMLEAK,
and DEBUG_OBJECTS.
KASAN reports an error when a percpu counter is inserted into the
percpu_counters linked list because the counter has not been allocated
yet.
DEBUG_OBJECTS and KMEMLEAK complain when the slab object is freed, while
the associated percpu memory is still resident in memory.
I don't expect fixing these issues to be too difficult, but I need to
think a little bit to fix it.
[1] https://lore.kernel.org/linux-mm/CAGudoHFc+Km-3usiy4Wdm1JkM+YjCgD9A8dDKQ06pZP070f1ig@mail.gmail.com
[2] https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-next
[3] https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3
[4] https://lore.kernel.org/netdev/vbfmunui7dm.fsf@mellanox.com
Harry Yoo (7):
mm/slab: refactor freelist shuffle
treewide, slab: allow slab constructor to return an error
mm/slab: revive the destructor feature in slab allocator
net/sched/act_api: use slab ctor/dtor to reduce contention on pcpu
alloc
mm/percpu: allow (un)charging objects without alloc/free
lib/percpu_counter: allow (un)charging percpu counters without
alloc/free
kernel/fork: improve exec() throughput with slab ctor/dtor pair
arch/powerpc/include/asm/svm.h | 2 +-
arch/powerpc/kvm/book3s_64_mmu_radix.c | 3 +-
arch/powerpc/mm/init-common.c | 3 +-
arch/powerpc/platforms/cell/spufs/inode.c | 3 +-
arch/powerpc/platforms/pseries/setup.c | 2 +-
arch/powerpc/platforms/pseries/svm.c | 4 +-
arch/sh/mm/pgtable.c | 3 +-
arch/sparc/mm/tsb.c | 8 +-
block/bdev.c | 3 +-
drivers/dax/super.c | 3 +-
drivers/gpu/drm/i915/i915_request.c | 3 +-
drivers/misc/lkdtm/heap.c | 12 +--
drivers/usb/mon/mon_text.c | 5 +-
fs/9p/v9fs.c | 3 +-
fs/adfs/super.c | 3 +-
fs/affs/super.c | 3 +-
fs/afs/super.c | 5 +-
fs/befs/linuxvfs.c | 3 +-
fs/bfs/inode.c | 3 +-
fs/btrfs/inode.c | 3 +-
fs/ceph/super.c | 3 +-
fs/coda/inode.c | 3 +-
fs/debugfs/inode.c | 3 +-
fs/dlm/lowcomms.c | 3 +-
fs/ecryptfs/main.c | 5 +-
fs/efs/super.c | 3 +-
fs/erofs/super.c | 3 +-
fs/exfat/cache.c | 3 +-
fs/exfat/super.c | 3 +-
fs/ext2/super.c | 3 +-
fs/ext4/super.c | 3 +-
fs/fat/cache.c | 3 +-
fs/fat/inode.c | 3 +-
fs/fuse/inode.c | 3 +-
fs/gfs2/main.c | 9 +-
fs/hfs/super.c | 3 +-
fs/hfsplus/super.c | 3 +-
fs/hpfs/super.c | 3 +-
fs/hugetlbfs/inode.c | 3 +-
fs/inode.c | 3 +-
fs/isofs/inode.c | 3 +-
fs/jffs2/super.c | 3 +-
fs/jfs/super.c | 3 +-
fs/minix/inode.c | 3 +-
fs/nfs/inode.c | 3 +-
fs/nfs/nfs42xattr.c | 3 +-
fs/nilfs2/super.c | 6 +-
fs/ntfs3/super.c | 3 +-
fs/ocfs2/dlmfs/dlmfs.c | 3 +-
fs/ocfs2/super.c | 3 +-
fs/openpromfs/inode.c | 3 +-
fs/orangefs/super.c | 3 +-
fs/overlayfs/super.c | 3 +-
fs/pidfs.c | 3 +-
fs/proc/inode.c | 3 +-
fs/qnx4/inode.c | 3 +-
fs/qnx6/inode.c | 3 +-
fs/romfs/super.c | 3 +-
fs/smb/client/cifsfs.c | 3 +-
fs/squashfs/super.c | 3 +-
fs/tracefs/inode.c | 3 +-
fs/ubifs/super.c | 3 +-
fs/udf/super.c | 3 +-
fs/ufs/super.c | 3 +-
fs/userfaultfd.c | 3 +-
fs/vboxsf/super.c | 3 +-
fs/xfs/xfs_super.c | 3 +-
include/linux/mm_types.h | 40 ++++++---
include/linux/percpu.h | 10 +++
include/linux/percpu_counter.h | 2 +
include/linux/slab.h | 21 +++--
ipc/mqueue.c | 3 +-
kernel/fork.c | 65 +++++++++-----
kernel/rcu/refscale.c | 3 +-
lib/percpu_counter.c | 25 ++++++
lib/radix-tree.c | 3 +-
lib/test_meminit.c | 3 +-
mm/kfence/kfence_test.c | 5 +-
mm/percpu.c | 79 ++++++++++------
mm/rmap.c | 3 +-
mm/shmem.c | 3 +-
mm/slab.h | 11 +--
mm/slab_common.c | 43 +++++----
mm/slub.c | 105 ++++++++++++++++------
net/sched/act_api.c | 82 +++++++++++------
net/socket.c | 3 +-
net/sunrpc/rpc_pipe.c | 3 +-
security/integrity/ima/ima_iint.c | 3 +-
88 files changed, 518 insertions(+), 226 deletions(-)
--
2.43.0
Powered by blists - more mailing lists