Message-ID: <aAoLKVwC5JCe7fbv@harry>
Date: Thu, 24 Apr 2025 18:58:01 +0900
From: Harry Yoo <harry.yoo@...cle.com>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: Vlastimil Babka <vbabka@...e.cz>, Christoph Lameter <cl@...two.org>,
David Rientjes <rientjes@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>,
Jamal Hadi Salim <jhs@...atatu.com>,
Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us>,
Vlad Buslov <vladbu@...dia.com>,
Yevgeny Kliteynik <kliteyn@...dia.com>, Jan Kara <jack@...e.cz>,
Byungchul Park <byungchul@...com>, linux-mm@...ck.org,
netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the
percpu allocator scalability problem
On Thu, Apr 24, 2025 at 11:29:13AM +0200, Mateusz Guzik wrote:
> On Thu, Apr 24, 2025 at 10:08 AM Harry Yoo <harry.yoo@...cle.com> wrote:
> >
> > Overview
> > ========
> >
> > The slab destructor feature existed in the early days of the slab
> > allocator(s). It was removed by commit c59def9f222d ("Slab allocators:
> > Drop support for destructors") in 2007 due to a lack of serious use
> > cases at that time.
> >
> > Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> > constructor/destructor pair to mitigate the global serialization point
> > (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> > percpu memory during its lifetime.
> >
> > Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> > so each allocate–free cycle requires two expensive acquisitions and
> > releases of that mutex.
> >
> > We can mitigate this contention by retaining the percpu regions after
> > the object is freed and releasing them only when the backing slab pages
> > are freed.
> >
> > With slab constructors and destructors, this works as follows: the
> > constructor allocates the percpu memory, and the destructor frees it
> > when the slab pages are reclaimed. This slightly alters the
> > constructor's semantics, as it can now fail.
> >
> > This series is functional (although not compatible with MM debug
> > features yet), but still far from perfect. I’m actively refining it and
> > would appreciate early feedback before I improve it further. :)
> >
>
> Thanks for looking into this.
You're welcome. Thanks for the proposal.
> The dtor thing poses a potential problem where a dtor acquiring
> arbitrary locks can result in a deadlock during memory reclaim.
AFAICT, MM does not reclaim slab memory unless a shrinker is registered
to reclaim it. Or am I missing something?
Hmm, let's say it does anyway; then is this what you're worried about?
someone requests percpu memory
-> percpu allocator takes a lock (e.g., pcpu_alloc_mutex)
-> allocates pages from buddy
-> buddy reclaims slab memory
-> slab destructor takes pcpu_alloc_mutex (deadlock!)
> So for this to be viable one needs to ensure that in the worst case
> this only ever takes leaf-locks (i.e., locks which are last in any
> dependency chain -- no locks are being taken if you hold one).
Oh, then you can't allocate memory while holding pcpu_lock or
pcpu_alloc_mutex?
> This
> needs to demonstrate the percpu thing qualifies or needs to refactor
> it to that extent.
>
> > This series is based on slab/for-next [2].
> >
> > Performance Improvement
> > =======================
> >
> > I measured the benefit of this series for two different users:
> > exec() and tc filter insertion/removal.
> >
> > exec() throughput
> > -----------------
> >
> > The performance of exec() is important when short-lived processes are
> > frequently created. For example: shell-heavy workloads and running many
> > test cases [3].
> >
> > I measured exec() throughput with a microbenchmark:
> > - a 33% exec() throughput gain on a 2-socket machine with 192 CPUs,
> > - a 4.56% gain on a desktop with 24 hardware threads, and
> > - even a 4% gain in single-threaded exec() throughput.
> >
> > Further investigation showed that this was due to the overhead of
> > acquiring/releasing pcpu_alloc_mutex and its contention.
> >
> > See patch 7 for more detail on the experiment.
> >
> > Traffic Filter Insertion and Removal
> > ------------------------------------
> >
> > Each tc filter allocates three percpu memory regions per tc_action object,
> > so frequently inserting and removing filters contends heavily on the same
> > mutex.
> >
> > In the Linux kernel tools/testing tc-filter benchmark (see patch 4 for
> > more detail), I observed a 26% reduction in system time and much less
> > contention on pcpu_alloc_mutex with this series.
> >
> > I saw in old mailing list threads [4] that Mellanox (now NVIDIA)
> > engineers cared about tc filter insertion rates; these changes may
> > still benefit the workloads they run today.
> >
> > Feedback Needed from Percpu Allocator Folks
> > ===========================================
> >
> > As the percpu allocator is directly affected by this series, this work
> > will need support from the percpu allocator maintainers, and we need to
> > address their concerns.
> >
> > They will probably say, "This is a percpu memory allocator scalability
> > issue, and we need to make the percpu allocator itself scalable." I
> > don't know.
> >
> > What do you say?
> >
> > Some hanging thoughts:
> > - Tackling the problem on the slab side is much simpler, because the slab
> > allocator already caches objects per CPU. Re-creating similar logic
> > inside the percpu allocator would be redundant.
> >
> > Also, since this is opt-in per slab cache, other percpu allocator
> > users remain unaffected.
> >
> > - If fragmentation is a concern, we could probably allocate larger percpu
> > chunks and partition them for slab objects.
> >
> > - If memory overhead becomes an issue, we could introduce a shrinker
> > to free empty slabs (thus releasing the underlying percpu memory chunks).
> >
> > Patch Sequence
> > ==============
> >
> > Patch #1 refactors freelist_shuffle() to allow the slab constructor to
> > fail in the next patch.
> >
> > Patch #2 allows the slab constructor to fail.
> >
> > Patch #3 implements the slab destructor feature.
> >
> > Patch #4 converts net/sched/act_api to use the slab ctor/dtor pair.
> >
> > Patches #5 and #6 implement APIs to charge and uncharge percpu memory
> > and percpu counters.
> >
> > Patch #7 converts mm_struct to use the slab ctor/dtor pair.
> >
> > Known issues with MM debug features
> > ===================================
> >
> > The slab destructor feature is not yet compatible with KASAN, KMEMLEAK,
> > and DEBUG_OBJECTS.
> >
> > KASAN reports an error when a percpu counter is inserted into the
> > percpu_counters linked list because the counter has not been allocated
> > yet.
> >
> > DEBUG_OBJECTS and KMEMLEAK complain when the slab object is freed while
> > the associated percpu memory is still resident in memory.
> >
> > I don't expect fixing these issues to be too difficult, but I need to
> > think a bit about how to fix them.
> >
> > [1] https://lore.kernel.org/linux-mm/CAGudoHFc+Km-3usiy4Wdm1JkM+YjCgD9A8dDKQ06pZP070f1ig@mail.gmail.com
> >
> > [2] https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-next
> >
> > [3] https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3
> >
> > [4] https://lore.kernel.org/netdev/vbfmunui7dm.fsf@mellanox.com
> >
> > Harry Yoo (7):
> > mm/slab: refactor freelist shuffle
> > treewide, slab: allow slab constructor to return an error
> > mm/slab: revive the destructor feature in slab allocator
> > net/sched/act_api: use slab ctor/dtor to reduce contention on pcpu
> > alloc
> > mm/percpu: allow (un)charging objects without alloc/free
> > lib/percpu_counter: allow (un)charging percpu counters without
> > alloc/free
> > kernel/fork: improve exec() throughput with slab ctor/dtor pair
> >
> > arch/powerpc/include/asm/svm.h | 2 +-
> > arch/powerpc/kvm/book3s_64_mmu_radix.c | 3 +-
> > arch/powerpc/mm/init-common.c | 3 +-
> > arch/powerpc/platforms/cell/spufs/inode.c | 3 +-
> > arch/powerpc/platforms/pseries/setup.c | 2 +-
> > arch/powerpc/platforms/pseries/svm.c | 4 +-
> > arch/sh/mm/pgtable.c | 3 +-
> > arch/sparc/mm/tsb.c | 8 +-
> > block/bdev.c | 3 +-
> > drivers/dax/super.c | 3 +-
> > drivers/gpu/drm/i915/i915_request.c | 3 +-
> > drivers/misc/lkdtm/heap.c | 12 +--
> > drivers/usb/mon/mon_text.c | 5 +-
> > fs/9p/v9fs.c | 3 +-
> > fs/adfs/super.c | 3 +-
> > fs/affs/super.c | 3 +-
> > fs/afs/super.c | 5 +-
> > fs/befs/linuxvfs.c | 3 +-
> > fs/bfs/inode.c | 3 +-
> > fs/btrfs/inode.c | 3 +-
> > fs/ceph/super.c | 3 +-
> > fs/coda/inode.c | 3 +-
> > fs/debugfs/inode.c | 3 +-
> > fs/dlm/lowcomms.c | 3 +-
> > fs/ecryptfs/main.c | 5 +-
> > fs/efs/super.c | 3 +-
> > fs/erofs/super.c | 3 +-
> > fs/exfat/cache.c | 3 +-
> > fs/exfat/super.c | 3 +-
> > fs/ext2/super.c | 3 +-
> > fs/ext4/super.c | 3 +-
> > fs/fat/cache.c | 3 +-
> > fs/fat/inode.c | 3 +-
> > fs/fuse/inode.c | 3 +-
> > fs/gfs2/main.c | 9 +-
> > fs/hfs/super.c | 3 +-
> > fs/hfsplus/super.c | 3 +-
> > fs/hpfs/super.c | 3 +-
> > fs/hugetlbfs/inode.c | 3 +-
> > fs/inode.c | 3 +-
> > fs/isofs/inode.c | 3 +-
> > fs/jffs2/super.c | 3 +-
> > fs/jfs/super.c | 3 +-
> > fs/minix/inode.c | 3 +-
> > fs/nfs/inode.c | 3 +-
> > fs/nfs/nfs42xattr.c | 3 +-
> > fs/nilfs2/super.c | 6 +-
> > fs/ntfs3/super.c | 3 +-
> > fs/ocfs2/dlmfs/dlmfs.c | 3 +-
> > fs/ocfs2/super.c | 3 +-
> > fs/openpromfs/inode.c | 3 +-
> > fs/orangefs/super.c | 3 +-
> > fs/overlayfs/super.c | 3 +-
> > fs/pidfs.c | 3 +-
> > fs/proc/inode.c | 3 +-
> > fs/qnx4/inode.c | 3 +-
> > fs/qnx6/inode.c | 3 +-
> > fs/romfs/super.c | 3 +-
> > fs/smb/client/cifsfs.c | 3 +-
> > fs/squashfs/super.c | 3 +-
> > fs/tracefs/inode.c | 3 +-
> > fs/ubifs/super.c | 3 +-
> > fs/udf/super.c | 3 +-
> > fs/ufs/super.c | 3 +-
> > fs/userfaultfd.c | 3 +-
> > fs/vboxsf/super.c | 3 +-
> > fs/xfs/xfs_super.c | 3 +-
> > include/linux/mm_types.h | 40 ++++++---
> > include/linux/percpu.h | 10 +++
> > include/linux/percpu_counter.h | 2 +
> > include/linux/slab.h | 21 +++--
> > ipc/mqueue.c | 3 +-
> > kernel/fork.c | 65 +++++++++-----
> > kernel/rcu/refscale.c | 3 +-
> > lib/percpu_counter.c | 25 ++++++
> > lib/radix-tree.c | 3 +-
> > lib/test_meminit.c | 3 +-
> > mm/kfence/kfence_test.c | 5 +-
> > mm/percpu.c | 79 ++++++++++------
> > mm/rmap.c | 3 +-
> > mm/shmem.c | 3 +-
> > mm/slab.h | 11 +--
> > mm/slab_common.c | 43 +++++----
> > mm/slub.c | 105 ++++++++++++++++------
> > net/sched/act_api.c | 82 +++++++++++------
> > net/socket.c | 3 +-
> > net/sunrpc/rpc_pipe.c | 3 +-
> > security/integrity/ima/ima_iint.c | 3 +-
> > 88 files changed, 518 insertions(+), 226 deletions(-)
> >
> > --
> > 2.43.0
> >
>
>
> --
> Mateusz Guzik <mjguzik gmail.com>
--
Cheers,
Harry / Hyeonggon