Message-ID: <CAGudoHFaQHn4X+C9GLDt5sTVD=2=PgWX-KvtBKSdqNJSD_p1sA@mail.gmail.com>
Date: Thu, 24 Apr 2025 17:00:53 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Harry Yoo <harry.yoo@...cle.com>
Cc: Vlastimil Babka <vbabka@...e.cz>, Christoph Lameter <cl@...two.org>, David Rientjes <rientjes@...gle.com>, 
	Andrew Morton <akpm@...ux-foundation.org>, Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>, 
	Jamal Hadi Salim <jhs@...atatu.com>, Cong Wang <xiyou.wangcong@...il.com>, 
	Jiri Pirko <jiri@...nulli.us>, Vlad Buslov <vladbu@...dia.com>, 
	Yevgeny Kliteynik <kliteyn@...dia.com>, Jan Kara <jack@...e.cz>, Byungchul Park <byungchul@...com>, 
	linux-mm@...ck.org, netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu
 allocator scalability problem

On Thu, Apr 24, 2025 at 11:58 AM Harry Yoo <harry.yoo@...cle.com> wrote:
>
> On Thu, Apr 24, 2025 at 11:29:13AM +0200, Mateusz Guzik wrote:
> > On Thu, Apr 24, 2025 at 10:08 AM Harry Yoo <harry.yoo@...cle.com> wrote:
> > >
> > > Overview
> > > ========
> > >
> > > The slab destructor feature existed in the early days of the slab
> > > allocators. It was removed by commit c59def9f222d ("Slab allocators:
> > > Drop support for destructors") in 2007 due to a lack of serious use
> > > cases at the time.
> > >
> > > Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> > > constructor/destructor pair to mitigate the global serialization point
> > > (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> > > percpu memory during its lifetime.
> > >
> > > Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> > > so each allocate–free cycle requires two expensive acquire/release
> > > rounds on that mutex.
> > >
> > > We can mitigate this contention by retaining the percpu regions after
> > > the object is freed and releasing them only when the backing slab pages
> > > are freed.
> > >
> > > With slab constructors and destructors, this works as follows: the
> > > constructor allocates the percpu memory, and the destructor frees it
> > > when the slab pages are reclaimed. This slightly alters the
> > > constructor's semantics, as it can now fail.
> > >
> > > This series is functional (although not compatible with MM debug
> > > features yet), but still far from perfect. I’m actively refining it and
> > > would appreciate early feedback before I improve it further. :)
> > >
> >
> > Thanks for looking into this.
>
> You're welcome. Thanks for the proposal.
>
> > The dtor thing poses a potential problem where a dtor acquiring
> > arbitrary locks can result in a deadlock during memory reclaim.
>
> AFAICT, MM does not reclaim slab memory unless we register a shrinker
> to reclaim it. Or am I missing something?
>
> Hmm let's say it does anyway, then is this what you worry about?
>
> someone requests percpu memory
> -> percpu allocator takes a lock (e.g., pcpu_alloc_mutex)
> -> allocates pages from buddy
> -> buddy reclaims slab memory
> -> slab destructor takes pcpu_alloc_mutex (deadlock!)
>
> > So for this to be viable one needs to ensure that in the worst case
> > this only ever takes leaf-locks (i.e., locks which are last in any
> > dependency chain -- no locks are being taken if you hold one).
>
> Oh, then you can't allocate memory while holding pcpu_lock or
> pcpu_alloc_mutex?
>

It should be perfectly fine to allocate memory with pcpu_alloc_mutex
held, as it is not an inherent part of reclaim.

The part used by the dtor would be the spinlock already present in the
percpu allocator.

The idea would be that the mutex-protected path preps everything as
needed, while synchronisation with freeing the pcpu areas would only
need the leaf spinlock.

So no issues there, provided some care is employed.

> > This series needs to demonstrate that the percpu allocator qualifies,
> > or it needs to be refactored to that extent.
> >
> > > This series is based on slab/for-next [2].
> > >
> > > Performance Improvement
> > > =======================
> > >
> > > I measured the benefit of this series for two different users:
> > > exec() and tc filter insertion/removal.
> > >
> > > exec() throughput
> > > -----------------
> > >
> > > The performance of exec() is important when short-lived processes are
> > > frequently created. For example: shell-heavy workloads and running many
> > > test cases [3].
> > >
> > > I measured exec() throughput with a microbenchmark:
> > >   - 33% exec() throughput gain on a 2-socket machine with 192 CPUs,
> > >   - 4.56% gain on a desktop with 24 hardware threads, and
> > >   - even a 4% gain in single-threaded exec() throughput.
> > >
> > > Further investigation showed that this was due to the overhead of
> > > acquiring/releasing pcpu_alloc_mutex and its contention.
> > >
> > > See patch 7 for more detail on the experiment.
> > >
> > > Traffic Filter Insertion and Removal
> > > ------------------------------------
> > >
> > > Each tc filter allocates three percpu memory regions per tc_action object,
> > > so frequently inserting and removing filters contend heavily on the same
> > > mutex.
> > >
> > > In the Linux kernel tools/testing tc-filter benchmark (see patch 4 for
> > > more detail), I observed a 26% reduction in system time and much less
> > > contention on pcpu_alloc_mutex with this series.
> > >
> > > I saw in old mailing list threads [4] that Mellanox (now NVIDIA)
> > > engineers cared about the tc filter insertion rate; these changes may
> > > still benefit workloads they run today.
> > >
> > > Feedback Needed from Percpu Allocator Folks
> > > ===========================================
> > >
> > > As the percpu allocator is directly affected by this series, this work
> > > will need support from the percpu allocator maintainers, and we need to
> > > address their concerns.
> > >
> > > They will probably say, "This is a percpu memory allocator scalability
> > > issue, and we need to make the allocator itself scalable." I don't know.
> > >
> > > What do you say?
> > >
> > > Some hanging thoughts:
> > > - Tackling the problem on the slab side is much simpler, because the slab
> > >   allocator already caches objects per CPU. Re-creating similar logic
> > >   inside the percpu allocator would be redundant.
> > >
> > >   Also, since this is opt-in per slab cache, other percpu allocator
> > >   users remain unaffected.
> > >
> > > - If fragmentation is a concern, we could probably allocate larger percpu
> > >   chunks and partition them for slab objects.
> > >
> > > - If memory overhead becomes an issue, we could introduce a shrinker
> > >   to free empty slabs (and thus release the underlying percpu memory chunks).
> > >
> > > Patch Sequence
> > > ==============
> > >
> > > Patch #1 refactors freelist_shuffle() to allow the slab constructor to
> > > fail in the next patch.
> > >
> > > Patch #2 allows the slab constructor to fail.
> > >
> > > Patch #3 implements the slab destructor feature.
> > >
> > > Patch #4 converts net/sched/act_api to use the slab ctor/dtor pair.
> > >
> > > Patches #5 and #6 implement APIs to charge and uncharge percpu memory
> > > and percpu counters.
> > >
> > > Patch #7 converts mm_struct to use the slab ctor/dtor pair.
> > >
> > > Known issues with MM debug features
> > > ===================================
> > >
> > > The slab destructor feature is not yet compatible with KASAN, KMEMLEAK,
> > > and DEBUG_OBJECTS.
> > >
> > > KASAN reports an error when a percpu counter is inserted into the
> > > percpu_counters linked list because the counter has not been allocated
> > > yet.
> > >
> > > DEBUG_OBJECTS and KMEMLEAK complain when the slab object is freed while
> > > the associated percpu memory is still resident.
> > >
> > > I don't expect fixing these issues to be too difficult, but I need to
> > > think a bit about how to fix them.
> > >
> > > [1] https://lore.kernel.org/linux-mm/CAGudoHFc+Km-3usiy4Wdm1JkM+YjCgD9A8dDKQ06pZP070f1ig@mail.gmail.com
> > >
> > > [2] https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-next
> > >
> > > [3] https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3
> > >
> > > [4] https://lore.kernel.org/netdev/vbfmunui7dm.fsf@mellanox.com
> > >
> > > Harry Yoo (7):
> > >   mm/slab: refactor freelist shuffle
> > >   treewide, slab: allow slab constructor to return an error
> > >   mm/slab: revive the destructor feature in slab allocator
> > >   net/sched/act_api: use slab ctor/dtor to reduce contention on pcpu
> > >     alloc
> > >   mm/percpu: allow (un)charging objects without alloc/free
> > >   lib/percpu_counter: allow (un)charging percpu counters without
> > >     alloc/free
> > >   kernel/fork: improve exec() throughput with slab ctor/dtor pair
> > >
> > >  arch/powerpc/include/asm/svm.h            |   2 +-
> > >  arch/powerpc/kvm/book3s_64_mmu_radix.c    |   3 +-
> > >  arch/powerpc/mm/init-common.c             |   3 +-
> > >  arch/powerpc/platforms/cell/spufs/inode.c |   3 +-
> > >  arch/powerpc/platforms/pseries/setup.c    |   2 +-
> > >  arch/powerpc/platforms/pseries/svm.c      |   4 +-
> > >  arch/sh/mm/pgtable.c                      |   3 +-
> > >  arch/sparc/mm/tsb.c                       |   8 +-
> > >  block/bdev.c                              |   3 +-
> > >  drivers/dax/super.c                       |   3 +-
> > >  drivers/gpu/drm/i915/i915_request.c       |   3 +-
> > >  drivers/misc/lkdtm/heap.c                 |  12 +--
> > >  drivers/usb/mon/mon_text.c                |   5 +-
> > >  fs/9p/v9fs.c                              |   3 +-
> > >  fs/adfs/super.c                           |   3 +-
> > >  fs/affs/super.c                           |   3 +-
> > >  fs/afs/super.c                            |   5 +-
> > >  fs/befs/linuxvfs.c                        |   3 +-
> > >  fs/bfs/inode.c                            |   3 +-
> > >  fs/btrfs/inode.c                          |   3 +-
> > >  fs/ceph/super.c                           |   3 +-
> > >  fs/coda/inode.c                           |   3 +-
> > >  fs/debugfs/inode.c                        |   3 +-
> > >  fs/dlm/lowcomms.c                         |   3 +-
> > >  fs/ecryptfs/main.c                        |   5 +-
> > >  fs/efs/super.c                            |   3 +-
> > >  fs/erofs/super.c                          |   3 +-
> > >  fs/exfat/cache.c                          |   3 +-
> > >  fs/exfat/super.c                          |   3 +-
> > >  fs/ext2/super.c                           |   3 +-
> > >  fs/ext4/super.c                           |   3 +-
> > >  fs/fat/cache.c                            |   3 +-
> > >  fs/fat/inode.c                            |   3 +-
> > >  fs/fuse/inode.c                           |   3 +-
> > >  fs/gfs2/main.c                            |   9 +-
> > >  fs/hfs/super.c                            |   3 +-
> > >  fs/hfsplus/super.c                        |   3 +-
> > >  fs/hpfs/super.c                           |   3 +-
> > >  fs/hugetlbfs/inode.c                      |   3 +-
> > >  fs/inode.c                                |   3 +-
> > >  fs/isofs/inode.c                          |   3 +-
> > >  fs/jffs2/super.c                          |   3 +-
> > >  fs/jfs/super.c                            |   3 +-
> > >  fs/minix/inode.c                          |   3 +-
> > >  fs/nfs/inode.c                            |   3 +-
> > >  fs/nfs/nfs42xattr.c                       |   3 +-
> > >  fs/nilfs2/super.c                         |   6 +-
> > >  fs/ntfs3/super.c                          |   3 +-
> > >  fs/ocfs2/dlmfs/dlmfs.c                    |   3 +-
> > >  fs/ocfs2/super.c                          |   3 +-
> > >  fs/openpromfs/inode.c                     |   3 +-
> > >  fs/orangefs/super.c                       |   3 +-
> > >  fs/overlayfs/super.c                      |   3 +-
> > >  fs/pidfs.c                                |   3 +-
> > >  fs/proc/inode.c                           |   3 +-
> > >  fs/qnx4/inode.c                           |   3 +-
> > >  fs/qnx6/inode.c                           |   3 +-
> > >  fs/romfs/super.c                          |   3 +-
> > >  fs/smb/client/cifsfs.c                    |   3 +-
> > >  fs/squashfs/super.c                       |   3 +-
> > >  fs/tracefs/inode.c                        |   3 +-
> > >  fs/ubifs/super.c                          |   3 +-
> > >  fs/udf/super.c                            |   3 +-
> > >  fs/ufs/super.c                            |   3 +-
> > >  fs/userfaultfd.c                          |   3 +-
> > >  fs/vboxsf/super.c                         |   3 +-
> > >  fs/xfs/xfs_super.c                        |   3 +-
> > >  include/linux/mm_types.h                  |  40 ++++++---
> > >  include/linux/percpu.h                    |  10 +++
> > >  include/linux/percpu_counter.h            |   2 +
> > >  include/linux/slab.h                      |  21 +++--
> > >  ipc/mqueue.c                              |   3 +-
> > >  kernel/fork.c                             |  65 +++++++++-----
> > >  kernel/rcu/refscale.c                     |   3 +-
> > >  lib/percpu_counter.c                      |  25 ++++++
> > >  lib/radix-tree.c                          |   3 +-
> > >  lib/test_meminit.c                        |   3 +-
> > >  mm/kfence/kfence_test.c                   |   5 +-
> > >  mm/percpu.c                               |  79 ++++++++++------
> > >  mm/rmap.c                                 |   3 +-
> > >  mm/shmem.c                                |   3 +-
> > >  mm/slab.h                                 |  11 +--
> > >  mm/slab_common.c                          |  43 +++++----
> > >  mm/slub.c                                 | 105 ++++++++++++++++------
> > >  net/sched/act_api.c                       |  82 +++++++++++------
> > >  net/socket.c                              |   3 +-
> > >  net/sunrpc/rpc_pipe.c                     |   3 +-
> > >  security/integrity/ima/ima_iint.c         |   3 +-
> > >  88 files changed, 518 insertions(+), 226 deletions(-)
> > >
> > > --
> > > 2.43.0
> > >
> >
> >
> > --
> > Mateusz Guzik <mjguzik gmail.com>
>
> --
> Cheers,
> Harry / Hyeonggon



-- 
Mateusz Guzik <mjguzik gmail.com>
