Message-ID: <aAoLKVwC5JCe7fbv@harry>
Date: Thu, 24 Apr 2025 18:58:01 +0900
From: Harry Yoo <harry.yoo@...cle.com>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: Vlastimil Babka <vbabka@...e.cz>, Christoph Lameter <cl@...two.org>,
        David Rientjes <rientjes@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>,
        Jamal Hadi Salim <jhs@...atatu.com>,
        Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us>,
        Vlad Buslov <vladbu@...dia.com>,
        Yevgeny Kliteynik <kliteyn@...dia.com>, Jan Kara <jack@...e.cz>,
        Byungchul Park <byungchul@...com>, linux-mm@...ck.org,
        netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the
 percpu allocator scalability problem

On Thu, Apr 24, 2025 at 11:29:13AM +0200, Mateusz Guzik wrote:
> On Thu, Apr 24, 2025 at 10:08 AM Harry Yoo <harry.yoo@...cle.com> wrote:
> >
> > Overview
> > ========
> >
> > The slab destructor feature existed in the early days of the slab
> > allocator(s). It was removed by commit c59def9f222d ("Slab allocators:
> > Drop support for destructors") in 2007 due to a lack of serious use
> > cases at the time.
> >
> > Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> > constructor/destructor pair to mitigate the global serialization point
> > (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> > percpu memory during its lifetime.
> >
> > Consider mm_struct: it allocates two percpu regions (mm_cid and
> > rss_stat), so each allocate-free cycle requires two expensive
> > acquire/release pairs on that mutex.
> >
> > We can mitigate this contention by retaining the percpu regions after
> > the object is freed and releasing them only when the backing slab pages
> > are freed.
> >
> > This maps naturally onto slab constructors and destructors: the
> > constructor allocates the percpu memory, and the destructor frees it
> > when the slab pages are reclaimed. This slightly alters the
> > constructor's semantics, as it can now fail.
> >
> > This series is functional (although not compatible with MM debug
> > features yet), but still far from perfect. I’m actively refining it and
> > would appreciate early feedback before I improve it further. :)
> >
> 
> Thanks for looking into this.

You're welcome. Thanks for the proposal.

> The dtor thing poses a potential problem where a dtor acquiring
> arbitrary locks can result in a deadlock during memory reclaim.

AFAICT, MM does not reclaim slab memory unless we register a shrinker
to reclaim it. Or am I missing something?

Hmm, let's say it does anyway; is this the scenario you're worried about?

someone requests percpu memory
-> percpu allocator takes a lock (e.g., pcpu_alloc_mutex)
-> allocates pages from buddy
-> buddy reclaims slab memory
-> slab destructor tries to acquire pcpu_alloc_mutex (deadlock!)

> So for this to be viable one needs to ensure that in the worst case
> this only ever takes leaf-locks (i.e., locks which are last in any
> dependency chain -- no locks are being taken if you hold one).

Oh, then you can't allocate memory while holding pcpu_lock or
pcpu_alloc_mutex?

> This
> needs to demonstrate the percpu thing qualifies or needs to refactor
> it to that extent.
>
> > This series is based on slab/for-next [2].
> >
> > Performance Improvement
> > =======================
> >
> > I measured the benefit of this series for two different users:
> > exec() and tc filter insertion/removal.
> >
> > exec() throughput
> > -----------------
> >
> > The performance of exec() is important when short-lived processes are
> > frequently created. For example: shell-heavy workloads and running many
> > test cases [3].
> >
> > I measured exec() throughput with a microbenchmark:
> >   - a 33% exec() throughput gain on a 2-socket machine with 192 CPUs,
> >   - a 4.56% gain on a desktop with 24 hardware threads, and
> >   - even a 4% gain in single-threaded exec() throughput.
> >
> > Further investigation showed that this was due to the overhead of
> > acquiring/releasing pcpu_alloc_mutex and its contention.
> >
> > See patch 7 for more detail on the experiment.
> >
> > Traffic Filter Insertion and Removal
> > ------------------------------------
> >
> > Each tc filter allocates three percpu memory regions per tc_action
> > object, so frequently inserting and removing filters contends heavily
> > on the same mutex.
> >
> > In the Linux-kernel tools/testing tc-filter benchmark (see patch 4 for
> > more detail), I observed a 26% reduction in system time and observed
> > much less contention on pcpu_alloc_mutex with this series.
> >
> > I saw in old mailing list threads Mellanox (now NVIDIA) engineers cared
> > about tc filter insertion rate; these changes may still benefit
> > workloads they run today.
> >
> > Feedback Needed from Percpu Allocator Folks
> > ===========================================
> >
> > As the percpu allocator is directly affected by this series, this work
> > will need support from the percpu allocator maintainers, and we need to
> > address their concerns.
> >
> > They will probably say, "This is a percpu memory allocator scalability
> > issue, and we need to make the allocator itself scalable"? I don't know.
> >
> > What do you say?
> >
> > Some hanging thoughts:
> > - Tackling the problem on the slab side is much simpler, because the slab
> >   allocator already caches objects per CPU. Re-creating similar logic
> >   inside the percpu allocator would be redundant.
> >
> >   Also, since this is opt-in per slab cache, other percpu allocator
> >   users remain unaffected.
> >
> > - If fragmentation is a concern, we could probably allocate larger percpu
> >   chunks and partition them for slab objects.
> >
> > - If memory overhead becomes an issue, we could introduce a shrinker
> >   to free empty slabs (and thus release the underlying percpu memory
> >   chunks).
> >
> > Patch Sequence
> > ==============
> >
> > Patch #1 refactors freelist_shuffle() to allow the slab constructor to
> > fail in the next patch.
> >
> > Patch #2 allows the slab constructor to fail.
> >
> > Patch #3 implements the slab destructor feature.
> >
> > Patch #4 converts net/sched/act_api to use the slab ctor/dtor pair.
> >
> > Patches #5 and #6 implement APIs to charge and uncharge percpu memory
> > and percpu counters.
> >
> > Patch #7 converts mm_struct to use the slab ctor/dtor pair.
> >
> > Known issues with MM debug features
> > ===================================
> >
> > The slab destructor feature is not yet compatible with KASAN, KMEMLEAK,
> > and DEBUG_OBJECTS.
> >
> > KASAN reports an error when a percpu counter is inserted into the
> > percpu_counters linked list because the counter has not been allocated
> > yet.
> >
> > DEBUG_OBJECTS and KMEMLEAK complain when the slab object is freed, while
> > the associated percpu memory is still resident in memory.
> >
> > I don't expect fixing these issues to be too difficult, but I need to
> > think a bit more about how to fix them.
> >
> > [1] https://lore.kernel.org/linux-mm/CAGudoHFc+Km-3usiy4Wdm1JkM+YjCgD9A8dDKQ06pZP070f1ig@mail.gmail.com
> >
> > [2] https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-next
> >
> > [3] https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3
> >
> > [4] https://lore.kernel.org/netdev/vbfmunui7dm.fsf@mellanox.com
> >
> > Harry Yoo (7):
> >   mm/slab: refactor freelist shuffle
> >   treewide, slab: allow slab constructor to return an error
> >   mm/slab: revive the destructor feature in slab allocator
> >   net/sched/act_api: use slab ctor/dtor to reduce contention on pcpu
> >     alloc
> >   mm/percpu: allow (un)charging objects without alloc/free
> >   lib/percpu_counter: allow (un)charging percpu counters without
> >     alloc/free
> >   kernel/fork: improve exec() throughput with slab ctor/dtor pair
> >
> >  arch/powerpc/include/asm/svm.h            |   2 +-
> >  arch/powerpc/kvm/book3s_64_mmu_radix.c    |   3 +-
> >  arch/powerpc/mm/init-common.c             |   3 +-
> >  arch/powerpc/platforms/cell/spufs/inode.c |   3 +-
> >  arch/powerpc/platforms/pseries/setup.c    |   2 +-
> >  arch/powerpc/platforms/pseries/svm.c      |   4 +-
> >  arch/sh/mm/pgtable.c                      |   3 +-
> >  arch/sparc/mm/tsb.c                       |   8 +-
> >  block/bdev.c                              |   3 +-
> >  drivers/dax/super.c                       |   3 +-
> >  drivers/gpu/drm/i915/i915_request.c       |   3 +-
> >  drivers/misc/lkdtm/heap.c                 |  12 +--
> >  drivers/usb/mon/mon_text.c                |   5 +-
> >  fs/9p/v9fs.c                              |   3 +-
> >  fs/adfs/super.c                           |   3 +-
> >  fs/affs/super.c                           |   3 +-
> >  fs/afs/super.c                            |   5 +-
> >  fs/befs/linuxvfs.c                        |   3 +-
> >  fs/bfs/inode.c                            |   3 +-
> >  fs/btrfs/inode.c                          |   3 +-
> >  fs/ceph/super.c                           |   3 +-
> >  fs/coda/inode.c                           |   3 +-
> >  fs/debugfs/inode.c                        |   3 +-
> >  fs/dlm/lowcomms.c                         |   3 +-
> >  fs/ecryptfs/main.c                        |   5 +-
> >  fs/efs/super.c                            |   3 +-
> >  fs/erofs/super.c                          |   3 +-
> >  fs/exfat/cache.c                          |   3 +-
> >  fs/exfat/super.c                          |   3 +-
> >  fs/ext2/super.c                           |   3 +-
> >  fs/ext4/super.c                           |   3 +-
> >  fs/fat/cache.c                            |   3 +-
> >  fs/fat/inode.c                            |   3 +-
> >  fs/fuse/inode.c                           |   3 +-
> >  fs/gfs2/main.c                            |   9 +-
> >  fs/hfs/super.c                            |   3 +-
> >  fs/hfsplus/super.c                        |   3 +-
> >  fs/hpfs/super.c                           |   3 +-
> >  fs/hugetlbfs/inode.c                      |   3 +-
> >  fs/inode.c                                |   3 +-
> >  fs/isofs/inode.c                          |   3 +-
> >  fs/jffs2/super.c                          |   3 +-
> >  fs/jfs/super.c                            |   3 +-
> >  fs/minix/inode.c                          |   3 +-
> >  fs/nfs/inode.c                            |   3 +-
> >  fs/nfs/nfs42xattr.c                       |   3 +-
> >  fs/nilfs2/super.c                         |   6 +-
> >  fs/ntfs3/super.c                          |   3 +-
> >  fs/ocfs2/dlmfs/dlmfs.c                    |   3 +-
> >  fs/ocfs2/super.c                          |   3 +-
> >  fs/openpromfs/inode.c                     |   3 +-
> >  fs/orangefs/super.c                       |   3 +-
> >  fs/overlayfs/super.c                      |   3 +-
> >  fs/pidfs.c                                |   3 +-
> >  fs/proc/inode.c                           |   3 +-
> >  fs/qnx4/inode.c                           |   3 +-
> >  fs/qnx6/inode.c                           |   3 +-
> >  fs/romfs/super.c                          |   3 +-
> >  fs/smb/client/cifsfs.c                    |   3 +-
> >  fs/squashfs/super.c                       |   3 +-
> >  fs/tracefs/inode.c                        |   3 +-
> >  fs/ubifs/super.c                          |   3 +-
> >  fs/udf/super.c                            |   3 +-
> >  fs/ufs/super.c                            |   3 +-
> >  fs/userfaultfd.c                          |   3 +-
> >  fs/vboxsf/super.c                         |   3 +-
> >  fs/xfs/xfs_super.c                        |   3 +-
> >  include/linux/mm_types.h                  |  40 ++++++---
> >  include/linux/percpu.h                    |  10 +++
> >  include/linux/percpu_counter.h            |   2 +
> >  include/linux/slab.h                      |  21 +++--
> >  ipc/mqueue.c                              |   3 +-
> >  kernel/fork.c                             |  65 +++++++++-----
> >  kernel/rcu/refscale.c                     |   3 +-
> >  lib/percpu_counter.c                      |  25 ++++++
> >  lib/radix-tree.c                          |   3 +-
> >  lib/test_meminit.c                        |   3 +-
> >  mm/kfence/kfence_test.c                   |   5 +-
> >  mm/percpu.c                               |  79 ++++++++++------
> >  mm/rmap.c                                 |   3 +-
> >  mm/shmem.c                                |   3 +-
> >  mm/slab.h                                 |  11 +--
> >  mm/slab_common.c                          |  43 +++++----
> >  mm/slub.c                                 | 105 ++++++++++++++++------
> >  net/sched/act_api.c                       |  82 +++++++++++------
> >  net/socket.c                              |   3 +-
> >  net/sunrpc/rpc_pipe.c                     |   3 +-
> >  security/integrity/ima/ima_iint.c         |   3 +-
> >  88 files changed, 518 insertions(+), 226 deletions(-)
> >
> > --
> > 2.43.0
> >
> 
> 
> -- 
> Mateusz Guzik <mjguzik gmail.com>

-- 
Cheers,
Harry / Hyeonggon
