lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGudoHGkNn032RVuJdwYqpzfQgAB8pv=hEzdR1APsFOOSQnq1Q@mail.gmail.com>
Date: Thu, 24 Apr 2025 11:29:13 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Harry Yoo <harry.yoo@...cle.com>
Cc: Vlastimil Babka <vbabka@...e.cz>, Christoph Lameter <cl@...two.org>, David Rientjes <rientjes@...gle.com>, 
	Andrew Morton <akpm@...ux-foundation.org>, Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>, 
	Jamal Hadi Salim <jhs@...atatu.com>, Cong Wang <xiyou.wangcong@...il.com>, 
	Jiri Pirko <jiri@...nulli.us>, Vlad Buslov <vladbu@...dia.com>, 
	Yevgeny Kliteynik <kliteyn@...dia.com>, Jan Kara <jack@...e.cz>, Byungchul Park <byungchul@...com>, 
	linux-mm@...ck.org, netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu
 allocator scalability problem

On Thu, Apr 24, 2025 at 10:08 AM Harry Yoo <harry.yoo@...cle.com> wrote:
>
> Overview
> ========
>
> The slab destructor feature existed in early days of slab allocator(s).
> It was removed by the commit c59def9f222d ("Slab allocators: Drop support
> for destructors") in 2007 due to lack of serious use cases at that time.
>
> Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> constructor/destructor pair to mitigate the global serialization point
> (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> percpu memory during its lifetime.
>
> Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> so each allocate–free cycle requires two expensive acquire/release on
> that mutex.
>
> We can mitigate this contention by retaining the percpu regions after
> the object is freed and releasing them only when the backing slab pages
> are freed.
>
> How to do this with slab constructors and destructors: the constructor
> allocates percpu memory, and the destructor frees it when the slab pages
> are reclaimed; this slightly alters the constructor’s semantics,
> as it can now fail.
>
> This series is functional (although not compatible with MM debug
> features yet), but still far from perfect. I’m actively refining it and
> would appreciate early feedback before I improve it further. :)
>

Thanks for looking into this.

The dtor thing poses a potential problem where a dtor acquiring
arbitrary locks can result in a deadlock during memory reclaim.

So for this to be viable one needs to ensure that in the worst case
this only ever takes leaf-locks (i.e., locks which are last in any
dependency chain -- no locks are being taken if you hold one). This
needs to demonstrate the percpu thing qualifies or needs to refactor
it to that extent.


> This series is based on slab/for-next [2].
>
> Performance Improvement
> =======================
>
> I measured the benefit of this series for two different users:
> exec() and tc filter insertion/removal.
>
> exec() throughput
> -----------------
>
> The performance of exec() is important when short-lived processes are
> frequently created. For example: shell-heavy workloads and running many
> test cases [3].
>
> I measured exec() throughput with a microbenchmark:
>   - 33% of exec() throughput gain on 2-socket machine with 192 CPUs,
>   - 4.56% gain on a desktop with 24 hardware threads, and
>   - Even 4% gain on a single-threaded exec() throughput.
>
> Further investigation showed that this was due to the overhead of
> acquiring/releasing pcpu_alloc_mutex and its contention.
>
> See patch 7 for more detail on the experiment.
>
> Traffic Filter Insertion and Removal
> ------------------------------------
>
> Each tc filter allocates three percpu memory regions per tc_action object,
> so frequently inserting and removing filters contend heavily on the same
> mutex.
>
> In the Linux-kernel tools/testing tc-filter benchmark (see patch 4 for
> more detail), I observed a 26% reduction in system time and observed
> much less contention on pcpu_alloc_mutex with this series.
>
> I saw in old mailing list threads Mellanox (now NVIDIA) engineers cared
> about tc filter insertion rate; these changes may still benefit
> workloads they run today.
>
> Feedback Needed from Percpu Allocator Folks
> ===========================================
>
> As percpu allocator is directly affected by this series, this work
> will need support from the percpu allocator maintainers, and we need to
> address their concerns.
>
> They will probably say "This is a percpu memory allocator scalability
> issue and we need to make it scalable"? I don't know.
>
> What do you say?
>
> Some hanging thoughts:
> - Tackling the problem on the slab side is much simpler, because the slab
>   allocator already caches objects per CPU. Re-creating similar logic
>   inside the percpu allocator would be redundant.
>
>   Also, since this is opt-in per slab cache, other percpu allocator
>   users remain unaffected.
>
> - If fragmentation is a concern, we could probably allocate larger percpu
>   chunks and partition them for slab objects.
>
> - If memory overhead becomes an issue, we could introduce a shrinker
>   to free empty slabs (and thus releasing underlying percpu memory chunks).
>
> Patch Sequence
> ==============
>
> Patch #1 refactors freelist_shuffle() to allow the slab constructor to
> fail in the next patch.
>
> Patch #2 allows the slab constructor fail.
>
> Patch #3 implements the slab destructor feature.
>
> Patch #4 converts net/sched/act_api to use the slab ctor/dtor pair.
>
> Patch #5, #6 implements APIs to charge and uncharge percpu memory and
> percpu counter.
>
> Patch #7 converts mm_struct to use the slab ctor/dtor pair.
>
> Known issues with MM debug features
> ===================================
>
> The slab destructor feature is not yet compatible with KASAN, KMEMLEAK,
> and DEBUG_OBJECTS.
>
> KASAN reports an error when a percpu counter is inserted into the
> percpu_counters linked list because the counter has not been allocated
> yet.
>
> DEBUG_OBJECTS and KMEMLEAK complain when the slab object is freed, while
> the associated percpu memory is still resident in memory.
>
> I don't expect fixing these issues to be too difficult, but I need to
> think a little bit to fix it.
>
> [1] https://lore.kernel.org/linux-mm/CAGudoHFc+Km-3usiy4Wdm1JkM+YjCgD9A8dDKQ06pZP070f1ig@mail.gmail.com
>
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-next
>
> [3] https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3
>
> [4] https://lore.kernel.org/netdev/vbfmunui7dm.fsf@mellanox.com
>
> Harry Yoo (7):
>   mm/slab: refactor freelist shuffle
>   treewide, slab: allow slab constructor to return an error
>   mm/slab: revive the destructor feature in slab allocator
>   net/sched/act_api: use slab ctor/dtor to reduce contention on pcpu
>     alloc
>   mm/percpu: allow (un)charging objects without alloc/free
>   lib/percpu_counter: allow (un)charging percpu counters without
>     alloc/free
>   kernel/fork: improve exec() throughput with slab ctor/dtor pair
>
>  arch/powerpc/include/asm/svm.h            |   2 +-
>  arch/powerpc/kvm/book3s_64_mmu_radix.c    |   3 +-
>  arch/powerpc/mm/init-common.c             |   3 +-
>  arch/powerpc/platforms/cell/spufs/inode.c |   3 +-
>  arch/powerpc/platforms/pseries/setup.c    |   2 +-
>  arch/powerpc/platforms/pseries/svm.c      |   4 +-
>  arch/sh/mm/pgtable.c                      |   3 +-
>  arch/sparc/mm/tsb.c                       |   8 +-
>  block/bdev.c                              |   3 +-
>  drivers/dax/super.c                       |   3 +-
>  drivers/gpu/drm/i915/i915_request.c       |   3 +-
>  drivers/misc/lkdtm/heap.c                 |  12 +--
>  drivers/usb/mon/mon_text.c                |   5 +-
>  fs/9p/v9fs.c                              |   3 +-
>  fs/adfs/super.c                           |   3 +-
>  fs/affs/super.c                           |   3 +-
>  fs/afs/super.c                            |   5 +-
>  fs/befs/linuxvfs.c                        |   3 +-
>  fs/bfs/inode.c                            |   3 +-
>  fs/btrfs/inode.c                          |   3 +-
>  fs/ceph/super.c                           |   3 +-
>  fs/coda/inode.c                           |   3 +-
>  fs/debugfs/inode.c                        |   3 +-
>  fs/dlm/lowcomms.c                         |   3 +-
>  fs/ecryptfs/main.c                        |   5 +-
>  fs/efs/super.c                            |   3 +-
>  fs/erofs/super.c                          |   3 +-
>  fs/exfat/cache.c                          |   3 +-
>  fs/exfat/super.c                          |   3 +-
>  fs/ext2/super.c                           |   3 +-
>  fs/ext4/super.c                           |   3 +-
>  fs/fat/cache.c                            |   3 +-
>  fs/fat/inode.c                            |   3 +-
>  fs/fuse/inode.c                           |   3 +-
>  fs/gfs2/main.c                            |   9 +-
>  fs/hfs/super.c                            |   3 +-
>  fs/hfsplus/super.c                        |   3 +-
>  fs/hpfs/super.c                           |   3 +-
>  fs/hugetlbfs/inode.c                      |   3 +-
>  fs/inode.c                                |   3 +-
>  fs/isofs/inode.c                          |   3 +-
>  fs/jffs2/super.c                          |   3 +-
>  fs/jfs/super.c                            |   3 +-
>  fs/minix/inode.c                          |   3 +-
>  fs/nfs/inode.c                            |   3 +-
>  fs/nfs/nfs42xattr.c                       |   3 +-
>  fs/nilfs2/super.c                         |   6 +-
>  fs/ntfs3/super.c                          |   3 +-
>  fs/ocfs2/dlmfs/dlmfs.c                    |   3 +-
>  fs/ocfs2/super.c                          |   3 +-
>  fs/openpromfs/inode.c                     |   3 +-
>  fs/orangefs/super.c                       |   3 +-
>  fs/overlayfs/super.c                      |   3 +-
>  fs/pidfs.c                                |   3 +-
>  fs/proc/inode.c                           |   3 +-
>  fs/qnx4/inode.c                           |   3 +-
>  fs/qnx6/inode.c                           |   3 +-
>  fs/romfs/super.c                          |   3 +-
>  fs/smb/client/cifsfs.c                    |   3 +-
>  fs/squashfs/super.c                       |   3 +-
>  fs/tracefs/inode.c                        |   3 +-
>  fs/ubifs/super.c                          |   3 +-
>  fs/udf/super.c                            |   3 +-
>  fs/ufs/super.c                            |   3 +-
>  fs/userfaultfd.c                          |   3 +-
>  fs/vboxsf/super.c                         |   3 +-
>  fs/xfs/xfs_super.c                        |   3 +-
>  include/linux/mm_types.h                  |  40 ++++++---
>  include/linux/percpu.h                    |  10 +++
>  include/linux/percpu_counter.h            |   2 +
>  include/linux/slab.h                      |  21 +++--
>  ipc/mqueue.c                              |   3 +-
>  kernel/fork.c                             |  65 +++++++++-----
>  kernel/rcu/refscale.c                     |   3 +-
>  lib/percpu_counter.c                      |  25 ++++++
>  lib/radix-tree.c                          |   3 +-
>  lib/test_meminit.c                        |   3 +-
>  mm/kfence/kfence_test.c                   |   5 +-
>  mm/percpu.c                               |  79 ++++++++++------
>  mm/rmap.c                                 |   3 +-
>  mm/shmem.c                                |   3 +-
>  mm/slab.h                                 |  11 +--
>  mm/slab_common.c                          |  43 +++++----
>  mm/slub.c                                 | 105 ++++++++++++++++------
>  net/sched/act_api.c                       |  82 +++++++++++------
>  net/socket.c                              |   3 +-
>  net/sunrpc/rpc_pipe.c                     |   3 +-
>  security/integrity/ima/ima_iint.c         |   3 +-
>  88 files changed, 518 insertions(+), 226 deletions(-)
>
> --
> 2.43.0
>


-- 
Mateusz Guzik <mjguzik gmail.com>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ