Message-ID: <aBAmi38oWka6ckjk@harry>
Date: Tue, 29 Apr 2025 10:08:27 +0900
From: Harry Yoo <harry.yoo@...cle.com>
To: Vlastimil Babka <vbabka@...e.cz>
Cc: Suren Baghdasaryan <surenb@...gle.com>,
        "Liam R. Howlett" <Liam.Howlett@...cle.com>,
        Christoph Lameter <cl@...ux.com>, David Rientjes <rientjes@...gle.com>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Uladzislau Rezki <urezki@...il.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, rcu@...r.kernel.org,
        maple-tree@...ts.infradead.org
Subject: Re: [PATCH v4 1/9] slab: add opt-in caching layer of percpu sheaves

On Fri, Apr 25, 2025 at 10:27:21AM +0200, Vlastimil Babka wrote:
> Specifying a non-zero value for a new struct kmem_cache_args field
> sheaf_capacity will set up a caching layer of percpu arrays called
> sheaves of given capacity for the created cache.
> 
> Allocations from the cache will allocate via the percpu sheaves (main or
> spare) as long as they have no NUMA node preference. Frees will also
> put the object back into one of the sheaves.
> 
> When both percpu sheaves are found empty during an allocation, an empty
> sheaf may be replaced with a full one from the per-node barn. If none
> are available and the allocation is allowed to block, an empty sheaf is
> refilled from slab(s) by an internal bulk alloc operation. When both
> percpu sheaves are full during freeing, the barn can replace a full one
> with an empty one, unless the barn is over its limit of full sheaves. In
> that case a sheaf is flushed to slab(s) by an internal bulk free
> operation. Flushing
> sheaves and barns is also wired to the existing cpu flushing and cache
> shrinking operations.
> 
> The sheaves do not distinguish NUMA locality of the cached objects. If
> an allocation is requested with kmem_cache_alloc_node() (or a mempolicy
> with strict_numa mode enabled) with a specific node (not NUMA_NO_NODE),
> the sheaves are bypassed.
> 
> The bulk operations exposed to slab users also try to utilize the
> sheaves as long as the necessary (full or empty) sheaves are available
> on the cpu or in the barn. Once depleted, they will fall back to bulk
> alloc/free to slabs directly to avoid double copying.
> 
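(Aside for readers who haven't used the bulk API: the operations meant
here are kmem_cache_alloc_bulk() / kmem_cache_free_bulk(). A minimal,
made-up caller, just to illustrate which calls benefit from the sheaves:)

	static int example_bulk_user(struct kmem_cache *s)
	{
		void *objs[16];
		int n;

		/* filled from a full sheaf when one is available on this cpu
		 * or in the barn, otherwise bulk-allocated from slabs */
		n = kmem_cache_alloc_bulk(s, GFP_KERNEL, ARRAY_SIZE(objs), objs);
		if (!n)
			return -ENOMEM;

		/* ... use objs[0..n-1] ... */

		/* stashed into sheaves when possible, otherwise bulk-freed
		 * straight back to slabs */
		kmem_cache_free_bulk(s, n, objs);
		return 0;
	}
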
> The sheaf_capacity value is exported in sysfs for observability.
> 
> Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf
> count objects allocated or freed using the sheaves (and thus not
> counting towards the other alloc/free path counters). Counters
> sheaf_refill and sheaf_flush count objects filled or flushed from or to
> slab pages, and can be used to assess how effective the caching is. The
> refill and flush operations will also count towards the usual
> alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for
> the backing slabs.  For barn operations, barn_get and barn_put count how
> many full sheaves were taken from or put to the barn, and the _fail
> variants count how many such requests could not be satisfied, mainly
> because the barn was either empty or full.

> While the barn also holds empty sheaves
> to make some operations easier, these are not critical enough to warrant
> their own counters.  Finally, there are sheaf_alloc/sheaf_free counters.

I initially thought we'd need counters for empty sheaves to see how many
times an empty sheaf is grabbed from the barn, but it looks like barn_put
("put full sheaves to the barn") is effectively a proxy for that, right?

> Access to the percpu sheaves is protected by local_trylock() when
> potential callers include irq context, and local_lock() otherwise (such
> as when we already know the gfp flags allow blocking). The trylock
> failures should be rare and we can easily fall back. Each per-NUMA-node
> barn has a spin_lock.
> 
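(The pattern, as e.g. free_to_pcs() below shows, boils down to the
following; the comments here are mine:)

	if (!local_trylock(&s->cpu_sheaves->lock))
		return false;	/* rare; caller takes the regular slow path */

	pcs = this_cpu_ptr(s->cpu_sheaves);
	/* ... operate on pcs->main / pcs->spare ... */
	local_unlock(&s->cpu_sheaves->lock);
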
> When slub_debug is enabled for a cache with sheaf_capacity also
> specified, the latter is ignored so that allocations and frees reach the
> slow path where debugging hooks are processed.
> 
> Signed-off-by: Vlastimil Babka <vbabka@...e.cz>
> ---

Reviewed-by: Harry Yoo <harry.yoo@...cle.com>

LGTM, with a few nits:

>  include/linux/slab.h |   31 ++
>  mm/slab.h            |    2 +
>  mm/slab_common.c     |    5 +-
>  mm/slub.c            | 1053 +++++++++++++++++++++++++++++++++++++++++++++++---
>  4 files changed, 1044 insertions(+), 47 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index d5a8ab98035cf3e3d9043e3b038e1bebeff05b52..4cb495d55fc58c70a992ee4782d7990ce1c55dc6 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -335,6 +335,37 @@ struct kmem_cache_args {
> 	 * %NULL means no constructor.
> 	 */
> 	void (*ctor)(void *);
>	/**
>	 * @sheaf_capacity: Enable sheaves of given capacity for the cache.
>	 *
>	 * With a non-zero value, allocations from the cache go through caching
>	 * arrays called sheaves. Each cpu has a main sheaf that's always
>	 * present, and a spare sheaf that may not be present. When both become
>	 * empty, there's an attempt to replace an empty sheaf with a full sheaf
>	 * from the per-node barn.
>	 *
>	 * When no full sheaf is available, and gfp flags allow blocking, a
>	 * sheaf is allocated and filled from slab(s) using bulk allocation.
>	 * Otherwise the allocation falls back to the normal operation
>	 * allocating a single object from a slab.
>	 *
>	 * Analogously, when freeing and both percpu sheaves are full, the barn
>	 * may replace one of them with an empty sheaf, unless it's over capacity.
>	 * In that case a sheaf is bulk freed to slab pages.
>	 *
>	 * The sheaves do not enforce NUMA placement of objects, so allocations
>	 * via kmem_cache_alloc_node() with a node specified other than
>	 * NUMA_NO_NODE will bypass them.
>	 *
>	 * Bulk allocation and free operations also try to use the cpu sheaves
>	 * and barn, but fall back to using slab pages directly.
>	 *
>	 * When slub_debug is enabled for the cache, the sheaf_capacity argument
>	 * is ignored.
>	 *
>	 * %0 means no sheaves will be created

nit: "created" -> "created." (missing trailing full stop)

>	 */
>	unsigned int sheaf_capacity;
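
Perhaps worth adding a short usage example to the kerneldoc, e.g. (the
cache/struct names and the capacity value below are made up by me):

	struct kmem_cache_args args = {
		.sheaf_capacity = 32,
	};

	foo_cachep = kmem_cache_create("foo", sizeof(struct foo), &args, 0);

with the rest of kmem_cache_args left at defaults.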

> diff --git a/mm/slub.c b/mm/slub.c
> index dc9e729e1d269b5d362cb5bc44f824640ffd00f3..ae3e80ad9926ca15601eef2f2aa016ca059498f8 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> +static void pcs_destroy(struct kmem_cache *s)
> +{
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		struct slub_percpu_sheaves *pcs;
> +
> +		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> +
> +		/* can happen when unwinding failed create */
> +		if (!pcs->main)
> +			continue;
> +
> +		/*
> +		 * We have already passed __kmem_cache_shutdown() so everything
> +		 * was flushed and there should be no objects allocated from
> +		 * slabs, otherwise kmem_cache_destroy() would have aborted.
> +		 * Therefore something would have to be really wrong if the
> +		 * warnings here trigger, and we should rather leave bojects and

nit: bojects -> objects

> +		 * sheaves to leak in that case.
> +		 */
> +
> +		WARN_ON(pcs->spare);
> +
> +		if (!WARN_ON(pcs->main->size)) {
> +			free_empty_sheaf(s, pcs->main);
> +			pcs->main = NULL;
> +		}
> +	}
> +
> +	free_percpu(s->cpu_sheaves);
> +	s->cpu_sheaves = NULL;
> +}
> +
> +/*
> + * If a empty sheaf is available, return it and put the supplied full one to

nit: a empty sheaf -> an empty sheaf

> + * barn. But if there are too many full sheaves, reject this with -E2BIG.
> + */
>
> +static struct slab_sheaf *
> +barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
> @@ -4567,6 +5169,234 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>  	discard_slab(s, slab);
>  }
>  
> +/*
> + * Free an object to the percpu sheaves.
> + * The object is expected to have passed slab_free_hook() already.
> + */
> +static __fastpath_inline
> +bool free_to_pcs(struct kmem_cache *s, void *object)
> +{
> +	struct slub_percpu_sheaves *pcs;
> +
> +restart:
> +	if (!local_trylock(&s->cpu_sheaves->lock))
> +		return false;
> +
> +	pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
> +
> +		struct slab_sheaf *empty;
> +
> +		if (!pcs->spare) {
> +			empty = barn_get_empty_sheaf(pcs->barn);
> +			if (empty) {
> +				pcs->spare = pcs->main;
> +				pcs->main = empty;
> +				goto do_free;
> +			}
> +			goto alloc_empty;
> +		}
> +
> +		if (pcs->spare->size < s->sheaf_capacity) {
> +			swap(pcs->main, pcs->spare);
> +			goto do_free;
> +		}
> +
> +		empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> +
> +		if (!IS_ERR(empty)) {
> +			stat(s, BARN_PUT);
> +			pcs->main = empty;
> +			goto do_free;
> +		}

nit: stat(s, BARN_PUT_FAIL); should probably be here instead?
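i.e. move it up here so the failure is counted regardless of the error,
something like this (and drop the stat() call from the -E2BIG branch
below):

	if (!IS_ERR(empty)) {
		stat(s, BARN_PUT);
		pcs->main = empty;
		goto do_free;
	}

	stat(s, BARN_PUT_FAIL);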

> +
> +		if (PTR_ERR(empty) == -E2BIG) {
> +			/* Since we got here, spare exists and is full */
> +			struct slab_sheaf *to_flush = pcs->spare;
> +
> +			stat(s, BARN_PUT_FAIL);
> +
> +			pcs->spare = NULL;
> +			local_unlock(&s->cpu_sheaves->lock);
> +
> +			sheaf_flush_unused(s, to_flush);
> +			empty = to_flush;
> +			goto got_empty;
> +		}

> @@ -6455,6 +7374,16 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
>  
>  	set_cpu_partial(s);
>  
> +	if (args->sheaf_capacity && !(s->flags & SLAB_DEBUG_FLAGS)) {
> +		s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);

nit: You probably want to disable sheaves with CONFIG_SLUB_TINY=y too?
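e.g. (untested):

	if (args->sheaf_capacity && !IS_ENABLED(CONFIG_SLUB_TINY) &&
	    !(s->flags & SLAB_DEBUG_FLAGS)) {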

> +		if (!s->cpu_sheaves) {
> +			err = -ENOMEM;
> +			goto out;
> +		}
> +		// TODO: increase capacity to grow slab_sheaf up to next kmalloc size?
> +		s->sheaf_capacity = args->sheaf_capacity;
> +	}
> +

-- 
Cheers,
Harry / Hyeonggon
