Message-ID: <aL672Jeqi99atefN@hyeyoo>
Date: Mon, 8 Sep 2025 20:19:52 +0900
From: Harry Yoo <harry.yoo@...cle.com>
To: Vlastimil Babka <vbabka@...e.cz>
Cc: Suren Baghdasaryan <surenb@...gle.com>,
        "Liam R. Howlett" <Liam.Howlett@...cle.com>,
        Christoph Lameter <cl@...two.org>,
        David Rientjes <rientjes@...gle.com>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Uladzislau Rezki <urezki@...il.com>,
        Sidhartha Kumar <sidhartha.kumar@...cle.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, rcu@...r.kernel.org,
        maple-tree@...ts.infradead.org,
        Venkat Rao Bagalkote <venkat88@...ux.ibm.com>
Subject: Re: [PATCH v7 03/21] slab: add opt-in caching layer of percpu sheaves

On Wed, Sep 03, 2025 at 02:59:45PM +0200, Vlastimil Babka wrote:
> Specifying a non-zero value for a new struct kmem_cache_args field
> sheaf_capacity will set up a caching layer of percpu arrays, called
> sheaves, of the given capacity for the created cache.
> 
> Allocations from the cache will allocate via the percpu sheaves (main or
> spare) as long as they have no NUMA node preference. Frees will also
> put the object back into one of the sheaves.
> 
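Just to illustrate the opt-in for readers following along, this is
roughly what it looks like (a sketch of mine, with hypothetical cache
and object names, not taken from the series):

	/* "my_objs" / struct my_obj are hypothetical, for illustration only */
	struct kmem_cache_args args = {
		.sheaf_capacity = 32,	/* objects per percpu sheaf */
	};
	struct kmem_cache *cache;

	cache = kmem_cache_create("my_objs", sizeof(struct my_obj),
				  &args, SLAB_HWCACHE_ALIGN);
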
> When both percpu sheaves are found empty during an allocation, an empty
> sheaf may be replaced with a full one from the per-node barn. If none
> are available and the allocation is allowed to block, an empty sheaf is
> refilled from slab(s) by an internal bulk alloc operation. When both
> percpu sheaves are full during freeing, the barn can replace a full one
> with an empty one, unless the barn is over its limit of full sheaves. In
> that case a sheaf is flushed to slab(s) by an internal bulk free
> operation. Flushing
> sheaves and barns is also wired to the existing cpu flushing and cache
> shrinking operations.
> 
> The sheaves do not distinguish NUMA locality of the cached objects. If
> an allocation is requested with kmem_cache_alloc_node() (or a mempolicy
> with strict_numa mode enabled) with a specific node (not NUMA_NO_NODE),
> the sheaves are bypassed.
> 
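So, as I understand it (sketch of mine; "cache" and "nid" are just
placeholders):

	/* may be served from the percpu sheaves */
	obj = kmem_cache_alloc(cache, GFP_KERNEL);
	obj = kmem_cache_alloc_node(cache, GFP_KERNEL, NUMA_NO_NODE);

	/* a specific node is requested, so the sheaves are bypassed */
	obj = kmem_cache_alloc_node(cache, GFP_KERNEL, nid);
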
> The bulk operations exposed to slab users also try to utilize the
> sheaves as long as the necessary (full or empty) sheaves are available
> on the cpu or in the barn. Once depleted, they fall back to bulk
> alloc/free directly to/from the slabs to avoid double copying.
> 
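For reference, the bulk interface in question (usage sketch of mine,
nothing new in the interface itself):

	void *objs[16];
	int allocated;

	/* uses a full sheaf when one is available, otherwise falls back
	 * to bulk allocation from the slabs directly */
	allocated = kmem_cache_alloc_bulk(cache, GFP_KERNEL,
					  ARRAY_SIZE(objs), objs);
	/* ... use the objects ... */
	kmem_cache_free_bulk(cache, allocated, objs);
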
> The sheaf_capacity value is exported in sysfs for observability.
> 
> Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf
> count objects allocated or freed using the sheaves (and thus not
> counting towards the other alloc/free path counters). Counters
> sheaf_refill and sheaf_flush count objects filled or flushed from or to
> slab pages, and can be used to assess how effective the caching is. The
> refill and flush operations will also count towards the usual
> alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for
> the backing slabs. For barn operations, barn_get and barn_put count how
> many full sheaves were taken from or put to the barn, and the _fail
> variants count how many such requests could not be satisfied, mainly
> because the barn was either empty or full. While the barn also holds
> empty sheaves to make some operations easier, these are not critical
> enough to warrant their own counters. Finally, there are
> sheaf_alloc/sheaf_free counters.
> 
> Access to the percpu sheaves is protected by local_trylock() when
> potential callers include irq context, and local_lock() otherwise (such
> as when we already know the gfp flags allow blocking). The trylock
> failures should be rare and we can easily fall back. Each per-NUMA-node
> barn has a spin_lock.
> 
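IOW the fast path pattern is roughly (simplified sketch of mine, based
on the code below):

	if (unlikely(!local_trylock(&s->cpu_sheaves->lock)))
		return NULL;	/* caller falls back to the regular slab path */

	pcs = this_cpu_ptr(s->cpu_sheaves);
	/* ... take an object from / put it back into pcs->main ... */
	local_unlock(&s->cpu_sheaves->lock);
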
> When slub_debug is enabled for a cache with sheaf_capacity also
> specified, the latter is ignored so that allocations and frees reach the
> slow path where debugging hooks are processed. Similarly, we ignore it
> with CONFIG_SLUB_TINY which prefers low memory usage to performance.
> 
> [boot failure: https://lore.kernel.org/all/583eacf5-c971-451a-9f76-fed0e341b815@linux.ibm.com/ ]
> Reported-and-tested-by: Venkat Rao Bagalkote <venkat88@...ux.ibm.com>
> Signed-off-by: Vlastimil Babka <vbabka@...e.cz>
> ---
>  include/linux/slab.h |   31 ++
>  mm/slab.h            |    2 +
>  mm/slab_common.c     |    5 +-
>  mm/slub.c            | 1164 +++++++++++++++++++++++++++++++++++++++++++++++---
>  4 files changed, 1143 insertions(+), 59 deletions(-)
> 
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index bfe7c40eeee1a01c175766935c1e3c0304434a53..e2b197e47866c30acdbd1fee4159f262a751c5a7 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -163,6 +163,9 @@ int slab_unmergeable(struct kmem_cache *s)
>  		return 1;
>  #endif
>  
> +	if (s->cpu_sheaves)
> +		return 1;
> +
>  	/*
>  	 * We may have set a slab to be unmergeable during bootstrap.
>  	 */
> @@ -321,7 +324,7 @@ struct kmem_cache *__kmem_cache_create_args(const char *name,
>  		    object_size - args->usersize < args->useroffset))
>  		args->usersize = args->useroffset = 0;
>  
> -	if (!args->usersize)
> +	if (!args->usersize && !args->sheaf_capacity)
>  		s = __kmem_cache_alias(name, object_size, args->align, flags,
>  				       args->ctor);

Can we merge caches that use sheaves in the future if the capacity
is the same, or are there any restrictions for merging that I overlooked?
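If there's no fundamental restriction, maybe something like this could
work later (hypothetical and untested; sheaf_capacity would have to be
plumbed down into find_mergeable() as a new parameter):

	/* in find_mergeable(): skip candidates with a different sheaf setup */
	if ((s->cpu_sheaves ? s->sheaf_capacity : 0) != sheaf_capacity)
		continue;
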

>  /*
>   * Slab allocation and freeing
>   */
> @@ -3344,11 +3748,42 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
>  	put_partials_cpu(s, c);
>  }
>  
> -struct slub_flush_work {
> -	struct work_struct work;
> -	struct kmem_cache *s;
> -	bool skip;
> -};
> +static inline void flush_this_cpu_slab(struct kmem_cache *s)
> +{
> +	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
> +
> +	if (c->slab)
> +		flush_slab(s, c);
> +
> +	put_partials(s);
> +}
> +
> +static bool has_cpu_slab(int cpu, struct kmem_cache *s)
> +{
> +	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
> +
> +	return c->slab || slub_percpu_partial(c);
> +}
> +
> +#else /* CONFIG_SLUB_TINY */
> +static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { }
> +static inline bool has_cpu_slab(int cpu, struct kmem_cache *s) { return false; }
> +static inline void flush_this_cpu_slab(struct kmem_cache *s) { }
> +#endif /* CONFIG_SLUB_TINY */
> +
> +static bool has_pcs_used(int cpu, struct kmem_cache *s)
> +{
> +	struct slub_percpu_sheaves *pcs;
> +
> +	if (!s->cpu_sheaves)
> +		return false;
> +
> +	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> +
> +	return (pcs->spare || pcs->main->size);
> +}
> +
> +static void pcs_flush_all(struct kmem_cache *s);

nit: we don't need these sheaf-flushing functions when SLUB_TINY=y,
since we no longer create sheaves for SLUB_TINY?
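i.e. roughly (untested):

#ifdef CONFIG_SLUB_TINY
static inline bool has_pcs_used(int cpu, struct kmem_cache *s)
{
	return false;
}
static inline void pcs_flush_all(struct kmem_cache *s) { }
#else
/* ... the definitions above ... */
#endif
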

>  /*
>   * Flush cpu slab.
> @@ -3358,30 +3793,18 @@ struct slub_flush_work {
>  static void flush_cpu_slab(struct work_struct *w)
>  {
>  	struct kmem_cache *s;
> -	struct kmem_cache_cpu *c;
>  	struct slub_flush_work *sfw;
>  
>  	sfw = container_of(w, struct slub_flush_work, work);
>  
>  	s = sfw->s;
> -	c = this_cpu_ptr(s->cpu_slab);
> -
> -	if (c->slab)
> -		flush_slab(s, c);
> -
> -	put_partials(s);
> -}
>  
> -static bool has_cpu_slab(int cpu, struct kmem_cache *s)
> -{
> -	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
> +	if (s->cpu_sheaves)
> +		pcs_flush_all(s);
>  
> -	return c->slab || slub_percpu_partial(c);
> +	flush_this_cpu_slab(s);
>  } 
> -#else /* CONFIG_SLUB_TINY */
> -static inline void flush_all_cpus_locked(struct kmem_cache *s) { }
> -static inline void flush_all(struct kmem_cache *s) { }
> -static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { }
> -static inline int slub_cpu_dead(unsigned int cpu) { return 0; }
> -#endif /* CONFIG_SLUB_TINY */
> -
>  /*
>   * Check if the objects in a per cpu structure fit numa
>   * locality expectations.
> @@ -4191,30 +4610,240 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>  }
>  
>  /*
> - * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
> - * have the fastpath folded into their functions. So no function call
> - * overhead for requests that can be satisfied on the fastpath.
> - *
> - * The fastpath works by first checking if the lockless freelist can be used.
> - * If not then __slab_alloc is called for slow processing.
> + * Replace the empty main sheaf with a (at least partially) full sheaf.
>   *
> - * Otherwise we can simply pick the next object from the lockless free list.
> + * Must be called with the cpu_sheaves local lock locked. If successful, returns
> + * the pcs pointer and the local lock locked (possibly on a different cpu than
> + * initially called). If not successful, returns NULL and the local lock
> + * unlocked.
>   */
> -static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list_lru *lru,
> -		gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
> +static struct slub_percpu_sheaves *
> +__pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs, gfp_t gfp)
>  {
> -	void *object;
> -	bool init = false;
> +	struct slab_sheaf *empty = NULL;
> +	struct slab_sheaf *full;
> +	struct node_barn *barn;
> +	bool can_alloc;
>  
> -	s = slab_pre_alloc_hook(s, gfpflags);
> -	if (unlikely(!s))
> +	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
> +
> +	if (pcs->spare && pcs->spare->size > 0) {
> +		swap(pcs->main, pcs->spare);
> +		return pcs;
> +	}
> +
> +	barn = get_barn(s);
> +
> +	full = barn_replace_empty_sheaf(barn, pcs->main);
> +
> +	if (full) {
> +		stat(s, BARN_GET);
> +		pcs->main = full;
> +		return pcs;
> +	}
> +
> +	stat(s, BARN_GET_FAIL);
> +
> +	can_alloc = gfpflags_allow_blocking(gfp);
> +
> +	if (can_alloc) {
> +		if (pcs->spare) {
> +			empty = pcs->spare;
> +			pcs->spare = NULL;
> +		} else {
> +			empty = barn_get_empty_sheaf(barn);
> +		}
> +	}
> +
> +	local_unlock(&s->cpu_sheaves->lock);
> +
> +	if (!can_alloc)
> +		return NULL;
> +
> +	if (empty) {
> +		if (!refill_sheaf(s, empty, gfp)) {
> +			full = empty;
> +		} else {
> +			/*
> +			 * we must be very low on memory so don't bother
> +			 * with the barn
> +			 */
> +			free_empty_sheaf(s, empty);
> +		}
> +	} else {
> +		full = alloc_full_sheaf(s, gfp);
> +	}
> +
> +	if (!full)
> +		return NULL;
> +
> +	/*
> +	 * we can reach here only when gfpflags_allow_blocking
> +	 * so this must not be an irq
> +	 */
> +	local_lock(&s->cpu_sheaves->lock);
> +	pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +	/*
> +	 * If we are returning empty sheaf, we either got it from the
> +	 * barn or had to allocate one. If we are returning a full
> +	 * sheaf, it's due to racing or being migrated to a different
> +	 * cpu. Breaching the barn's sheaf limits should be thus rare
> +	 * enough so just ignore them to simplify the recovery.
> +	 */
> +
> +	if (pcs->main->size == 0) {
> +		barn_put_empty_sheaf(barn, pcs->main);

It should be very rare, but shouldn't it do barn = get_barn(s); again
after retaking s->cpu_sheaves->lock? We might have been migrated to a
cpu on a different node by then.
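i.e. roughly:

	local_lock(&s->cpu_sheaves->lock);
	pcs = this_cpu_ptr(s->cpu_sheaves);
	/* we may have been migrated to a cpu on a different node */
	barn = get_barn(s);
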

> +		pcs->main = full;
> +		return pcs;
> +	}
> +
> +	if (!pcs->spare) {
> +		pcs->spare = full;
> +		return pcs;
> +	}
> +
> +	if (pcs->spare->size == 0) {
> +		barn_put_empty_sheaf(barn, pcs->spare);
> +		pcs->spare = full;
> +		return pcs;
> +	}
> +
> +	barn_put_full_sheaf(barn, full);
> +	stat(s, BARN_PUT);
> +
> +	return pcs;
> +}
> @@ -4591,6 +5220,295 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>  	discard_slab(s, slab);
>  }
>  
> +/*
> + * Replace the full main sheaf with a (at least partially) empty sheaf.
> + *
> + * Must be called with the cpu_sheaves local lock locked. If successful, returns
> + * the pcs pointer and the local lock locked (possibly on a different cpu than
> + * initially called). If not successful, returns NULL and the local lock
> + * unlocked.
> + */
> +static struct slub_percpu_sheaves *
> +__pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
> +{
> +	struct slab_sheaf *empty;
> +	struct node_barn *barn;
> +	bool put_fail;
> +
> +restart:
> +	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
> +
> +	barn = get_barn(s);
> +	put_fail = false;
> +
> +	if (!pcs->spare) {
> +		empty = barn_get_empty_sheaf(barn);
> +		if (empty) {
> +			pcs->spare = pcs->main;
> +			pcs->main = empty;
> +			return pcs;
> +		}
> +		goto alloc_empty;
> +	}
> +
> +	if (pcs->spare->size < s->sheaf_capacity) {
> +		swap(pcs->main, pcs->spare);
> +		return pcs;
> +	}
> +
> +	empty = barn_replace_full_sheaf(barn, pcs->main);
> +
> +	if (!IS_ERR(empty)) {
> +		stat(s, BARN_PUT);
> +		pcs->main = empty;
> +		return pcs;
> +	}
> +
> +	if (PTR_ERR(empty) == -E2BIG) {
> +		/* Since we got here, spare exists and is full */
> +		struct slab_sheaf *to_flush = pcs->spare;
> +
> +		stat(s, BARN_PUT_FAIL);
> +
> +		pcs->spare = NULL;
> +		local_unlock(&s->cpu_sheaves->lock);
> +
> +		sheaf_flush_unused(s, to_flush);
> +		empty = to_flush;
> +		goto got_empty;
> +	}
> +
> +	/*
> +	 * We could not replace full sheaf because barn had no empty
> +	 * sheaves. We can still allocate it and put the full sheaf in
> +	 * __pcs_install_empty_sheaf(), but if we fail to allocate it,
> +	 * make sure to count the fail.
> +	 */
> +	put_fail = true;
> +
> +alloc_empty:
> +	local_unlock(&s->cpu_sheaves->lock);
> +
> +	empty = alloc_empty_sheaf(s, GFP_NOWAIT);
> +	if (empty)
> +		goto got_empty;
> +
> +	if (put_fail)
> +		 stat(s, BARN_PUT_FAIL);
> +
> +	if (!sheaf_flush_main(s))
> +		return NULL;
> +
> +	if (!local_trylock(&s->cpu_sheaves->lock))
> +		return NULL;
> +
> +	pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +	/*
> +	 * we flushed the main sheaf so it should be empty now,
> +	 * but in case we got preempted or migrated, we need to
> +	 * check again
> +	 */
> +	if (pcs->main->size == s->sheaf_capacity)
> +		goto restart;
> +
> +	return pcs;
> +
> +got_empty:
> +	if (!local_trylock(&s->cpu_sheaves->lock)) {
> +		barn_put_empty_sheaf(barn, empty);

Same here, we might have gotten migrated to a different node.

> +		return NULL;
> +	}
> +
> +	pcs = this_cpu_ptr(s->cpu_sheaves);
> +	__pcs_install_empty_sheaf(s, pcs, empty);
> +
> +	return pcs;
> +}

Otherwise looks good to me!

-- 
Cheers,
Harry / Hyeonggon
