Message-ID: <Z-5HWApFjrOr7Q8_@harry>
Date: Thu, 3 Apr 2025 17:31:20 +0900
From: Harry Yoo <harry.yoo@...cle.com>
To: Vlastimil Babka <vbabka@...e.cz>
Cc: Suren Baghdasaryan <surenb@...gle.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Christoph Lameter <cl@...ux.com>, David Rientjes <rientjes@...gle.com>,
Roman Gushchin <roman.gushchin@...ux.dev>,
Uladzislau Rezki <urezki@...il.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, rcu@...r.kernel.org,
maple-tree@...ts.infradead.org
Subject: Re: [PATCH RFC v3 2/8] slab: add opt-in caching layer of percpu
sheaves
On Mon, Mar 17, 2025 at 03:33:03PM +0100, Vlastimil Babka wrote:
> Specifying a non-zero value for a new struct kmem_cache_args field
> sheaf_capacity will set up a caching layer of percpu arrays called
> sheaves of the given capacity for the created cache.
>
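(Just to check my understanding of the interface: opting in would look
roughly like the sketch below, assuming the kmem_cache_args setup described
above; the cache name, object type and capacity here are made up.)

        struct kmem_cache_args args = {
                .sheaf_capacity = 32,   /* hypothetical capacity */
        };
        struct kmem_cache *s;

        s = kmem_cache_create("foo_cache", sizeof(struct foo), &args, 0);
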
> Allocations from the cache will allocate via the percpu sheaves (main or
> spare) as long as they have no NUMA node preference. Frees will also
> refill one of the sheaves.
>
> When both percpu sheaves are found empty during an allocation, an empty
> sheaf may be replaced with a full one from the per-node barn. If none
> are available and the allocation is allowed to block, an empty sheaf is
> refilled from slab(s) by an internal bulk alloc operation. When both
> percpu sheaves are full during freeing, the barn can replace a full one
> with an empty one, unless over a full sheaves limit. In that case a
> sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
> sheaves and barns is also wired to the existing cpu flushing and cache
> shrinking operations.
>
> The sheaves do not distinguish NUMA locality of the cached objects. If
> an allocation is requested with kmem_cache_alloc_node() with a specific
> node (not NUMA_NO_NODE), sheaves are bypassed.
>
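IIUC that means the node argument alone decides whether the sheaves are
even consulted, e.g. (sketch):

        /* no node preference, may be served from the percpu sheaves */
        obj = kmem_cache_alloc(s, GFP_KERNEL);
        obj = kmem_cache_alloc_node(s, GFP_KERNEL, NUMA_NO_NODE);

        /* explicit node requested, the sheaves are bypassed */
        obj = kmem_cache_alloc_node(s, GFP_KERNEL, 1);
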
> The bulk operations exposed to slab users also try to utilize the
> sheaves as long as the necessary (full or empty) sheaves are available
> on the cpu or in the barn. Once depleted, they will fall back to bulk
> alloc/free to slabs directly to avoid double copying.
>
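IIUC the bulk API is unchanged for callers; it just prefers the sheaves and
falls back to the slab bulk paths once they are depleted, e.g. (sketch):

        void *objs[16];
        int n;

        n = kmem_cache_alloc_bulk(s, GFP_KERNEL, ARRAY_SIZE(objs), objs);
        if (n)
                kmem_cache_free_bulk(s, n, objs);
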
> Sysfs stat counters alloc_cpu_sheaf and free_cpu_sheaf count objects
> allocated or freed using the sheaves. Counters sheaf_refill,
> sheaf_flush_main and sheaf_flush_other count objects filled or flushed
> from or to slab pages, and can be used to assess how effective the
> caching is. The refill and flush operations will also count towards the
> usual alloc_fastpath/slowpath, free_fastpath/slowpath and other
> counters.
>
> Access to the percpu sheaves is protected by localtry_trylock() when
> potential callers include irq context, and localtry_lock() otherwise
> (such as when we already know the gfp flags allow blocking). The trylock
> failures should be rare and we can easily fall back. Each per-NUMA-node
> barn has a spin_lock.
>
> A current limitation is that when slub_debug is enabled for a cache with
> percpu sheaves, the objects in the array are considered as allocated from
> the slub_debug perspective, and the alloc/free debugging hooks occur
> when moving the objects between the array and slab pages. This means
> that e.g. a use-after-free that occurs for an object cached in the
> array is undetected. Collected alloc/free stacktraces might also be less
> useful. This limitation could be changed in the future.
>
> On the other hand, KASAN, kmemcg and other hooks are executed on actual
> allocations and frees by kmem_cache users even if those use the array,
> so their debugging or accounting accuracy should be unaffected.
>
> Signed-off-by: Vlastimil Babka <vbabka@...e.cz>
> ---
> include/linux/slab.h | 34 ++
> mm/slab.h | 2 +
> mm/slab_common.c | 5 +-
> mm/slub.c | 1029 +++++++++++++++++++++++++++++++++++++++++++++++---
> 4 files changed, 1020 insertions(+), 50 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 7686054dd494cc65def7f58748718e03eb78e481..0e1b25228c77140d05b5b4433c9d7923de36ec05 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -453,12 +489,19 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
> */
> static nodemask_t slab_nodes;
>
> -#ifndef CONFIG_SLUB_TINY
> /*
> * Workqueue used for flush_cpu_slab().
> */
> static struct workqueue_struct *flushwq;
> -#endif
> +
> +struct slub_flush_work {
> + struct work_struct work;
> + struct kmem_cache *s;
> + bool skip;
> +};
> +
> +static DEFINE_MUTEX(flush_lock);
> +static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
>
> /********************************************************************
> * Core slab cache functions
> @@ -2410,6 +2453,358 @@ static void *setup_object(struct kmem_cache *s, void *object)
> return object;
> }
> +/*
> + * Bulk free objects to the percpu sheaves.
> + * Unlike free_to_pcs() this includes the calls to all necessary hooks
> + * and the fallback to freeing to slab pages.
> + */
> +static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
> +{
[...snip...]
> +next_batch:
> + if (!localtry_trylock(&s->cpu_sheaves->lock))
> + goto fallback;
> +
> + pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> + if (unlikely(pcs->main->size == s->sheaf_capacity)) {
> +
> + struct slab_sheaf *empty;
> +
> + if (!pcs->spare) {
> + empty = barn_get_empty_sheaf(pcs->barn);
> + if (empty) {
> + pcs->spare = pcs->main;
> + pcs->main = empty;
> + goto do_free;
> + }
> + goto no_empty;
Maybe a silly question, but if neither alloc_from_pcs_bulk() nor
free_to_pcs_bulk() allocates empty sheaves (they only sometimes put empty or
full sheaves into the barn), shouldn't we usually expect the barn to contain
no sheaves when only the bulk interfaces are used?
> + }
> +
> + if (pcs->spare->size < s->sheaf_capacity) {
> + stat(s, SHEAF_SWAP);
> + swap(pcs->main, pcs->spare);
> + goto do_free;
> + }
> +
> + empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> +
> + if (!IS_ERR(empty)) {
> + pcs->main = empty;
> + goto do_free;
> + }
> +
> +no_empty:
> + localtry_unlock(&s->cpu_sheaves->lock);
> +
> + /*
> + * if we depleted all empty sheaves in the barn or there are too
> + * many full sheaves, free the rest to slab pages
> + */
> +fallback:
> + __kmem_cache_free_bulk(s, size, p);
> + return;
> + }
> +
> +do_free:
> + main = pcs->main;
> + batch = min(size, s->sheaf_capacity - main->size);
> +
> + memcpy(main->objects + main->size, p, batch * sizeof(void *));
> + main->size += batch;
> +
> + localtry_unlock(&s->cpu_sheaves->lock);
> +
> + stat_add(s, FREE_PCS, batch);
> +
> + if (batch < size) {
> + p += batch;
> + size -= batch;
> + goto next_batch;
> + }
> +}
> +
> #ifndef CONFIG_SLUB_TINY
> /*
> * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
> @@ -5309,8 +6145,8 @@ static inline int calculate_order(unsigned int size)
> return -ENOSYS;
> }
>
> -static void
> -init_kmem_cache_node(struct kmem_cache_node *n)
> +static bool
> +init_kmem_cache_node(struct kmem_cache_node *n, struct node_barn *barn)
> {
Why is the return type bool, when it always succeeds?
> n->nr_partial = 0;
> spin_lock_init(&n->list_lock);
> @@ -5320,6 +6156,11 @@ init_kmem_cache_node(struct kmem_cache_node *n)
> atomic_long_set(&n->total_objects, 0);
> INIT_LIST_HEAD(&n->full);
> #endif
> + n->barn = barn;
> + if (barn)
> + barn_init(barn);
> +
> + return true;
> }
>
> #ifndef CONFIG_SLUB_TINY
> @@ -5385,7 +6250,7 @@ static void early_kmem_cache_node_alloc(int node)
> slab->freelist = get_freepointer(kmem_cache_node, n);
> slab->inuse = 1;
> kmem_cache_node->node[node] = n;
> - init_kmem_cache_node(n);
> + init_kmem_cache_node(n, NULL);
> inc_slabs_node(kmem_cache_node, node, slab->objects);
>
> /*
> @@ -5421,20 +6295,27 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
>
> for_each_node_mask(node, slab_nodes) {
> struct kmem_cache_node *n;
> + struct node_barn *barn = NULL;
>
> if (slab_state == DOWN) {
> early_kmem_cache_node_alloc(node);
> continue;
> }
> +
> + if (s->cpu_sheaves) {
> + barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
> +
> + if (!barn)
> + return 0;
> + }
> +
> n = kmem_cache_alloc_node(kmem_cache_node,
> GFP_KERNEL, node);
> -
> - if (!n) {
> - free_kmem_cache_nodes(s);
> + if (!n)
> return 0;
> - }
Looks like it's leaking the barn
if the allocation of kmem_cache_node fails?
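Maybe something along these lines (untested):

        n = kmem_cache_alloc_node(kmem_cache_node,
                                  GFP_KERNEL, node);
        if (!n) {
                kfree(barn);
                return 0;
        }
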
> - init_kmem_cache_node(n);
> + init_kmem_cache_node(n, barn);
> +
> s->node[node] = n;
> }
> return 1;
> @@ -6005,12 +6891,24 @@ static int slab_mem_going_online_callback(void *arg)
> */
> mutex_lock(&slab_mutex);
> list_for_each_entry(s, &slab_caches, list) {
> + struct node_barn *barn = NULL;
> +
> /*
> * The structure may already exist if the node was previously
> * onlined and offlined.
> */
> if (get_node(s, nid))
> continue;
> +
> + if (s->cpu_sheaves) {
> + barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, nid);
> +
> + if (!barn) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + }
> +
Ditto.
Otherwise looks good to me :)
> /*
> * XXX: kmem_cache_alloc_node will fallback to other nodes
> * since memory is not yet available from the node that
> @@ -6021,7 +6919,9 @@ static int slab_mem_going_online_callback(void *arg)
> ret = -ENOMEM;
> goto out;
> }
> - init_kmem_cache_node(n);
> +
> + init_kmem_cache_node(n, barn);
> +
> s->node[nid] = n;
> }
> /*
--
Cheers,
Harry (formerly known as Hyeonggon)