Message-ID: <Z-5HWApFjrOr7Q8_@harry>
Date: Thu, 3 Apr 2025 17:31:20 +0900
From: Harry Yoo <harry.yoo@...cle.com>
To: Vlastimil Babka <vbabka@...e.cz>
Cc: Suren Baghdasaryan <surenb@...gle.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Christoph Lameter <cl@...ux.com>, David Rientjes <rientjes@...gle.com>,
Roman Gushchin <roman.gushchin@...ux.dev>,
Uladzislau Rezki <urezki@...il.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, rcu@...r.kernel.org,
maple-tree@...ts.infradead.org
Subject: Re: [PATCH RFC v3 2/8] slab: add opt-in caching layer of percpu
sheaves
On Mon, Mar 17, 2025 at 03:33:03PM +0100, Vlastimil Babka wrote:
> Specifying a non-zero value for a new struct kmem_cache_args field
> sheaf_capacity will set up a caching layer of percpu arrays called
> sheaves of the given capacity for the created cache.
>
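(Just to check my understanding of the interface: opting in would look
roughly like the sketch below, assuming the kmem_cache_args setup described
above; the cache name, object type and capacity here are made up.)

        struct kmem_cache_args args = {
                .sheaf_capacity = 32,   /* hypothetical capacity */
        };
        struct kmem_cache *s;

        s = kmem_cache_create("foo_cache", sizeof(struct foo), &args, 0);
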
> Allocations from the cache will allocate via the percpu sheaves (main or
> spare) as long as they have no NUMA node preference. Frees will also
> refill one of the sheaves.
>
> When both percpu sheaves are found empty during an allocation, an empty
> sheaf may be replaced with a full one from the per-node barn. If none
> are available and the allocation is allowed to block, an empty sheaf is
> refilled from slab(s) by an internal bulk alloc operation. When both
> percpu sheaves are full during freeing, the barn can replace a full one
> with an empty one, unless over a full sheaves limit. In that case a
> sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
> sheaves and barns is also wired to the existing cpu flushing and cache
> shrinking operations.
>
> The sheaves do not distinguish NUMA locality of the cached objects. If
> an allocation is requested with kmem_cache_alloc_node() with a specific
> node (not NUMA_NO_NODE), sheaves are bypassed.
>
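IIUC that means the node argument alone decides whether the sheaves are
even consulted, e.g. (sketch):

        /* no node preference, may be served from the percpu sheaves */
        obj = kmem_cache_alloc(s, GFP_KERNEL);
        obj = kmem_cache_alloc_node(s, GFP_KERNEL, NUMA_NO_NODE);

        /* explicit node requested, the sheaves are bypassed */
        obj = kmem_cache_alloc_node(s, GFP_KERNEL, 1);
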
> The bulk operations exposed to slab users also try to utilize the
> sheaves as long as the necessary (full or empty) sheaves are available
> on the cpu or in the barn. Once depleted, they will fall back to bulk
> alloc/free to slabs directly to avoid double copying.
>
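IIUC the bulk API is unchanged for callers; it just prefers the sheaves and
falls back to the slab bulk paths once they are depleted, e.g. (sketch):

        void *objs[16];
        int n;

        n = kmem_cache_alloc_bulk(s, GFP_KERNEL, ARRAY_SIZE(objs), objs);
        if (n)
                kmem_cache_free_bulk(s, n, objs);
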
> Sysfs stat counters alloc_cpu_sheaf and free_cpu_sheaf count objects
> allocated or freed using the sheaves. Counters sheaf_refill,
> sheaf_flush_main and sheaf_flush_other count objects filled or flushed
> from or to slab pages, and can be used to assess how effective the
> caching is. The refill and flush operations will also count towards the
> usual alloc_fastpath/slowpath, free_fastpath/slowpath and other
> counters.
>
> Access to the percpu sheaves is protected by localtry_trylock() when
> potential callers include irq context, and localtry_lock() otherwise
> (such as when we already know the gfp flags allow blocking). The trylock
> failures should be rare and we can easily fall back. Each per-NUMA-node
> barn has a spin_lock.
>
> A current limitation is that when slub_debug is enabled for a cache with
> percpu sheaves, the objects in the array are considered as allocated from
> the slub_debug perspective, and the alloc/free debugging hooks occur
> when moving the objects between the array and slab pages. This means
> that e.g. a use-after-free that occurs for an object cached in the
> array is undetected. Collected alloc/free stacktraces might also be less
> useful. This limitation could be changed in the future.
>
> On the other hand, KASAN, kmemcg and other hooks are executed on actual
> allocations and frees by kmem_cache users even if those use the array,
> so their debugging or accounting accuracy should be unaffected.
>
> Signed-off-by: Vlastimil Babka <vbabka@...e.cz>
> ---
> include/linux/slab.h | 34 ++
> mm/slab.h | 2 +
> mm/slab_common.c | 5 +-
> mm/slub.c | 1029 +++++++++++++++++++++++++++++++++++++++++++++++---
> 4 files changed, 1020 insertions(+), 50 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 7686054dd494cc65def7f58748718e03eb78e481..0e1b25228c77140d05b5b4433c9d7923de36ec05 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -453,12 +489,19 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
> */
> static nodemask_t slab_nodes;
>
> -#ifndef CONFIG_SLUB_TINY
> /*
> * Workqueue used for flush_cpu_slab().
> */
> static struct workqueue_struct *flushwq;
> -#endif
> +
> +struct slub_flush_work {
> + struct work_struct work;
> + struct kmem_cache *s;
> + bool skip;
> +};
> +
> +static DEFINE_MUTEX(flush_lock);
> +static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
>
> /********************************************************************
> * Core slab cache functions
> @@ -2410,6 +2453,358 @@ static void *setup_object(struct kmem_cache *s, void *object)
> return object;
> }
> +/*
> + * Bulk free objects to the percpu sheaves.
> + * Unlike free_to_pcs() this includes the calls to all necessary hooks
> + * and the fallback to freeing to slab pages.
> + */
> +static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
> +{
[...snip...]
> +next_batch:
> + if (!localtry_trylock(&s->cpu_sheaves->lock))
> + goto fallback;
> +
> + pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> + if (unlikely(pcs->main->size == s->sheaf_capacity)) {
> +
> + struct slab_sheaf *empty;
> +
> + if (!pcs->spare) {
> + empty = barn_get_empty_sheaf(pcs->barn);
> + if (empty) {
> + pcs->spare = pcs->main;
> + pcs->main = empty;
> + goto do_free;
> + }
> + goto no_empty;
Maybe a silly question, but if neither alloc_from_pcs_bulk() nor
free_to_pcs_bulk() allocates empty sheaves (they only sometimes put empty or
full sheaves into the barn), shouldn't we usually expect the barn to contain
no sheaves when only the bulk interfaces are used?
> + }
> +
> + if (pcs->spare->size < s->sheaf_capacity) {
> + stat(s, SHEAF_SWAP);
> + swap(pcs->main, pcs->spare);
> + goto do_free;
> + }
> +
> + empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> +
> + if (!IS_ERR(empty)) {
> + pcs->main = empty;
> + goto do_free;
> + }
> +
> +no_empty:
> + localtry_unlock(&s->cpu_sheaves->lock);
> +
> + /*
> + * if we depleted all empty sheaves in the barn or there are too
> + * many full sheaves, free the rest to slab pages
> + */
> +fallback:
> + __kmem_cache_free_bulk(s, size, p);
> + return;
> + }
> +
> +do_free:
> + main = pcs->main;
> + batch = min(size, s->sheaf_capacity - main->size);
> +
> + memcpy(main->objects + main->size, p, batch * sizeof(void *));
> + main->size += batch;
> +
> + localtry_unlock(&s->cpu_sheaves->lock);
> +
> + stat_add(s, FREE_PCS, batch);
> +
> + if (batch < size) {
> + p += batch;
> + size -= batch;
> + goto next_batch;
> + }
> +}
> +
> #ifndef CONFIG_SLUB_TINY
> /*
> * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
> @@ -5309,8 +6145,8 @@ static inline int calculate_order(unsigned int size)
> return -ENOSYS;
> }
>
> -static void
> -init_kmem_cache_node(struct kmem_cache_node *n)
> +static bool
> +init_kmem_cache_node(struct kmem_cache_node *n, struct node_barn *barn)
> {
Why is the return type bool, when it always succeeds?
> n->nr_partial = 0;
> spin_lock_init(&n->list_lock);
> @@ -5320,6 +6156,11 @@ init_kmem_cache_node(struct kmem_cache_node *n)
> atomic_long_set(&n->total_objects, 0);
> INIT_LIST_HEAD(&n->full);
> #endif
> + n->barn = barn;
> + if (barn)
> + barn_init(barn);
> +
> + return true;
> }
>
> #ifndef CONFIG_SLUB_TINY
> @@ -5385,7 +6250,7 @@ static void early_kmem_cache_node_alloc(int node)
> slab->freelist = get_freepointer(kmem_cache_node, n);
> slab->inuse = 1;
> kmem_cache_node->node[node] = n;
> - init_kmem_cache_node(n);
> + init_kmem_cache_node(n, NULL);
> inc_slabs_node(kmem_cache_node, node, slab->objects);
>
> /*
> @@ -5421,20 +6295,27 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
>
> for_each_node_mask(node, slab_nodes) {
> struct kmem_cache_node *n;
> + struct node_barn *barn = NULL;
>
> if (slab_state == DOWN) {
> early_kmem_cache_node_alloc(node);
> continue;
> }
> +
> + if (s->cpu_sheaves) {
> + barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
> +
> + if (!barn)
> + return 0;
> + }
> +
> n = kmem_cache_alloc_node(kmem_cache_node,
> GFP_KERNEL, node);
> -
> - if (!n) {
> - free_kmem_cache_nodes(s);
> + if (!n)
> return 0;
> - }
Looks like it's leaking the barn
if the allocation of kmem_cache_node fails?
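Maybe something along these lines (untested):

        n = kmem_cache_alloc_node(kmem_cache_node,
                                  GFP_KERNEL, node);
        if (!n) {
                kfree(barn);
                return 0;
        }
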
> - init_kmem_cache_node(n);
> + init_kmem_cache_node(n, barn);
> +
> s->node[node] = n;
> }
> return 1;
> @@ -6005,12 +6891,24 @@ static int slab_mem_going_online_callback(void *arg)
> */
> mutex_lock(&slab_mutex);
> list_for_each_entry(s, &slab_caches, list) {
> + struct node_barn *barn = NULL;
> +
> /*
> * The structure may already exist if the node was previously
> * onlined and offlined.
> */
> if (get_node(s, nid))
> continue;
> +
> + if (s->cpu_sheaves) {
> + barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, nid);
> +
> + if (!barn) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + }
> +
Ditto.
Otherwise looks good to me :)
> /*
> * XXX: kmem_cache_alloc_node will fallback to other nodes
> * since memory is not yet available from the node that
> @@ -6021,7 +6919,9 @@ static int slab_mem_going_online_callback(void *arg)
> ret = -ENOMEM;
> goto out;
> }
> - init_kmem_cache_node(n);
> +
> + init_kmem_cache_node(n, barn);
> +
> s->node[nid] = n;
> }
> /*
--
Cheers,
Harry (formerly known as Hyeonggon)