Message-ID: <aBAmi38oWka6ckjk@harry>
Date: Tue, 29 Apr 2025 10:08:27 +0900
From: Harry Yoo <harry.yoo@...cle.com>
To: Vlastimil Babka <vbabka@...e.cz>
Cc: Suren Baghdasaryan <surenb@...gle.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Christoph Lameter <cl@...ux.com>, David Rientjes <rientjes@...gle.com>,
Roman Gushchin <roman.gushchin@...ux.dev>,
Uladzislau Rezki <urezki@...il.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, rcu@...r.kernel.org,
maple-tree@...ts.infradead.org
Subject: Re: [PATCH v4 1/9] slab: add opt-in caching layer of percpu sheaves
On Fri, Apr 25, 2025 at 10:27:21AM +0200, Vlastimil Babka wrote:
> Specifying a non-zero value for a new struct kmem_cache_args field
> sheaf_capacity will setup a caching layer of percpu arrays called
> sheaves of given capacity for the created cache.
>
> Allocations from the cache will allocate via the percpu sheaves (main or
> spare) as long as they have no NUMA node preference. Frees will also
> put the object back into one of the sheaves.
>
> When both percpu sheaves are found empty during an allocation, an empty
> sheaf may be replaced with a full one from the per-node barn. If none
> are available and the allocation is allowed to block, an empty sheaf is
> refilled from slab(s) by an internal bulk alloc operation. When both
> percpu sheaves are full during freeing, the barn can replace a full one
> with an empty one, unless the limit on full sheaves is exceeded. In that
> case a sheaf is flushed to slab(s) by an internal bulk free operation.
> Flushing
> sheaves and barns is also wired to the existing cpu flushing and cache
> shrinking operations.
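
Just to confirm I follow the allocation flow described above, here is a
stand-alone toy model of the decision order (plain C, all names invented,
not the patch's code):

#include <stdbool.h>
#include <stdio.h>

/* Toy state: which sources currently have objects / what the caller allows. */
struct toy_state {
	bool wants_specific_node;	/* e.g. kmem_cache_alloc_node() with a real node */
	bool main_has_objects;
	bool spare_has_objects;
	bool barn_has_full_sheaf;
	bool gfp_allows_blocking;
};

static const char *toy_alloc_path(const struct toy_state *s)
{
	if (s->wants_specific_node)
		return "bypass the sheaves, regular slab allocation";
	if (s->main_has_objects)
		return "pop an object from the main sheaf";
	if (s->spare_has_objects)
		return "swap main and spare sheaves, pop from the new main";
	if (s->barn_has_full_sheaf)
		return "trade the empty sheaf for a full one from the barn, pop";
	if (s->gfp_allows_blocking)
		return "bulk-refill the empty sheaf from slabs, pop";
	return "fall back to a regular single-object slab allocation";
}

int main(void)
{
	struct toy_state s = { .main_has_objects = true };

	printf("%s\n", toy_alloc_path(&s));
	return 0;
}
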
>
> The sheaves do not distinguish NUMA locality of the cached objects. If
> an allocation is requested with kmem_cache_alloc_node() (or a mempolicy
> with strict_numa mode enabled) with a specific node (not NUMA_NO_NODE),
> the sheaves are bypassed.
>
> The bulk operations exposed to slab users also try to utilize the
> sheaves as long as the necessary (full or empty) sheaves are available
> on the cpu or in the barn. Once depleted, they will fall back to bulk
> alloc/free to slabs directly to avoid double copying.
>
> The sheaf_capacity value is exported in sysfs for observability.
>
> Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf
> count objects allocated or freed using the sheaves (and thus not
> counting towards the other alloc/free path counters). Counters
> sheaf_refill and sheaf_flush count objects filled or flushed from or to
> slab pages, and can be used to assess how effective the caching is. The
> refill and flush operations will also count towards the usual
> alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for
> the backing slabs. For barn operations, barn_get and barn_put count how
> many full sheaves were taken from or put to the barn; the _fail variants
> count how many such requests could not be satisfied, mainly because the
> barn was either empty or full. While the barn also holds empty sheaves
> to make some operations easier, these are not considered critical enough
> to warrant their own counters. Finally, there are sheaf_alloc/sheaf_free
> counters.
I initially thought we would need counters for empty sheaves, to see how many
times empty sheaves are grabbed from the barn, but it looks like barn_put
("put full sheaves to the barn") is effectively a proxy for that, right?
> Access to the percpu sheaves is protected by local_trylock() when
> potential callers include irq context, and local_lock() otherwise (such
> as when we already know the gfp flags allow blocking). The trylock
> failures should be rare and we can easily fall back. Each per-NUMA-node
> barn has a spin_lock.
>
> When slub_debug is enabled for a cache with sheaf_capacity also
> specified, the latter is ignored so that allocations and frees reach the
> slow path where debugging hooks are processed.
>
> Signed-off-by: Vlastimil Babka <vbabka@...e.cz>
> ---
Reviewed-by: Harry Yoo <harry.yoo@...cle.com>
LGTM, with a few nits:
> include/linux/slab.h | 31 ++
> mm/slab.h | 2 +
> mm/slab_common.c | 5 +-
> mm/slub.c | 1053 +++++++++++++++++++++++++++++++++++++++++++++++---
> 4 files changed, 1044 insertions(+), 47 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index d5a8ab98035cf3e3d9043e3b038e1bebeff05b52..4cb495d55fc58c70a992ee4782d7990ce1c55dc6 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -335,6 +335,37 @@ struct kmem_cache_args {
> * %NULL means no constructor.
> */
> void (*ctor)(void *);
> /**
> * @sheaf_capacity: Enable sheaves of given capacity for the cache.
> *
> * With a non-zero value, allocations from the cache go through caching
> * arrays called sheaves. Each cpu has a main sheaf that's always
> * present, and a spare sheaf that may not be present. When both become
> * empty, there's an attempt to replace an empty sheaf with a full sheaf
> * from the per-node barn.
> *
> * When no full sheaf is available, and gfp flags allow blocking, a
> * sheaf is allocated and filled from slab(s) using bulk allocation.
> * Otherwise the allocation falls back to the normal operation
> * allocating a single object from a slab.
> *
> * Analogously, when freeing and both percpu sheaves are full, the barn
> * may replace a full one with an empty sheaf, unless the barn is over
> * capacity. In
> * that case a sheaf is bulk freed to slab pages.
> *
> * The sheaves do not enforce NUMA placement of objects, so allocations
> * via kmem_cache_alloc_node() with a node specified other than
> * NUMA_NO_NODE will bypass them.
> *
> * Bulk allocation and free operations also try to use the cpu sheaves
> * and barn, but fall back to using slab pages directly.
> *
> * When slub_debug is enabled for the cache, the sheaf_capacity argument
> * is ignored.
> *
> * %0 means no sheaves will be created
nit: created -> created. (with a full stop)
> */
> unsigned int sheaf_capacity;
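
As a usage sketch for readers following along (the cache name, object type
and capacity below are made up, and I'm assuming the kmem_cache_args-based
kmem_cache_create() variant):

#include <linux/slab.h>

/* all names below are invented for illustration */
struct my_object {
	unsigned long payload[4];
};

static struct kmem_cache *my_cache;

static int my_cache_init(void)
{
	struct kmem_cache_args args = {
		.sheaf_capacity = 32,	/* percpu sheaves of 32 objects each */
	};

	my_cache = kmem_cache_create("my_object", sizeof(struct my_object),
				     &args, SLAB_HWCACHE_ALIGN);
	return my_cache ? 0 : -ENOMEM;
}
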
> diff --git a/mm/slub.c b/mm/slub.c
> index dc9e729e1d269b5d362cb5bc44f824640ffd00f3..ae3e80ad9926ca15601eef2f2aa016ca059498f8 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> +static void pcs_destroy(struct kmem_cache *s)
> +{
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + struct slub_percpu_sheaves *pcs;
> +
> + pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> +
> + /* can happen when unwinding failed create */
> + if (!pcs->main)
> + continue;
> +
> + /*
> + * We have already passed __kmem_cache_shutdown() so everything
> + * was flushed and there should be no objects allocated from
> + * slabs, otherwise kmem_cache_destroy() would have aborted.
> + * Therefore something would have to be really wrong if the
> + * warnings here trigger, and we should rather leave bojects and
nit: bojects -> objects
> + * sheaves to leak in that case.
> + */
> +
> + WARN_ON(pcs->spare);
> +
> + if (!WARN_ON(pcs->main->size)) {
> + free_empty_sheaf(s, pcs->main);
> + pcs->main = NULL;
> + }
> + }
> +
> + free_percpu(s->cpu_sheaves);
> + s->cpu_sheaves = NULL;
> +}
> +
> +/*
> + * If a empty sheaf is available, return it and put the supplied full one to
nit: a empty sheaf -> an empty sheaf
> + * barn. But if there are too many full sheaves, reject this with -E2BIG.
> + */
>
> +static struct slab_sheaf *
> +barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
> @@ -4567,6 +5169,234 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> discard_slab(s, slab);
> }
>
> +/*
> + * Free an object to the percpu sheaves.
> + * The object is expected to have passed slab_free_hook() already.
> + */
> +static __fastpath_inline
> +bool free_to_pcs(struct kmem_cache *s, void *object)
> +{
> + struct slub_percpu_sheaves *pcs;
> +
> +restart:
> + if (!local_trylock(&s->cpu_sheaves->lock))
> + return false;
> +
> + pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> + if (unlikely(pcs->main->size == s->sheaf_capacity)) {
> +
> + struct slab_sheaf *empty;
> +
> + if (!pcs->spare) {
> + empty = barn_get_empty_sheaf(pcs->barn);
> + if (empty) {
> + pcs->spare = pcs->main;
> + pcs->main = empty;
> + goto do_free;
> + }
> + goto alloc_empty;
> + }
> +
> + if (pcs->spare->size < s->sheaf_capacity) {
> + swap(pcs->main, pcs->spare);
> + goto do_free;
> + }
> +
> + empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> +
> + if (!IS_ERR(empty)) {
> + stat(s, BARN_PUT);
> + pcs->main = empty;
> + goto do_free;
> + }
nit: shouldn't stat(s, BARN_PUT_FAIL); go right here instead, so the failed
put is counted for any error and not only for -E2BIG?
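Roughly (untested, only to show the placement I mean):

	if (!IS_ERR(empty)) {
		stat(s, BARN_PUT);
		pcs->main = empty;
		goto do_free;
	}

	/* count the failed put whatever the error, not only -E2BIG */
	stat(s, BARN_PUT_FAIL);
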
> +
> + if (PTR_ERR(empty) == -E2BIG) {
> + /* Since we got here, spare exists and is full */
> + struct slab_sheaf *to_flush = pcs->spare;
> +
> + stat(s, BARN_PUT_FAIL);
> +
> + pcs->spare = NULL;
> + local_unlock(&s->cpu_sheaves->lock);
> +
> + sheaf_flush_unused(s, to_flush);
> + empty = to_flush;
> + goto got_empty;
> + }
> @@ -6455,6 +7374,16 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
>
> set_cpu_partial(s);
>
> + if (args->sheaf_capacity && !(s->flags & SLAB_DEBUG_FLAGS)) {
> + s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
nit: you probably want to disable sheaves for CONFIG_SLUB_TINY=y too?
(see the sketch below, after the hunk)
> + if (!s->cpu_sheaves) {
> + err = -ENOMEM;
> + goto out;
> + }
> + // TODO: increase capacity to grow slab_sheaf up to next kmalloc size?
> + s->sheaf_capacity = args->sheaf_capacity;
> + }
> +
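For the CONFIG_SLUB_TINY nit above, I mean roughly this (untested;
IS_ENABLED() just as an illustration, an #ifdef would do as well):

	if (args->sheaf_capacity && !IS_ENABLED(CONFIG_SLUB_TINY) &&
	    !(s->flags & SLAB_DEBUG_FLAGS)) {
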
--
Cheers,
Harry / Hyeonggon