Message-ID: <6f8274da-a010-4bb3-b3d6-690481b5ace0@suse.cz>
Date: Mon, 8 Sep 2025 14:45:11 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: Uladzislau Rezki <urezki@...il.com>
Cc: Suren Baghdasaryan <surenb@...gle.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Christoph Lameter <cl@...two.org>, David Rientjes <rientjes@...gle.com>,
Roman Gushchin <roman.gushchin@...ux.dev>, Harry Yoo <harry.yoo@...cle.com>,
Sidhartha Kumar <sidhartha.kumar@...cle.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, rcu@...r.kernel.org,
maple-tree@...ts.infradead.org
Subject: Re: [PATCH v7 04/21] slab: add sheaf support for batching kfree_rcu()
operations
On 9/8/25 13:59, Uladzislau Rezki wrote:
> On Wed, Sep 03, 2025 at 02:59:46PM +0200, Vlastimil Babka wrote:
>> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
>> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
>> addition to main and spare sheaves.
>>
>> kfree_rcu() operations will try to put objects on this sheaf. Once full,
>> the sheaf is detached and submitted to call_rcu() with a handler that
>> will try to put it in the barn, or flush to slab pages using bulk free,
>> when the barn is full. Then a new empty sheaf must be obtained to put
>> more objects there.
>>
>> It's possible that no free sheaves are available to use for a new
>> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
>> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
>> kfree_rcu() implementation.
>>
>> Expected advantages:
>> - batching the kfree_rcu() operations, that could eventually replace the
>> existing batching
>> - sheaves can be reused for allocations via barn instead of being
>> flushed to slabs, which is more efficient
>> - this includes cases where only some cpus are allowed to process rcu
>> callbacks (Android)
>>
>> Possible disadvantage:
>> - objects might be waiting for more than their grace period (it is
>> determined by the last object freed into the sheaf), increasing memory
>> usage - but the existing batching does that too.
>>
>> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
>> implementation favors smaller memory footprint over performance.
>>
>> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
>> count how many kfree_rcu() used the rcu_free sheaf successfully and how
>> many had to fall back to the existing implementation.
>>
>> Reviewed-by: Harry Yoo <harry.yoo@...cle.com>
>> Reviewed-by: Suren Baghdasaryan <surenb@...gle.com>
>> Signed-off-by: Vlastimil Babka <vbabka@...e.cz>
>> ---
>> mm/slab.h | 2 +
>> mm/slab_common.c | 24 +++++++
>> mm/slub.c | 192 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>> 3 files changed, 216 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/slab.h b/mm/slab.h
>> index 206987ce44a4d053ebe3b5e50784d2dd23822cd1..f1866f2d9b211bb0d7f24644b80ef4b50a7c3d24 100644
>> --- a/mm/slab.h
>> +++ b/mm/slab.h
>> @@ -435,6 +435,8 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
>>          return !(s->flags & (SLAB_CACHE_DMA|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT));
>>  }
>>
>> +bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj);
>> +
>>  #define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
>>                           SLAB_CACHE_DMA32 | SLAB_PANIC | \
>>                           SLAB_TYPESAFE_BY_RCU | SLAB_DEBUG_OBJECTS | \
>> diff --git a/mm/slab_common.c b/mm/slab_common.c
>> index e2b197e47866c30acdbd1fee4159f262a751c5a7..2d806e02568532a1000fd3912db6978e945dcfa8 100644
>> --- a/mm/slab_common.c
>> +++ b/mm/slab_common.c
>> @@ -1608,6 +1608,27 @@ static void kfree_rcu_work(struct work_struct *work)
>>          kvfree_rcu_list(head);
>>  }
>>
>> +static bool kfree_rcu_sheaf(void *obj)
>> +{
>> +        struct kmem_cache *s;
>> +        struct folio *folio;
>> +        struct slab *slab;
>> +
>> +        if (is_vmalloc_addr(obj))
>> +                return false;
>> +
>> +        folio = virt_to_folio(obj);
>> +        if (unlikely(!folio_test_slab(folio)))
>> +                return false;
>> +
>> +        slab = folio_slab(folio);
>> +        s = slab->slab_cache;
>> +        if (s->cpu_sheaves)
>> +                return __kfree_rcu_sheaf(s, obj);
>> +
>> +        return false;
>> +}
>> +
>> static bool
>> need_offload_krc(struct kfree_rcu_cpu *krcp)
>> {
>> @@ -1952,6 +1973,9 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
>>          if (!head)
>>                  might_sleep();
>>
>> +        if (kfree_rcu_sheaf(ptr))
>> +                return;
>> +
> Uh.. I have some concerns about this.
>
> This patch introduces a new path which collides with the existing
> kvfree_rcu() logic. It implements some batching which we already
> have.
Yes, but for caches with sheaves it's better to recycle the whole sheaf (as
described), which is different enough from the existing batching scheme that
I'm not sure there's a sensible way to combine them.
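To illustrate: once the rcu_free sheaf is full, the whole sheaf is handed to
call_rcu(), and the handler conceptually does something like this (a
simplified sketch, not the literal code from the mm/slub.c hunk; the helper
names here are made up):

static void rcu_free_sheaf(struct rcu_head *head)
{
        struct slab_sheaf *sheaf = container_of(head, struct slab_sheaf, rcu_head);

        /* the grace period has elapsed for every object in the sheaf */
        if (put_full_sheaf_into_barn(sheaf))
                return;         /* sheaf kept whole, ready to serve allocations */

        /* barn is full: flush the objects to slab pages using bulk free */
        flush_sheaf_objects(sheaf);
        recycle_empty_sheaf(sheaf);
}

So after the grace period the sheaf either goes back into circulation as a
unit, or its objects are bulk-freed, rather than objects trickling back to
the slab pages one by one.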
> - kvfree_rcu_barrier() does not know about the "sheaf" path. Am I missing
> something? How do you guarantee that kvfree_rcu_barrier() flushes the
> sheaves? If it is part of kvfree_rcu(), it has to care about this.
Hm, good point, thanks. I've taken care of flushing the kfree_rcu() sheaves in
kmem_cache_destroy(), but forgot that kvfree_rcu_barrier() can also be used
outside of that - we currently have one user in codetag_unload_module().
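Something along these lines should do, I think (an untested sketch only;
flush_all_rcu_sheaves() is just a placeholder name):

void kvfree_rcu_barrier(void)
{
        /*
         * New part: for every cache with sheaves, detach the (possibly
         * partially filled) percpu rcu_free sheaves and hand them to
         * call_rcu(), so no objects stay parked on the sheaves.
         */
        flush_all_rcu_sheaves();

        /* wait for the in-flight rcu_free sheaf callbacks to finish */
        rcu_barrier();

        /* ... plus the existing draining of the kfree_rcu_cpu batches ... */
}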
> - we do not allocate in the kvfree_rcu() path because of PREEMPT_RT, i.e.
> kvfree_rcu() is supposed to be callable from non-sleeping contexts.
Hm, I could not find where that distinction is made in the code, can you give
me a hint please? In __kfree_rcu_sheaf() I only have a GFP_NOWAIT attempt.
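That is, when no rcu_free sheaf is currently attached, the attempt is roughly
the following (simplified, not the literal code; the helpers are the ones from
the earlier sheaf patches, or close to them):

        if (unlikely(!pcs->rcu_free)) {
                empty = barn_get_empty_sheaf(pcs->barn);

                if (!empty)
                        empty = alloc_empty_sheaf(s, GFP_NOWAIT);

                if (!empty)
                        return false;   /* fall back to the existing batching */

                pcs->rcu_free = empty;
        }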
> - call_rcu() can be slow, therefore we do not use it in the kvfree_rcu().
If call_rcu() is only called once per 32 kfree_rcu() calls, i.e. whenever they
fill up the rcu sheaf, is it still too slow?
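To make the arithmetic explicit: call_rcu() is only invoked when the rcu_free
sheaf fills up, roughly (simplified sketch; field names follow the sheaves
series):

        rcu_sheaf->objects[rcu_sheaf->size++] = obj;

        /* one call_rcu() per sheaf_capacity kfree_rcu() calls */
        if (rcu_sheaf->size == s->sheaf_capacity) {
                pcs->rcu_free = NULL;
                call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
        }

So the call_rcu() rate is 1/sheaf_capacity of the kfree_rcu() rate.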
> IMO, it is worth reusing the existing logic in kvfree_rcu(). I can help
> with it when I have more cycles, as part of my RCU work.
It would be most welcome! I'd suggest we proceed with this for now, after I
fix up kvfree_rcu_barrier(), and attempt to consolidate later.
> --
> Uladzislau Rezki