Message-ID: <CAJuCfpFTMQD6oyR_Q1ds7XL4Km7h2mmzSv4z7f5fFnQ14=+g_A@mail.gmail.com>
Date: Thu, 27 Nov 2025 11:29:10 -0800
From: Suren Baghdasaryan <surenb@...gle.com>
To: Daniel Gomez <da.gomez@...nel.org>
Cc: Vlastimil Babka <vbabka@...e.cz>, Harry Yoo <harry.yoo@...cle.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>, Christoph Lameter <cl@...two.org>,
David Rientjes <rientjes@...gle.com>, Roman Gushchin <roman.gushchin@...ux.dev>,
Uladzislau Rezki <urezki@...il.com>, Sidhartha Kumar <sidhartha.kumar@...cle.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, rcu@...r.kernel.org,
maple-tree@...ts.infradead.org, linux-modules@...r.kernel.org,
bpf@...r.kernel.org, Luis Chamberlain <mcgrof@...nel.org>, Petr Pavlu <petr.pavlu@...e.com>,
Sami Tolvanen <samitolvanen@...gle.com>, Aaron Tomlin <atomlin@...mlin.com>,
Lucas De Marchi <lucas.demarchi@...el.com>
Subject: Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations

On Thu, Nov 27, 2025 at 6:01 AM Daniel Gomez <da.gomez@...nel.org> wrote:
>
>
>
> On 05/11/2025 12.25, Vlastimil Babka wrote:
> > On 11/3/25 04:17, Harry Yoo wrote:
> >> On Fri, Oct 31, 2025 at 10:32:54PM +0100, Daniel Gomez wrote:
> >>>
> >>>
> >>> On 10/09/2025 10.01, Vlastimil Babka wrote:
> >>>> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> >>>> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> >>>> addition to main and spare sheaves.
> >>>>
> >>>> kfree_rcu() operations will try to put objects on this sheaf. Once full,
> >>>> the sheaf is detached and submitted to call_rcu() with a handler that
> >>>> will try to put it in the barn, or flush to slab pages using bulk free,
> >>>> when the barn is full. Then a new empty sheaf must be obtained to put
> >>>> more objects there.
> >>>>
> >>>> It's possible that no free sheaves are available to use for a new
> >>>> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> >>>> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> >>>> kfree_rcu() implementation.
> >>>>
> >>>> Expected advantages:
> >>>> - batching the kfree_rcu() operations, that could eventually replace the
> >>>> existing batching
> >>>> - sheaves can be reused for allocations via barn instead of being
> >>>> flushed to slabs, which is more efficient
> >>>> - this includes cases where only some cpus are allowed to process rcu
> >>>> callbacks (Android)
> >>>>
> >>>> Possible disadvantage:
> >>>> - objects might be waiting for more than their grace period (it is
> >>>> determined by the last object freed into the sheaf), increasing memory
> >>>> usage - but the existing batching does that too.
> >>>>
> >>>> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> >>>> implementation favors smaller memory footprint over performance.
> >>>>
> >>>> Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
> >>>> contexts where kfree_rcu() is called might not be compatible with taking
> >>>> a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
> >>>> spinlock - the current kfree_rcu() implementation avoids doing that.
> >>>>
> >>>> Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
> >>>> that have them. This is not a cheap operation, but the barrier usage is
> >>>> rare - currently kmem_cache_destroy() or on module unload.
> >>>>
> >>>> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
> >>>> count how many kfree_rcu() used the rcu_free sheaf successfully and how
> >>>> many had to fall back to the existing implementation.
> >>>>
> >>>> Signed-off-by: Vlastimil Babka <vbabka@...e.cz>
> >>>
> >>> Hi Vlastimil,
> >>>
> >>> This patch increases kmod selftest (stress module loader) runtime by about
> >>> 50-60%, from ~200s to ~300s total execution time. My tested kernel has
> >>> CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
> >>> causing this, or how to address it?
> >>
> >> This is likely due to increased kvfree_rcu_barrier() during module unload.
> >
> > Hm so there are actually two possible sources of this. One is that the
> > module creates some kmem_cache and calls kmem_cache_destroy() on it before
> > unloading. That does kvfree_rcu_barrier() which iterates all caches via
> > flush_all_rcu_sheaves(), but in this case it shouldn't need to - we could
> > have a weaker form of kvfree_rcu_barrier() that only guarantees flushing of
> > that single cache.
>
> Thanks for the feedback. And thanks to Jon who has revived this again.
>
> >
> > The other source is codetag_unload_module(), and I'm afraid it's this one as
> > it's hooked to every module unload. Do you have CONFIG_CODE_TAGGING enabled?
>
> Yes, we do have that enabled.

Sorry I missed this discussion before.
IIUC, performance is impacted because kvfree_rcu_barrier() now has to call
flush_all_rcu_sheaves() and is therefore more costly than before.

>
> > Disabling it should help in this case, if you don't need memory allocation
> > profiling for that stress test. I think there's some space for improvement -
> > when compiled in but memalloc profiling never enabled during the uptime,
> > this could probably be skipped? Suren?

I think yes, we should be able to skip kvfree_rcu_barrier() inside
codetag_unload_module() if profiling was never enabled.
kvfree_rcu_barrier() is there to ensure that all potential kfree_rcu()
calls for module allocations have finished before the tags are destroyed.
I'll need to add an additional "sticky" flag that records that profiling
was used, so that we can detect the case where it was enabled and then
disabled before the module is unloaded. I can work on it next week.
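
A minimal user-space sketch of the sticky-flag idea, only to illustrate the
intended logic. It is not the eventual patch: every model_* name and both
flag variables are hypothetical, and model_kvfree_rcu_barrier() is a counter
standing in for the real barrier.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model; none of these names exist in the kernel tree. */
static atomic_bool profiling_enabled;      /* current on/off state */
static atomic_bool profiling_ever_enabled; /* sticky: set once, never cleared */
static int barrier_calls;                  /* how often the stand-in barrier ran */

/* Stand-in for kvfree_rcu_barrier(): only counts invocations. */
static void model_kvfree_rcu_barrier(void)
{
	barrier_calls++;
}

/* Toggle profiling; the sticky flag is set before profiling turns on. */
static void model_set_profiling(bool on)
{
	if (on)
		atomic_store(&profiling_ever_enabled, true);
	atomic_store(&profiling_enabled, on);
}

/* Module unload path: pay for the barrier only if profiling was ever used. */
static void model_codetag_unload_module(void)
{
	if (atomic_load(&profiling_ever_enabled))
		model_kvfree_rcu_barrier();
	/* destroying the module's tags would follow here */
}
```

Checking only the current state would skip the barrier in the
enabled-then-disabled case, which is exactly the case the sticky flag is
meant to catch.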
> >
> >> It currently iterates over every (CPU, slab cache) pair, for caches that
> >> enabled sheaves (there should be only a few now), to make sure the rcu
> >> sheaf is flushed by the time kvfree_rcu_barrier() returns.
> >
> > Yeah, also it's done under slab_mutex. Is the stress test trying to unload
> > multiple modules in parallel? That would make things worse, although I'd
> > expect there's a lot of serialization in this area already.
>
> AFAIK, the kmod stress test does not unload modules in parallel. Module unload
> happens one at a time before each test iteration. However, tests 0008 and 0009
> together run 300 sequential module unloads.
>
> ALL_TESTS="$ALL_TESTS 0008:150:1"
> ALL_TESTS="$ALL_TESTS 0009:150:1"
>
> >
> > Unfortunately it will get worse with sheaves extended to all caches. We
> > could probably mark caches once they allocate their first rcu_free sheaf
> > (should not add visible overhead) and keep skipping those that never did.
> >
> >> Just being curious, do you have any serious workload that depends on
> >> the performance of module unload?
>
> Can we have a combination of a weaker form of kvfree_rcu_barrier() + tracking?
> Happy to test this again if you have a patch or something in mind.
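
For reference, the combination asked about here (a barrier limited to one
cache, plus tracking of caches that ever allocated an rcu_free sheaf) can be
modelled roughly as below. This is a user-space sketch under stated
assumptions, not a proposed patch: struct model_cache, its fields and all
model_* functions are hypothetical, and a counter stands in for the actual
flush.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a slab cache; not the kernel's kmem_cache. */
struct model_cache {
	bool used_rcu_sheaf; /* set when the first rcu_free sheaf is allocated */
	int flushes;         /* how often this cache was actually flushed */
};

/* Mark the cache the first time it obtains an rcu_free sheaf. */
static void model_alloc_rcu_sheaf(struct model_cache *c)
{
	c->used_rcu_sheaf = true;
}

/* Weaker barrier: flush a single cache, and only if it ever used a sheaf. */
static void model_flush_one(struct model_cache *c)
{
	if (!c->used_rcu_sheaf)
		return;
	c->flushes++;
}

/* Full barrier: walk all caches, skip the unmarked ones, return the visit count. */
static size_t model_flush_all(struct model_cache *caches, size_t n)
{
	size_t visited = 0;

	for (size_t i = 0; i < n; i++) {
		if (!caches[i].used_rcu_sheaf)
			continue;
		model_flush_one(&caches[i]);
		visited++;
	}
	return visited;
}
```

In this model kmem_cache_destroy() would map to model_flush_one() on the
cache being destroyed, while the module unload path would still use
model_flush_all() but touch only the marked caches.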
>
> In addition, AFAIK module unloading is similar to eBPF programs. CCing the
> bpf folks in case they have a workload.
>
> But I don't have a particular workload in mind.