Message-ID: <936afbad-226f-4209-b347-c87b30b4a7f3@joelfernandes.org>
Date: Tue, 6 Jan 2026 10:08:51 -0500
From: Joel Fernandes <joel@...lfernandes.org>
To: Joel Fernandes <joel@...lfernandes.org>, paulmck@...nel.org
Cc: linux-kernel@...r.kernel.org, Frederic Weisbecker <frederic@...nel.org>,
 Neeraj Upadhyay <neeraj.upadhyay@...nel.org>,
 Josh Triplett <josh@...htriplett.org>, Boqun Feng <boqun.feng@...il.com>,
 Steven Rostedt <rostedt@...dmis.org>,
 Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
 Lai Jiangshan <jiangshanlai@...il.com>, Zqiang <qiang.zhang@...ux.dev>,
 Uladzislau Rezki <urezki@...il.com>, rcu@...r.kernel.org,
 Joel Fernandes <joelagnelf@...dia.com>
Subject: Re: [PATCH RFC 00/14] rcu: Reduce rnp->lock contention with per-CPU
 blocked task lists



On 1/5/2026 7:55 PM, Joel Fernandes wrote:
>> Also if so, would the following rather simpler patch do the same trick,
>> if accompanied by CONFIG_RCU_FANOUT_LEAF=1?
>>
>> ------------------------------------------------------------------------
>>
>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
>> index 6a319e2926589..04dbee983b37d 100644
>> --- a/kernel/rcu/Kconfig
>> +++ b/kernel/rcu/Kconfig
>> @@ -198,9 +198,9 @@ config RCU_FANOUT
>>  
>>  config RCU_FANOUT_LEAF
>>  	int "Tree-based hierarchical RCU leaf-level fanout value"
>> -	range 2 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
>> -	range 2 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
>> -	range 2 3 if RCU_STRICT_GRACE_PERIOD
>> +	range 1 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
>> +	range 1 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
>> +	range 1 3 if RCU_STRICT_GRACE_PERIOD
>>  	depends on TREE_RCU && RCU_EXPERT
>>  	default 16 if !RCU_STRICT_GRACE_PERIOD
>>  	default 2 if RCU_STRICT_GRACE_PERIOD
>>
>> ------------------------------------------------------------------------
>>
>> This passes a quick 20-minute rcutorture smoke test.  Does it provide
>> similar performance benefits?
>
> I tried this out, and it also brings down the contention and solves the problem
> I saw (in testing so far).
> 
> Would this also work if the test had grace-period init/cleanup racing
> with preempted RCU read-side critical sections? I'm running longer tests
> now to see how this performs under GP stress versus my solution. I am
> also seeing a dramatic throughput drop after some amount of time with
> just the node lists (no per-CPU lists), which I can't explain. I do not
> see this with the per-CPU list solution (I'm currently testing whether
> the same throughput drop shows up with the fanout solution you proposed).
> 
> I'm also wondering whether relying on the user to set FANOUT_LEAF to 1
> is reasonable, considering it is not the default. Are you suggesting
> defaulting to this for small systems? If not, then I guess the
> optimization will not be enabled by default. Eventually, if we move
> forward with this approach, I will remove the config option for the
> per-CPU blocked list altogether so that it is enabled by default. That's
> roughly my plan if we agree on this, but it is still just at the RFC
> stage 🙂.
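
For reference, exercising the suggested setup means building with something
like this in .config (illustrative; RCU_FANOUT_LEAF is only visible under
RCU_EXPERT):

CONFIG_RCU_EXPERT=y
CONFIG_RCU_FANOUT_LEAF=1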

So the fanout solution works great when there are grace periods in progress: I
see no throughput drop, and consistent performance with read-side critical
sections. However, if we switch to having no grace periods continuously in
progress, I see the throughput drop quite a bit (-30%). I can't explain that,
but I do not see that issue with per-CPU lists.

With the per-CPU list scheme, blocking does not involve the node at all, as
long as there is no grace period in progress. In that sense, the per-CPU
blocked list is completely detached from RCU - it is a bit like lazy RCU, in
the sense that, instead of a callback, it is the blocking task that sits in a
per-CPU list, relieving RCU of the burden.
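
To make that concrete, here is a rough sketch of the idea (illustrative
only, not the actual patch from this series; names like rcu_percpu_blkd
and rcu_preempt_enqueue_blocked() are made up):

------------------------------------------------------------------------

#include <linux/list.h>
#include <linux/percpu.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

/* One blocked-task list per CPU; lock/list init at boot omitted. */
struct rcu_percpu_blkd {
	raw_spinlock_t lock;		/* protects blkd_tasks */
	struct list_head blkd_tasks;	/* readers preempted on this CPU */
};
static DEFINE_PER_CPU(struct rcu_percpu_blkd, rcu_percpu_blkd);

/*
 * Hypothetical hook, called from the scheduler (with interrupts
 * disabled) when a task is preempted inside an RCU read-side critical
 * section.  The common case takes only this CPU's lock, never
 * rnp->lock.
 */
static void rcu_preempt_enqueue_blocked(struct task_struct *t)
{
	struct rcu_percpu_blkd *pb = this_cpu_ptr(&rcu_percpu_blkd);

	raw_spin_lock(&pb->lock);
	/* rcu_node_entry is the existing PREEMPT_RCU list node. */
	list_add(&t->rcu_node_entry, &pb->blkd_tasks);
	raw_spin_unlock(&pb->lock);
	/*
	 * Only grace-period start/cleanup would need to walk these
	 * per-CPU lists and fold the blocked readers into the rcu_node
	 * tree; with no GP in progress, RCU is not involved at all.
	 */
}

------------------------------------------------------------------------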

Maybe the extra layer of the node tree (with fanout == 1) somehow adds
overhead that does not exist with per-CPU lists? Even with this throughput
drop, though, it still does better than the baseline with a common RCU node.
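
As a rough illustration of where that overhead could come from (my
back-of-the-envelope numbers, assuming the usual tree geometry): on a 64-CPU
system, RCU_FANOUT_LEAF=1 gives 64 single-CPU leaf rcu_node structures plus a
root, so a preempted reader still takes its (now private) leaf rnp->lock and
grace-period cleanup still scans every leaf, whereas the per-CPU list takes
no node lock at all in the no-GP case.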

Based on this, I would say per-cpu blocked list is still worth doing. Thoughts?

 - Joel


