Message-ID: <013102c7-a1c8-4f13-8dd9-5803e4d69aaf@nvidia.com>
Date: Thu, 22 Jan 2026 20:29:41 -0500
From: Joel Fernandes <joelagnelf@...dia.com>
To: paulmck@...nel.org
Cc: linux-kernel@...r.kernel.org, Boqun Feng <boqun.feng@...il.com>,
rcu@...r.kernel.org, Frederic Weisbecker <frederic@...nel.org>,
Neeraj Upadhyay <neeraj.upadhyay@...nel.org>,
Josh Triplett <josh@...htriplett.org>, Uladzislau Rezki <urezki@...il.com>,
Steven Rostedt <rostedt@...dmis.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Lai Jiangshan <jiangshanlai@...il.com>, Zqiang <qiang.zhang@...ux.dev>
Subject: Re: [PATCH -next v3 2/3] rcu/nocb: Remove dead callback overload
handling
On Jan 22, 2026, at 7:18 PM, Paul E. McKenney <paulmck@...nel.org> wrote:
> On Thu, Jan 22, 2026 at 06:43:31PM -0500, Joel Fernandes wrote:
>> On Thu, Jan 22, 2026 at 01:55:11PM -0800, Paul E. McKenney wrote:
>>> On Mon, Jan 19, 2026 at 06:12:22PM -0500, Joel Fernandes wrote:
>>>> - } else if (len > rdp->qlen_last_fqs_check + qhimark) {
>>>> - /* ... or if many callbacks queued. */
>>>> - rdp->qlen_last_fqs_check = len;
>>>> - j = jiffies;
>>>> - if (j != rdp->nocb_gp_adv_time &&
>>>> - rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) &&
>>> This places in cur_gp_seq not the grace period for the current callback
>>> (which would be unlikely to have finished), but rather the grace period
>>> for the oldest callback that has not yet been marked as done. And that
>>> callback started some time ago, and thus might well have finished.
>>> So while this code might not have been executed in your tests, it is
>>> definitely not a logical contradiction.
>>> Or am I missing something subtle here?
>>
>> You're right that it's not a logical contradiction - I was imprecise.
>> rcu_segcblist_nextgp() returns the GP for the oldest pending callback,
>> which could indeed have completed.
>>
>> However, the question becomes: under what scenario do we need to advance
>> here? If that GP completed, rcuog should have already advanced those
>> callbacks. The only way this code path can execute is if rcuog is starved
>> and not running to advance them, right?
>
> That is one way. The other way is if the RCU grace-period kthread gets
> delayed (perhaps by vCPU preemption) between the time that it updates the
> leaf rcu_node structure's ->gp_seq field and the time that it invokes
> rcu_nocb_gp_cleanup().
I see the window you're describing. In rcu_gp_cleanup(), for each leaf node:
WRITE_ONCE(rnp->gp_seq, new_gp_seq);  /* GP appears complete */
...
raw_spin_unlock_irq_rcu_node(rnp);
/* vCPU preemption */
rcu_nocb_gp_cleanup(sq);  /* wakes rcuog */
So yes, in this window, the call_rcu() CPU could see the updated gp_seq
and have rcu_seq_done() return true for the now-completed GP.
However, even in this window, advancing callbacks doesn't help:
1. We advance callbacks from WAIT to DONE state
2. But rcuog is still sleeping, waiting for GP kthread to wake it
3. rcuoc is still sleeping, waiting for rcuog to wake it
4. Callbacks sit in DONE state but nobody invokes them
So the critical path is unchanged:
swake_up_all() -> rcuog -> rcuoc -> invoke.
I guess this is the redundancy argument - the window exists, but
exploiting it provides no meaningful benefit AFAICS.
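To make the redundancy concrete, here is a tiny user-space model (not kernel
code, and every identifier in it is made up for illustration) of the point
above: the advance step only re-tags callbacks as DONE, and nothing is
invoked until a separate step, standing in for rcuoc's rcu_do_batch(), walks
the DONE callbacks:

#include <stdbool.h>
#include <stdio.h>

enum seg { SEG_WAIT, SEG_DONE };

struct cb {
        void (*func)(void);
        unsigned long gp_seq;   /* grace period this callback waits on */
        enum seg seg;
};

static bool gp_done(unsigned long completed, unsigned long gp_seq)
{
        return completed >= gp_seq;
}

/* The "advance" step: WAIT -> DONE once the GP completed; invokes nothing. */
static void advance_cbs(struct cb *cbs, int n, unsigned long completed)
{
        for (int i = 0; i < n; i++)
                if (cbs[i].seg == SEG_WAIT && gp_done(completed, cbs[i].gp_seq))
                        cbs[i].seg = SEG_DONE;
}

/* The "rcuoc" step: only this actually invokes DONE callbacks. */
static void invoke_done_cbs(struct cb *cbs, int n)
{
        for (int i = 0; i < n; i++)
                if (cbs[i].seg == SEG_DONE)
                        cbs[i].func();
}

static void my_cb(void) { puts("callback invoked"); }

int main(void)
{
        struct cb cbs[] = { { my_cb, 2, SEG_WAIT } };

        advance_cbs(cbs, 1, 2);         /* GP 2 done: callback is now DONE... */
        invoke_done_cbs(cbs, 1);        /* ...but only this step invokes it.  */
        return 0;
}

So with rcuog/rcuoc not running, advancing from the call_rcu() path only
changes which segment the callbacks sit in, not when they are invoked.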
>
>> But as Frederic pointed out, even if rcuog is starved, advancing here
>> doesn't help - rcuog must still run anyway to wake the callback thread.
>> We're just duplicating work it will do when it finally gets to run.
>
> So maybe we don't want that first patch after all? ;-)
Do you mean we want the first patch so that it can remove the code that
we don't want?
>
>> The extensive testing (300K callback floods, hours of rcutorture) showing
>> zero hits confirms this window is practically unreachable. I can update the
>> commit message to remove the "logical contradiction" claim and focus on the
>> redundancy argument instead.
>
> That would definitely be good!
Thanks. I will focus on this argument, then. I will resend with a better
patch description in the morning.
>
>> Would that address your concern?
>
> Your point about the rcuoc kthread needing to be awakened is a good one.
> I am still concerned about flooding on busy systems, especially if the
> busy component is an underlying hypervisor, but we might need a more
> principled approach for that situation.
Hmm, true. There is also the case where any of the kthreads on the callback's
critical path could be preempted by the hypervisor, which would be just as
problematic, to your point about needing a more principled approach. I guess
we did not want the reader-side vCPU-preemption workarounds either, for
similar reasons.
One trick I found, irrespective of virtualization, is that rcu_nocb_poll can
result in grace periods completing faster. I think this could help overload
situations by retiring callbacks sooner rather than later. I can experiment
with this idea in the future; I was considering a dynamic trigger to enable
polling mode under overload. I guess there is only one way to find out how
well this will work, but initial testing does look promising. :-D
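For reference, polling today is a static boot-time choice; a minimal example
(the CPU list below is made up) would be booting with something like:

    rcu_nocbs=0-7 rcu_nocb_poll

so that the rcuog kthreads poll for callbacks instead of waiting to be woken,
whereas the idea above would be to switch into this mode dynamically only
when overload is detected.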
--
Joel Fernandes