Message-ID: <d0adc7d14da0f21909eef68acf19fc5706a4b1af.camel@mediatek.com>
Date: Wed, 30 Oct 2024 03:55:56 +0000
From: Cheng-Jui Wang (王正睿)
<Cheng-Jui.Wang@...iatek.com>
To: "paulmck@...nel.org" <paulmck@...nel.org>
CC: "sumit.garg@...aro.org" <sumit.garg@...aro.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"dianders@...omium.org" <dianders@...omium.org>, "rostedt@...dmis.org"
<rostedt@...dmis.org>, "frederic@...nel.org" <frederic@...nel.org>,
wsd_upstream <wsd_upstream@...iatek.com>,
Bobule Chang (張弘義) <bobule.chang@...iatek.com>,
"mark.rutland@....com" <mark.rutland@....com>, "kernel-team@...a.com"
<kernel-team@...a.com>, "joel@...lfernandes.org" <joel@...lfernandes.org>,
"rcu@...r.kernel.org" <rcu@...r.kernel.org>
Subject: Re: [PATCH v3 rcu 3/3] rcu: Finer-grained grace-period-end checks in
rcu_dump_cpu_stacks()
On Tue, 2024-10-29 at 09:29 -0700, Paul E. McKenney wrote:
>
> On Tue, Oct 29, 2024 at 02:20:51AM +0000, Cheng-Jui Wang (王正睿) wrote:
> > On Mon, 2024-10-28 at 17:22 -0700, Paul E. McKenney wrote:
> > > The result is that the current leaf rcu_node structure's ->lock is
> > > acquired only if a stack backtrace might be needed from the current CPU,
> > > and is held across only that CPU's backtrace. As a result, if there are
> >
> > After upgrading our device to kernel-6.11, we encountered a lockup
> > scenario during an RCU stall warning.
> > I had prepared a patch to submit, but I noticed that this series
> > already addresses some of the issues, though it hasn't been merged
> > into mainline yet. So, I decided to reply to this series to discuss
> > how to fix it before pushing my patch. Here is the lockup scenario we
> > encountered:
> >
> > Device: arm64 with only 8 cores
> > One CPU holds rnp->lock in rcu_dump_cpu_stacks() while trying to dump
> > the other CPUs, but it has to wait for each target CPU to dump its
> > backtrace, with a 10-second timeout.
> >
> > __delay()
> > __const_udelay()
> > nmi_trigger_cpumask_backtrace()
> > arch_trigger_cpumask_backtrace()
> > trigger_single_cpu_backtrace()
> > dump_cpu_task()
> > rcu_dump_cpu_stacks() <- holding rnp->lock
> > print_other_cpu_stall()
> > check_cpu_stall()
> > rcu_pending()
> > rcu_sched_clock_irq()
> > update_process_times()
> >
> > However, the other 7 CPUs are waiting for rnp->lock on the path to
> > report a quiescent state:
> >
> > queued_spin_lock_slowpath()
> > queued_spin_lock()
> > do_raw_spin_lock()
> > __raw_spin_lock_irqsave()
> > _raw_spin_lock_irqsave()
> > rcu_report_qs_rdp()
> > rcu_check_quiescent_state()
> > rcu_core()
> > rcu_core_si()
> > handle_softirqs()
> > __do_softirq()
> > ____do_softirq()
> > call_on_irq_stack()
> >
> > Since the arm64 architecture uses a regular IPI instead of a true NMI
> > to implement arch_trigger_cpumask_backtrace(), the interrupt disabling
> > done by spin_lock_irqsave() is enough to block that IPI request.
> > Therefore, if the other CPUs start waiting for the lock before
> > receiving the IPI, a semi-deadlock scenario like the following occurs:
> >
> > CPU0                       CPU1                       CPU2
> > -----                      -----                      -----
> > lock_irqsave(rnp->lock)
> >                            lock_irqsave(rnp->lock)
> >                            <can't receive IPI>
> > <send ipi to CPU 1>
> > <wait CPU 1 for 10s>
> >                                                       lock_irqsave(rnp->lock)
> >                                                       <can't receive IPI>
> > <send ipi to CPU 2>
> > <wait CPU 2 for 10s>
> > ...
> >
> >
> > In our scenario, with 7 CPUs to dump, the lockup lasts nearly 70
> > seconds, preventing subsequent useful logs from being printed and
> > leading to a watchdog timeout and system reboot.
> >
> > This series re-acquires the lock around each dump, significantly
> > reducing the lock-hold time. However, since the lock is still held
> > while dumping a CPU's backtrace, there is still a chance for two CPUs
> > to wait for each other for 10 seconds, which is still too long.
> > So, I would like to ask: is it necessary to dump the backtrace within
> > the spinlock section?
> > If not, especially now that lockless checks are possible, maybe it can
> > be changed as follows?
> >
> > -		if (!(data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)))
> > -			continue;
> > -		raw_spin_lock_irqsave_rcu_node(rnp, flags);
> > -		if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) {
> > +		if (data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)) {
> >  			if (cpu_is_offline(cpu))
> >  				pr_err("Offline CPU %d blocking current GP.\n", cpu);
> >  			else
> >  				dump_cpu_task(cpu);
> >  		}
> >  	}
> > -	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> >
> > Or should this be considered an arm64 issue, meaning arm64 should
> > switch to a true NMI, and otherwise shouldn't use
> > nmi_trigger_cpumask_backtrace()?
>
> Thank you for looking into this!
>
> We do assume that nmi_trigger_cpumask_backtrace() uses true NMIs, so,
> yes, nmi_trigger_cpumask_backtrace() should use true NMIs, just like
> the name says. ;-)
In the comments of the following patch, the arm64 maintainers have
differing views on the use of nmi_trigger_cpumask_backtrace(). I'm a
bit confused and unsure which perspective is more reasonable.
https://lore.kernel.org/all/20230906090246.v13.4.Ie6c132b96ebbbcddbf6954b9469ed40a6960343c@changeid/
> /*
>  * NOTE: though nmi_trigger_cpumask_backtrace() has "nmi_" in the name,
>  * nothing about it truly needs to be implemented using an NMI, it's
>  * just that it's _allowed_ to work with NMIs. If ipi_should_be_nmi()
>  * returned false our backtrace attempt will just use a regular IPI.
>  */
> Alternatively, arm64 could continue using nmi_trigger_cpumask_backtrace()
> with normal interrupts (for example, on SoCs not implementing true NMIs),
> but have a short timeout (maybe a few jiffies?) after which it returns
> false (and presumably also cancels the backtrace request so that when
> the non-NMI interrupt eventually does happen, its handler simply returns
> without backtracing). This should be implemented using atomics to avoid
> deadlock issues. This alternative approach would provide accurate arm64
> backtraces in the common case where interrupts are enabled, but allow
> a graceful fallback to remote tracing otherwise.
>
> Would you be interested in working on this issue, whatever solution the
> arm64 maintainers end up preferring?
>
The 10-second timeout is hard-coded in nmi_trigger_cpumask_backtrace().
It is shared code, not architecture-specific, so I haven't yet thought
of a feasible solution. I have also CC'd the authors of the
aforementioned patch to see if they have any other ideas.
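
For reference, the timeout is the polling loop in
nmi_trigger_cpumask_backtrace() in lib/nmi_backtrace.c, which looks
roughly like the following (paraphrased from memory; details may differ
between kernel versions):

	/* Poll up to 10 seconds (10 * 1000 * 1ms) for remote backtraces to complete. */
	for (i = 0; i < 10 * 1000; i++) {
		if (cpumask_empty(to_cpumask(backtrace_mask)))
			break;
		mdelay(1);
		touch_softlockup_watchdog();
	}
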
Regarding the RCU stall warning, I think the purpose of acquiring
`rnp->lock` is to protect the rnp->qsmask variable rather than to
protect the `dump_cpu_task()` operation, right?
Therefore, there is no need to call dump_cpu_task() while holding the
lock.
While holding the spinlock, we can record the CPUs that need to be
dumped in a cpumask, and then dump them all at once after releasing
the lock.
Here is the temporary solution I am using locally, based on kernel-6.11:
+	cpumask_var_t mask;
+	bool mask_ok;
+	mask_ok = zalloc_cpumask_var(&mask, GFP_ATOMIC);
 	rcu_for_each_leaf_node(rnp) {
 		raw_spin_lock_irqsave_rcu_node(rnp, flags);
 		for_each_leaf_node_possible_cpu(rnp, cpu)
 			if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) {
 				if (cpu_is_offline(cpu))
 					pr_err("Offline CPU %d blocking current GP.\n", cpu);
+				else if (mask_ok)
+					cpumask_set_cpu(cpu, mask);
 				else
 					dump_cpu_task(cpu);
 			}
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 	}
+	if (mask_ok) {
+		if (!trigger_cpumask_backtrace(mask)) {
+			for_each_cpu(cpu, mask)
+				dump_cpu_task(cpu);
+		}
+		free_cpumask_var(mask);
+	}
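
Note that the zalloc_cpumask_var(GFP_ATOMIC) allocation can fail, in
which case mask_ok stays false and the code keeps the original behavior
of calling dump_cpu_task() under the lock. Likewise, if
trigger_cpumask_backtrace() cannot trigger the backtraces, the CPUs
recorded in the mask are dumped one by one after the lock has been
released.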
After applying this, I haven't encountered the lockup issue for five
days, whereas it used to occur about once a day.
> Thanx, Paul