Message-ID: <adb044e2-8f62-4367-9a22-30515f5647b1@paulmck-laptop>
Date: Tue, 29 Oct 2024 09:29:13 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Cheng-Jui Wang (王正睿) <Cheng-Jui.Wang@...iatek.com>
Cc: "frederic@...nel.org" <frederic@...nel.org>,
	"rcu@...r.kernel.org" <rcu@...r.kernel.org>,
	wsd_upstream <wsd_upstream@...iatek.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"kernel-team@...a.com" <kernel-team@...a.com>,
	Bobule Chang (張弘義) <bobule.chang@...iatek.com>,
	"rostedt@...dmis.org" <rostedt@...dmis.org>,
	"joel@...lfernandes.org" <joel@...lfernandes.org>
Subject: Re: [PATCH v3 rcu 3/3] rcu: Finer-grained grace-period-end checks in
 rcu_dump_cpu_stacks()

On Tue, Oct 29, 2024 at 02:20:51AM +0000, Cheng-Jui Wang (王正睿) wrote:
> On Mon, 2024-10-28 at 17:22 -0700, Paul E. McKenney wrote:
> > The result is that the current leaf rcu_node structure's ->lock is
> > acquired only if a stack backtrace might be needed from the current CPU,
> > and is held across only that CPU's backtrace. As a result, if there are
> 
> After upgrading our device to kernel 6.11, we encountered a lockup
> during an RCU stall warning.
> I had prepared a patch to submit, but I noticed that this series
> already addresses some of the issues, though it has not been merged
> into mainline yet. So I decided to reply to this series to discuss
> how to fix the problem before pushing my patch. Here is the lockup
> scenario we encountered:
> 
> Device: an arm64 system with only 8 cores.
> One CPU holds rnp->lock in rcu_dump_cpu_stacks() while trying to dump
> the other CPUs, but it must wait for each corresponding CPU to dump
> its backtrace, with a 10-second timeout per CPU.
> 
>    __delay()
>    __const_udelay()
>    nmi_trigger_cpumask_backtrace()
>    arch_trigger_cpumask_backtrace()
>    trigger_single_cpu_backtrace()
>    dump_cpu_task()
>    rcu_dump_cpu_stacks()  <- holding rnp->lock
>    print_other_cpu_stall()
>    check_cpu_stall()
>    rcu_pending()
>    rcu_sched_clock_irq()
>    update_process_times()
> 
> However, the other 7 CPUs are waiting for rnp->lock on the path to
> report their quiescent states:
> 
>    queued_spin_lock_slowpath()
>    queued_spin_lock()
>    do_raw_spin_lock()
>    __raw_spin_lock_irqsave()
>    _raw_spin_lock_irqsave()
>    rcu_report_qs_rdp()
>    rcu_check_quiescent_state()
>    rcu_core()
>    rcu_core_si()
>    handle_softirqs()
>    __do_softirq()
>    ____do_softirq()
>    call_on_irq_stack()
> 
> Since arm64 uses an ordinary IPI rather than a true NMI to implement
> arch_trigger_cpumask_backtrace(), spin_lock_irqsave() disables
> interrupts, which is enough to block this IPI request.
> Therefore, if the other CPUs start waiting for the lock before
> receiving the IPI, a semi-deadlock like the following occurs (a short
> sketch follows the diagram):
> 
> CPU0                    CPU1                    CPU2
> -----                   -----                   -----
> lock_irqsave(rnp->lock)
>                         lock_irqsave(rnp->lock)
>                         <can't receive IPI>
> <send ipi to CPU 1>
> <wait CPU 1 for 10s>
>                                                 lock_irqsave(rnp->lock)
>                                                 <can't receive IPI>
> <send ipi to CPU 2>
> <wait CPU 2 for 10s>
> ...
> 
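> To make the failure concrete, here is a minimal sketch (illustrative
> only; the two function names are hypothetical, and the bodies are
> compressed from the traces above):
> 
> /* CPU1..CPU7: raw_spin_lock_irqsave() masks local interrupts before
>  * spinning on the contended lock, so the maskable backtrace IPI sent
>  * by CPU0 stays pending for as long as this CPU waits here.
>  */
> static void qs_reporting_cpu(struct rcu_node *rnp)
> {
> 	unsigned long flags;
> 
> 	raw_spin_lock_irqsave(&rnp->lock, flags);	/* spins, IRQs off */
> 	/* ... report the quiescent state ... */
> 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
> }
> 
> /* CPU0: holds the same lock while waiting up to 10 seconds per target
>  * CPU for a backtrace that cannot be delivered, because every target
>  * is already spinning above with interrupts masked.
>  */
> static void stack_dumping_cpu(struct rcu_node *rnp, int cpu)
> {
> 	unsigned long flags;
> 
> 	raw_spin_lock_irqsave(&rnp->lock, flags);
> 	dump_cpu_task(cpu);		/* sends the IPI, then times out */
> 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
> }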
> 
> In our scenario, with 7 CPUs to dump, the lockup lasts nearly 70
> seconds, preventing subsequent useful logs from being printed and
> leading to a watchdog timeout and system reboot.
> 
> This series of changes re-acquires the lock for each CPU's dump,
> significantly reducing the lock-holding time. However, since it still
> holds the lock while dumping a CPU's backtrace, two CPUs can still
> wait on each other for up to 10 seconds, which is still too long.
> So I would like to ask: is it necessary to dump the backtrace within
> the spinlock section at all?
> If not, especially now that the lockless check is possible, could it
> be changed as follows?
> 
> -			if (!(data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)))
> -				continue;
> -			raw_spin_lock_irqsave_rcu_node(rnp, flags);
> -			if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) {
> +			if (data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)) {
> 				if (cpu_is_offline(cpu))
> 					pr_err("Offline CPU %d blocking current GP.\n", cpu);
> 				else
> 					dump_cpu_task(cpu);
> 			}
> -			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> 
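> For reference, the loop body would then read as follows (a sketch,
> assuming the surrounding for_each_leaf_node_possible_cpu() loop in
> rcu_dump_cpu_stacks()):
> 
> 	/* Lockless check: rnp->lock is no longer held across the dump. */
> 	if (data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)) {
> 		if (cpu_is_offline(cpu))
> 			pr_err("Offline CPU %d blocking current GP.\n", cpu);
> 		else
> 			dump_cpu_task(cpu);
> 	}
> 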
> Or should this be considered an arm64 issue? That is, should arm64
> switch to true NMIs, and avoid using nmi_trigger_cpumask_backtrace()
> until it does?

Thank you for looking into this!

We do assume that nmi_trigger_cpumask_backtrace() uses true NMIs, so,
yes, nmi_trigger_cpumask_backtrace() should use true NMIs, just like
the name says.  ;-)

Alternatively, arm64 could continue using nmi_trigger_cpumask_backtrace()
with normal interrupts (for example, on SoCs not implementing true NMIs),
but have a short timeout (maybe a few jiffies?) after which it returns
false (and presumably also cancels the backtrace request so that when
the non-NMI interrupt eventually does happen, its handler simply returns
without backtracing).  This should be implemented using atomics to avoid
deadlock issues.  This alternative approach would provide accurate arm64
backtraces in the common case where interrupts are enabled, but allow
a graceful fallback to remote tracing otherwise.
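
A rough sketch of that approach (illustrative only: the flag, the
function names, and the timeout value are all hypothetical, not
existing arm64 code, and the actual IPI send is elided):

#include <linux/atomic.h>
#include <linux/delay.h>
#include <linux/jiffies.h>
#include <linux/printk.h>

static atomic_t backtrace_pending = ATOMIC_INIT(0);	/* hypothetical flag */

/* Requester: send the maskable backtrace IPI, but give up after a few
 * jiffies instead of waiting the full 10 seconds.
 */
static bool try_backtrace_with_timeout(int cpu)
{
	unsigned long deadline = jiffies + 2;		/* "a few jiffies" */

	atomic_set(&backtrace_pending, 1);
	/* ... send the (maskable) backtrace IPI to @cpu here ... */
	while (time_before(jiffies, deadline)) {
		if (!atomic_read(&backtrace_pending))
			return true;	/* handler ran and cleared the flag */
		udelay(10);
	}
	/* Timed out: cancel atomically so a late handler backs off. */
	return atomic_xchg(&backtrace_pending, 0) == 0;
}

/* IPI handler: backtrace only if the request is still pending, that
 * is, only if the requester has not already timed out and cancelled.
 */
static void backtrace_ipi_handler(void)
{
	if (atomic_xchg(&backtrace_pending, 0))
		dump_stack();
}

The atomic_xchg() on each side guarantees that exactly one of them
consumes the request, so a late-arriving interrupt handler simply
returns, and the requester falls back to remote tracing after a few
jiffies rather than blocking for 10 seconds per CPU.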

Would you be interested in working on this issue, whatever solution the
arm64 maintainers end up preferring?

							Thanx, Paul
