Message-ID: <adb044e2-8f62-4367-9a22-30515f5647b1@paulmck-laptop>
Date: Tue, 29 Oct 2024 09:29:13 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Cheng-Jui Wang (王正睿) <Cheng-Jui.Wang@...iatek.com>
Cc: "frederic@...nel.org" <frederic@...nel.org>,
	"rcu@...r.kernel.org" <rcu@...r.kernel.org>,
	wsd_upstream <wsd_upstream@...iatek.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"kernel-team@...a.com" <kernel-team@...a.com>,
	Bobule Chang (張弘義) <bobule.chang@...iatek.com>,
	"rostedt@...dmis.org" <rostedt@...dmis.org>,
	"joel@...lfernandes.org" <joel@...lfernandes.org>
Subject: Re: [PATCH v3 rcu 3/3] rcu: Finer-grained grace-period-end checks in
 rcu_dump_cpu_stacks()

On Tue, Oct 29, 2024 at 02:20:51AM +0000, Cheng-Jui Wang (王正睿) wrote:
> On Mon, 2024-10-28 at 17:22 -0700, Paul E. McKenney wrote:
> > The result is that the current leaf rcu_node structure's ->lock is
> > acquired only if a stack backtrace might be needed from the current CPU,
> > and is held across only that CPU's backtrace. As a result, if there are
> 
> After upgrading our device to kernel 6.11, we encountered a lockup
> during an RCU stall warning.
> I had prepared a patch to submit, but I noticed that this series
> already addresses some of the issues, though it has not been merged
> into mainline yet. So I decided to reply to this series to discuss
> how to fix the problem before pushing my patch. Here is the lockup
> scenario we encountered:
> 
> Device: an arm64 system with only 8 cores.
> One CPU holds rnp->lock in rcu_dump_cpu_stacks() while trying to dump
> the other CPUs, but it must wait for each corresponding CPU to dump
> its backtrace, with a 10-second timeout per CPU.
> 
>    __delay()
>    __const_udelay()
>    nmi_trigger_cpumask_backtrace()
>    arch_trigger_cpumask_backtrace()
>    trigger_single_cpu_backtrace()
>    dump_cpu_task()
>    rcu_dump_cpu_stacks()  <- holding rnp->lock
>    print_other_cpu_stall()
>    check_cpu_stall()
>    rcu_pending()
>    rcu_sched_clock_irq()
>    update_process_times()
> 
> However, the other 7 CPUs are waiting for rnp->lock on the path to
> report their quiescent states:
> 
>    queued_spin_lock_slowpath()
>    queued_spin_lock()
>    do_raw_spin_lock()
>    __raw_spin_lock_irqsave()
>    _raw_spin_lock_irqsave()
>    rcu_report_qs_rdp()
>    rcu_check_quiescent_state()
>    rcu_core()
>    rcu_core_si()
>    handle_softirqs()
>    __do_softirq()
>    ____do_softirq()
>    call_on_irq_stack()
> 
> Since arm64 uses an ordinary IPI rather than a true NMI to implement
> arch_trigger_cpumask_backtrace(), spin_lock_irqsave() disables
> interrupts, which is enough to block this IPI request.
> Therefore, if the other CPUs start waiting for the lock before
> receiving the IPI, a semi-deadlock like the following occurs (a short
> sketch follows the diagram):
> 
> CPU0                    CPU1                    CPU2
> -----                   -----                   -----
> lock_irqsave(rnp->lock)
>                         lock_irqsave(rnp->lock)
>                         <can't receive IPI>
> <send ipi to CPU 1>
> <wait CPU 1 for 10s>
>                                                 lock_irqsave(rnp->lock)
>                                                 <can't receive IPI>
> <send ipi to CPU 2>
> <wait CPU 2 for 10s>
> ...
> 
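> To make the failure concrete, here is a minimal sketch (illustrative
> only; the two function names are hypothetical, and the bodies are
> compressed from the traces above):
> 
> /* CPU1..CPU7: raw_spin_lock_irqsave() masks local interrupts before
>  * spinning on the contended lock, so the maskable backtrace IPI sent
>  * by CPU0 stays pending for as long as this CPU waits here.
>  */
> static void qs_reporting_cpu(struct rcu_node *rnp)
> {
> 	unsigned long flags;
> 
> 	raw_spin_lock_irqsave(&rnp->lock, flags);	/* spins, IRQs off */
> 	/* ... report the quiescent state ... */
> 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
> }
> 
> /* CPU0: holds the same lock while waiting up to 10 seconds per target
>  * CPU for a backtrace that cannot be delivered, because every target
>  * is already spinning above with interrupts masked.
>  */
> static void stack_dumping_cpu(struct rcu_node *rnp, int cpu)
> {
> 	unsigned long flags;
> 
> 	raw_spin_lock_irqsave(&rnp->lock, flags);
> 	dump_cpu_task(cpu);		/* sends the IPI, then times out */
> 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
> }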
> 
> In our scenario, with 7 CPUs to dump, the lockup lasts nearly 70
> seconds, preventing subsequent useful logs from being printed and
> leading to a watchdog timeout and system reboot.
> 
> This series of changes re-acquires the lock for each CPU's dump,
> significantly reducing the lock-holding time. However, since it still
> holds the lock while dumping a CPU's backtrace, two CPUs can still
> wait on each other for up to 10 seconds, which is still too long.
> So I would like to ask: is it necessary to dump the backtrace within
> the spinlock section at all?
> If not, especially now that the lockless check is possible, could it
> be changed as follows?
> 
> -			if (!(data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)))
> -				continue;
> -			raw_spin_lock_irqsave_rcu_node(rnp, flags);
> -			if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) {
> +			if (data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)) {
> 				if (cpu_is_offline(cpu))
> 					pr_err("Offline CPU %d blocking current GP.\n", cpu);
> 				else
> 					dump_cpu_task(cpu);
> 			}
> -			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> 
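> For reference, the loop body would then read as follows (a sketch,
> assuming the surrounding for_each_leaf_node_possible_cpu() loop in
> rcu_dump_cpu_stacks()):
> 
> 	/* Lockless check: rnp->lock is no longer held across the dump. */
> 	if (data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)) {
> 		if (cpu_is_offline(cpu))
> 			pr_err("Offline CPU %d blocking current GP.\n", cpu);
> 		else
> 			dump_cpu_task(cpu);
> 	}
> 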
> Or should this be considered an arm64 issue? That is, should arm64
> switch to true NMIs, and avoid using nmi_trigger_cpumask_backtrace()
> until it does?

Thank you for looking into this!

We do assume that nmi_trigger_cpumask_backtrace() uses true NMIs, so,
yes, nmi_trigger_cpumask_backtrace() should use true NMIs, just like
the name says.  ;-)

Alternatively, arm64 could continue using nmi_trigger_cpumask_backtrace()
with normal interrupts (for example, on SoCs not implementing true NMIs),
but have a short timeout (maybe a few jiffies?) after which it returns
false (and presumably also cancels the backtrace request so that when
the non-NMI interrupt eventually does happen, its handler simply returns
without backtracing).  This should be implemented using atomics to avoid
deadlock issues.  This alternative approach would provide accurate arm64
backtraces in the common case where interrupts are enabled, but allow
a graceful fallback to remote tracing otherwise.
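
A rough sketch of that approach (illustrative only: the flag, the
function names, and the timeout value are all hypothetical, not
existing arm64 code, and the actual IPI send is elided):

#include <linux/atomic.h>
#include <linux/delay.h>
#include <linux/jiffies.h>
#include <linux/printk.h>

static atomic_t backtrace_pending = ATOMIC_INIT(0);	/* hypothetical flag */

/* Requester: send the maskable backtrace IPI, but give up after a few
 * jiffies instead of waiting the full 10 seconds.
 */
static bool try_backtrace_with_timeout(int cpu)
{
	unsigned long deadline = jiffies + 2;		/* "a few jiffies" */

	atomic_set(&backtrace_pending, 1);
	/* ... send the (maskable) backtrace IPI to @cpu here ... */
	while (time_before(jiffies, deadline)) {
		if (!atomic_read(&backtrace_pending))
			return true;	/* handler ran and cleared the flag */
		udelay(10);
	}
	/* Timed out: cancel atomically so a late handler backs off. */
	return atomic_xchg(&backtrace_pending, 0) == 0;
}

/* IPI handler: backtrace only if the request is still pending, that
 * is, only if the requester has not already timed out and cancelled.
 */
static void backtrace_ipi_handler(void)
{
	if (atomic_xchg(&backtrace_pending, 0))
		dump_stack();
}

The atomic_xchg() on each side guarantees that exactly one of them
consumes the request, so a late-arriving interrupt handler simply
returns, and the requester falls back to remote tracing after a few
jiffies rather than blocking for 10 seconds per CPU.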

Would you be interested in working on this issue, whatever solution the
arm64 maintainers end up preferring?

							Thanx, Paul
