Message-ID: <76d7657225b23fa1fd1c501c382b59c35458684a.camel@mediatek.com>
Date: Thu, 30 Oct 2025 08:30:41 +0000
From: Tze-nan Wu (吳澤南) <Tze-nan.Wu@...iatek.com>
To: "dianders@...omium.org" <dianders@...omium.org>, "paulmck@...nel.org"
	<paulmck@...nel.org>, Cheng-Jui Wang (王正睿)
	<Cheng-Jui.Wang@...iatek.com>
CC: Tze-nan Wu (吳澤南) <Tze-nan.Wu@...iatek.com>,
	Bobule Chang (張弘義) <bobule.chang@...iatek.com>,
	"kernel-team@...a.com" <kernel-team@...a.com>, "rostedt@...dmis.org"
	<rostedt@...dmis.org>, wsd_upstream <wsd_upstream@...iatek.com>,
	"frederic@...nel.org" <frederic@...nel.org>, "mark.rutland@....com"
	<mark.rutland@....com>, "sumit.garg@...aro.org" <sumit.garg@...aro.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"joel@...lfernandes.org" <joel@...lfernandes.org>, "rcu@...r.kernel.org"
	<rcu@...r.kernel.org>
Subject: Re: [PATCH v3 rcu 3/3] rcu: Finer-grained grace-period-end checks in
 rcu_dump_cpu_stacks()

On Fri, 2024-11-01 at 06:55 -0700, Paul E. McKenney wrote:
> On Fri, Nov 01, 2024 at 01:44:15AM +0000, Cheng-Jui Wang (王正睿) wrote:
> > On Thu, 2024-10-31 at 14:27 -0700, Doug Anderson wrote:
> > > Hi,
> > > 
> > > On Wed, Oct 30, 2024 at 10:03 PM Paul E. McKenney
> > > <paulmck@...nel.org> wrote:
> > > > 
> > > > > > > Note that:
> > > > > > > 
> > > > > > > * Switching to real NMIs is impossible on many existing
> > > > > > > arm64 CPUs.
> > > > > > > The hardware support simply isn't there. Pseudo-NMIs
> > > > > > > should work fine
> > > > > > > and are (in nearly all cases) just as good as NMIs but
> > > > > > > they have a
> > > > > > > small performance impact. There are also compatibility
> > > > > > > issues with
> > > > > > > some pre-existing bootloaders. ...so code can't assume
> > > > > > > even Pseudo-NMI
> > > > > > > will work and needs to be able to fall back. Prior to my
> > > > > > > recent
> > > > > > > changes arm64 CPUs wouldn't even do stacktraces in some
> > > > > > > scenarios. Now
> > > > > > > at least they fall back to regular IPIs.
> > > > > > 
> > > > > > But those IPIs are of no help whatsoever when the target
> > > > > > CPU is spinning
> > > > > > with interrupts disabled, which is the scenario under
> > > > > > discussion.
> > > > > 
> > > > > Right that we can't trace the target CPU spinning with
> > > > > interrupts
> > > > > disabled.
> > > > 
> > > > You can by invoking sched_show_task(cpu_curr(cpu)) on the CPU
> > > > requesting
> > > > the backtrace.  The resulting backtrace will not be as good as
> > > > you
> > > > would get from using an NMI, but if you don't have NMIs, it is
> > > > better
> > > > than nothing.
> > > > 
> > > > >           ...but in the case where NMI isn't available it
> > > > > _still_
> > > > > makes sense to make arch_trigger_cpumask_backtrace() fall
> > > > > back to IPI
> > > > > because:
> > > > > 
> > > > > 1. There are many cases where the trigger.*cpu.*backtrace()
> > > > > class of
> > > > > functions are used to trace CPUs that _aren't_ looping with
> > > > > interrupts
> > > > > disabled. We still want to be able to backtrace in that case.
> > > > > For
> > > > > instance, you can see in "panic.c" that there are cases where
> > > > > trigger_all_cpu_backtrace() is called. It's valuable to make
> > > > > cases
> > > > > like this (where there's no indication that a CPU is locked)
> > > > > actually
> > > > > work.
> > > > > 
> > > > > 2. Even if there's a CPU that's looping with interrupts
> > > > > disabled and
> > > > > we can't trace it because of no NMI, it could still be useful
> > > > > to get
> > > > > backtraces of other CPUs. It's possible that what the other
> > > > > CPUs are
> > > > > doing could give a hint about the CPU that's wedged. This
> > > > > isn't a
> > > > > great hint, but some info is better than no info.
> > > > 
> > > > And it also makes sense for an IRQ-based backtrace to fall back
> > > > to
> > > > something like the aforementioned
> > > > sched_show_task(cpu_curr(cpu)) if
> > > > the destination CPU has interrupts disabled.
> > > > 
> > > > > > Hence my suggestion that arm64, when using IRQs to request
> > > > > > stack
> > > > > > backtraces, have an additional short timeout (milliseconds)
> > > > > > in
> > > > > > order to fall back to remote stack tracing.  In many cases,
> > > > > > this
> > > > > > fallback happens automatically, as you can see in
> > > > > > dump_cpu_task().
> > > > > > If trigger_single_cpu_backtrace() returns false, the stack
> > > > > > is dumped
> > > > > > remotely.
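
[For reference, the fallback Paul describes above is roughly what
dump_cpu_task() in kernel/sched/core.c already does; a simplified
sketch, so details may differ between kernel versions:

	void dump_cpu_task(int cpu)
	{
		/* Try the NMI/IRQ-based backtrace first. */
		if (trigger_single_cpu_backtrace(cpu))
			return;

		/* Otherwise dump the remote CPU's current task from this CPU. */
		pr_info("Task dump for CPU %d:\n", cpu);
		sched_show_task(cpu_curr(cpu));
	}
]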
> > > > > 
> > > > > I think the answer here is that the API needs to change. The
> > > > > whole
> > > > > boolean return value for trigger.*cpu.*backtrace() is mostly
> > > > > ignored
> > > > > by callers. Yes, you've pointed at one of the two places that
> > > > > it's not
> > > > > ignored and it tries a reasonable fallback, but all the other
> > > > > callers
> > > > > just silently do nothing when the function returns false.
> > > > > That led to
> > > > > many places where arm64 devices were simply not getting _any_
> > > > > stacktrace.
> > > > > 
> > > > > Perhaps we need some type of API that says "I actually have a
> > > > > fallback, so if you don't think the stacktrace will succeed
> > > > > then it's
> > > > > OK to return false".
> > > > 
> > > > Boolean is fine for trigger_single_cpu_backtrace(), which is
> > > > directed at
> > > > a single CPU.  And the one call to this function that does not
> > > > currently
> > > > check its return value could just call dump_cpu_task() instead.
> > > > 
> > > > There are only 20 or so uses of all of these functions, so the
> > > > API can
> > > > change, give or take the pain involved in changing
> > > > architecture-specific
> > > > code.
> > > 
> > > Right. Falling back to "sched_show_task(cpu_curr(cpu))" or
> > > something
> > > similar if the IPI doesn't go through is a good idea. Indeed,
> > > falling
> > > back to that if the NMI doesn't go through is _also_ a good idea,
> > > right? ...and we don't want to change all 20 callers to all add a
> > > fallback. To that end, it feels like someone should change it so
> > > that
> > > the common code includes the fallback and we get rid of the
> > > boolean
> > > return value.
> > > 
> > > > > > > * Even if we decided that we absolutely had to disable
> > > > > > > stacktraces on
> > > > > > > arm64 CPUs without some type of NMI, that won't fix
> > > > > > > arm32. arm32 has
> > > > > > > been using regular IPIs for backtraces like this for
> > > > > > > many, many years.
> > > > > > > Maybe folks don't care as much about arm32 anymore but it
> > > > > > > feels bad if
> > > > > > > we totally break it.
> > > > > > 
> > > > > > Because arm32 isn't used much for datacenter workloads, it
> > > > > > will not
> > > > > > be suffering from this issue.  Not enough to have motivated
> > > > > > anyone to
> > > > > > fix it, anyway.  In contrast, in the datacenter, CPUs
> > > > > > really can and
> > > > > > do have interrupts disabled for many seconds.  (No, this is
> > > > > > not a good
> > > > > > thing, but it is common enough for us to need to avoid
> > > > > > disabling other
> > > > > > debugging facilities.)
> > > > > > 
> > > > > > So it might well be that arm32 will continue to do just
> > > > > > fine with
> > > > > > IRQ-based stack tracing.  Or maybe someday arm32 will also
> > > > > > need to deal
> > > > > > with this same issue.  But no "maybe" for arm64, given its
> > > > > > increasing
> > > > > > use in the datacenter.
> > > > > 
> > > > > IMO the answer here is that anyone in datacenters should just
> > > > > turn on
> > > > > pseudo NMI (or NMI on newer arm64 CPUs).
> > > > 
> > > > Suppose datacenters all do this.  What part of this issue
> > > > remains?
> > > 
> > > I believe that if datacenters enable pseudo NMIs then the issue
> > > is
> > > "fixed" and thus only remains for anyone else (datacenters or
> > > other
> > > users of Linux) that don't turn on pseudo NMIs.
> > > 
> > > 
> > > > > > > IMO waiting 10 seconds for a backtrace is pretty crazy to
> > > > > > > begin with.
> > > > > > > What about just changing that globally to 1 second? If
> > > > > > > not, it doesn't
> > > > > > > feel like it would be impossibly hard to make an arch-
> > > > > > > specific
> > > > > > > callback to choose the time and that callback could even
> > > > > > > take into
> > > > > > > account whether we managed to get an NMI. I'd be happy to
> > > > > > > review such
> > > > > > > a patch.
> > > > > > 
> > > > > > Let's please keep the 10-second timeout for NMI-based
> > > > > > backtraces.
> > > > > > 
> > > > > > But I take it that you are not opposed to a shorter timeout
> > > > > > for the
> > > > > > special case of IRQ-based stack backtrace requests?
> > > > > 
> > > > > IMO the 10-second timeout is still awfully long (HW watchdogs can
> > > > > trigger
> > > > > during that time), but I'm not that upset by it. I'm OK with
> > > > > shorter
> > > > > timeouts for IRQ-based ones, though I'm not sure I'd be OK w/
> > > > > timeouts
> > > > > measured in milliseconds unless the caller has actually said
> > > > > that they
> > > > > can handle a "false" return or the caller says that it's more
> > > > > important to be quick than to wait for a long time.
> > > > 
> > > > If you are using NMIs, and the CPU doesn't respond in 10
> > > > seconds,
> > > > something is likely broken.  Maybe bad firmware or bad
> > > > hardware.  You are
> > > > right that watchdogs might trigger, but they are likely already
> > > > triggering
> > > > due to the target CPU being so firmly locked up that it is not
> > > > responding
> > > > even to NMIs.
> > > 
> > > Yeah, if NMIs aren't working then things are pretty bad. It still
> > > seems like it would be better to let Linux dump more info before
> > > a
> > > hardware watchdog triggers. For instance it could perhaps call
> > > something akin to "sched_show_task(cpu_curr(cpu))".
> > > 
> > > I don't feel super strongly about it, but in my mind even if a
> > > CPU is
> > > unresponsive to NMIs for 1-2 seconds then things are in pretty
> > > bad
> > > shape and I'd want to start dumping some more info before the
> > > hardware
> > > watchdog kicks in and we can't dump anything.
> > > 
> > > 
> > > > On the other hand, when you are not using NMIs, then I agree
> > > > you want a shorter timeout before remotely tracing the stack
> > > > via
> > > > sched_show_task(cpu_curr(cpu)).  I picked a few milliseconds,
> > > > but if
> > > > datacenters are using real NMIs or pseudo NMIs (and thus are
> > > > not being
> > > > blocked by normal interrupt disabling), then I must defer to
> > > > others on
> > > > the default timeout.  Much depends on the workload and
> > > > configuration.
> > > 
> > > As I said, I don't feel strongly about this, so if people want to
> > > keep
> > > the timeout or shorten it or add a knob, any of those would be
> > > fine
> > > with me. I personally would object to a timeout that's _too_
> > > short or
> > > a timeout that is longer than the current 10 seconds.
> > > 
> > > -Doug
> > 
> > I hope this fix can be applied to the stable kernels. Modifying an
> > API
> > that is called by many architectures may encounter difficulties
> > during
> > the backporting process.
> > 
> > How about this: we keep the original
> > nmi_trigger_cpumask_backtrace()
> > but add a __nmi_trigger_cpumask_backtrace() in the middle that can
> > accept a timeout parameter.
> > 
> > 
> > +#define NMI_BACKTRACE_TIMEOUT (10 * 1000)
> > 
> > void nmi_trigger_cpumask_backtrace(...)
> > +{
> > +	__nmi_trigger_cpumask_backtrace(..., NMI_BACKTRACE_TIMEOUT);
> > +}
> > +
> > +void __nmi_trigger_cpumask_backtrace(..., unsigned long timeout)
> > {
> > ...
> > -	for (i = 0; i < 10 * 1000; i++) {
> > +	for (i = 0; i < timeout; i++) {
> > 
> > 
> > Then, the arch_trigger_cpumask_backtrace() for different
> > architectures
> > can pass in their desired timeout, for example:
> > 
> > 	/* 1 second if falling back to IPI */
> > 	timeout = has_nmi() ? NMI_BACKTRACE_TIMEOUT : 1000;
> > 	__nmi_trigger_cpumask_backtrace(..., timeout);
> > 
> > This way, architectures that want different timeouts can modify it
> > themselves without having to implement complex logic on their own.
> 
> Why not leave the current 10-second timeout (which is needed to
> handle
> completely locked-up CPUs), and then add logic to the arm64 code that
> does the substitution of the plain interrupt for the NMI?  If needed,
> arm32 can do the same thing.
> 
> That way, we don't need to change the common-code API, we don't need
> coordinated changes among multiple architectures, and architectures
> using real NMIs need not change.
> 
> 							Thanx, Paul

I ran into the same issue and have an idea based on Paul's suggestion.

How about this: check the "backtrace_mask" earlier, in the
architecture-specific callback (arm64_backtrace_ipi) used by
`nmi_trigger_cpumask_backtrace`, and wait only one second there when we
have fallen back to regular IPIs (i.e. when
system_uses_irq_prio_masking() returns false).
If that shorter wait times out, clear the "backtrace_mask" so that
`nmi_trigger_cpumask_backtrace` does not then spin for its full ten
seconds on CPUs that never responded to the IPI.
This approach does not require modifying any common-code API.

File: arch/arm64/kernel/smp.c:

 static void arm64_backtrace_ipi(cpumask_t *mask)
 {
 	__ipi_send_mask(ipi_desc[IPI_CPU_BACKTRACE], mask);
+
+	if (!system_uses_irq_prio_masking()) {
+		/* Wait up to 1 second for all CPUs to do the backtrace */
+		for (int i = 0; i < 1000; i++) {
+			if (cpumask_empty(mask))
+				return;
+			mdelay(1);
+			touch_softlockup_watchdog();
+		}
+		/*
+		 * Clear the cpumask so that nmi_trigger_cpumask_backtrace()
+		 * does not wait again on CPUs that did not respond to the IPI.
+		 */
+		cpumask_clear(mask);
+	}
 }
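
For reference, the common-code wait that the final cpumask_clear()
short-circuits is the existing loop in nmi_trigger_cpumask_backtrace()
(lib/nmi_backtrace.c); roughly (quoted from memory, so the exact code
may differ between kernel versions):

	/* Wait for up to 10 seconds for all CPUs to do the backtrace */
	for (i = 0; i < 10 * 1000; i++) {
		if (cpumask_empty(to_cpumask(backtrace_mask)))
			break;
		mdelay(1);
		touch_softlockup_watchdog();
	}

Since the raise() callback (arm64_backtrace_ipi here) is passed
to_cpumask(backtrace_mask) itself, clearing the mask after the
one-second wait makes this loop exit on its first check instead of
spinning for the full ten seconds.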

If this change is acceptable, I would be happy to submit a formal
patch.
Before I do that, I'd like to get some input from the maintainers here.

---
Thanks, Tze-nan
