[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAD=FV=UWkZx8xQD=jBkOO6h2f5tw_KCoqhHciw5hkEOYU=GM8A@mail.gmail.com>
Date: Tue, 7 Oct 2025 14:43:16 -0700
From: Doug Anderson <dianders@...omium.org>
To: Ian Rogers <irogers@...gle.com>
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>, Jinchao Wang <wangjinchao600@...il.com>,
Namhyung Kim <namhyung@...nel.org>, Peter Zijlstra <peterz@...radead.org>,
Will Deacon <will@...nel.org>, Yunhui Cui <cuiyunhui@...edance.com>, akpm@...ux-foundation.org,
catalin.marinas@....com, maddy@...ux.ibm.com, mpe@...erman.id.au,
npiggin@...il.com, christophe.leroy@...roup.eu, tglx@...utronix.de,
mingo@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com, hpa@...or.com,
acme@...nel.org, mark.rutland@....com, alexander.shishkin@...ux.intel.com,
jolsa@...nel.org, adrian.hunter@...el.com, kan.liang@...ux.intel.com,
kees@...nel.org, masahiroy@...nel.org, aliceryhl@...gle.com, ojeda@...nel.org,
thomas.weissschuh@...utronix.de, xur@...gle.com, ruanjinjie@...wei.com,
gshan@...hat.com, maz@...nel.org, suzuki.poulose@....com,
zhanjie9@...ilicon.com, yangyicong@...ilicon.com, gautam@...ux.ibm.com,
arnd@...db.de, zhao.xichao@...o.com, rppt@...nel.org, lihuafei1@...wei.com,
coxu@...hat.com, jpoimboe@...nel.org, yaozhenguo1@...il.com,
luogengkun@...weicloud.com, max.kellermann@...os.com, tj@...nel.org,
yury.norov@...il.com, thorsten.blum@...ux.dev, x86@...nel.org,
linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
linuxppc-dev@...ts.ozlabs.org, linux-perf-users@...r.kernel.org
Subject: Re: [RFC PATCH V1] watchdog: Add boot-time selection for hard lockup detector
Hi,
On Tue, Oct 7, 2025 at 1:43 PM Ian Rogers <irogers@...gle.com> wrote:
>
> On Tue, Oct 7, 2025 at 1:00 PM Doug Anderson <dianders@...omium.org> wrote:
> > On Mon, Oct 6, 2025 at 6:00 PM Ian Rogers <irogers@...gle.com> wrote:
> ..
> > > I don't mind a file that also says what kind of lockup detector is in
> > > use. I'd like the meaning of nmi_watchdog to keep meaning the kernel
> > > stole a perf counter as this is the behavior long assumed by perf
> > > which I think is the majority or only user of the file. I think the
> > > buddy watchdog being controlled by this file is less than intention
> > > revealing.
> >
> > I'm more than happy to be outvoted, but IMO nothing about the name
> > "nmi_watchdog" says to me that a perf counter was stolen. It just says
> > that there's a watchdog that NMIs are part of its work.
> >
> > ...and, indeed, "nmi_watchdog" doesn't use perf on PPC, right? As far
> > as I can tell, from reading `arch/powerpc/kernel/watchdog.c` checks
> > `watchdog_enabled & WATCHDOG_HARDLOCKUP_ENABLED`. ...and before I did
> > commit df95d3085caa ("watchdog/hardlockup: rename some "NMI watchdog"
> > constants/function") it was checking `watchdog_enabled &
> > NMI_WATCHDOG_ENABLED`. That was at least since 2019, I think...
> >
> > ...and you can see "PPC_WATCHDOG" depends on
> > "HARDLOCKUP_DETECTOR_ARCH", not "HARDLOCKUP_DETECTOR_PERF" so it's not
> > perf-backed.
> >
> > I think this is just the critical question: does/should "nmi_watchdog"
> > mean perf-backed detector or is "nmi_watchdog" just a synonym for
> > "hard lockup detector". In your mind it's the former and in my mind
> > it's the latter. The correct way forward depends on which way we want
> > to jump, right? In either case some external code is going to need to
> > change...
>
> So we could say it is one-idea against another, I'm hoping to be
> objective and come from the viewpoint of the perf code as part of its
> function is to be a demonstration of APIs and the buddy watchdog has
> altered this. The perf tool has referred to the nmi_watchdog's
> behavior in man pages and code since 2016:
> http://lkml.kernel.org/r/1459810686-15913-1-git-send-email-andi@firstfloor.org
> ```
> +Globally pinned events can limit the number of counters available for
> +other groups. On x86 systems, the NMI watchdog pins a counter by default.
> +The nmi watchdog can be disabled as root with
> +
> + echo 0 > /proc/sys/kernel/nmi_watchdog
> ```
> http://lkml.kernel.org/r/1464119559-17203-1-git-send-email-andi@firstfloor.org
> ```
> +void arch_topdown_group_warn(void)
> +{
> + fprintf(stderr,
> + "nmi_watchdog enabled with topdown. May give wrong results.\n"
> + "Disable with echo 0 > /proc/sys/kernel/nmi_watchdog\n");
> +}
> ```
>
> Probably the most common error message dates back to 2017:
> http://lkml.kernel.org/r/20170211183218.ijnvb5f7ciyuunx4@pd.tnic
> ```
> Some events weren't counted. Try disabling the NMI watchdog:
> echo 0 > /proc/sys/kernel/nmi_watchdog
> perf stat ...
> echo 1 > /proc/sys/kernel/nmi_watchdog
> ```
> and that is saying "NMI watchdog" not "buddy watchdog". Users are
> familiar with the idea that the /proc/sys/kernel/nmi_watchdog is
> unremarkably controlling the NMI watchdog.
>
> I've not found another use of /proc/sys/kernel/nmi_watchdog outside of
> the perf tool.
Although it's possible I missed it, from my quick look in ChromeOS, I
don't see any use of `/proc/sys/kernel/nmi_watchdog` either. NOTE that
you also need to look for `sysctl` references to
`kernel.nmi_watchdog`, right? That's essentially a way to access the
exact same file...
The other thing you need to think about is the kernel command-line
parameter. Right now we have the kernel command line parameter
`nmi_watchdog=<0,1>` that can be used to turn the hardlockup detector
on or off. On PowerPC I believe it turns on/off the ARCH hardlockup
detector. On systems with the buddy hardlockup detector it turns
on/off the buddy hardlockup detector. How are you proposing to handle
that? Are you going to make the kernel command line parameter still
affect everyone, or require people using the kernel command line
parameter for buddy/powerpc to change?
All of these ways to turn off/on the hardlockup detector can, in
theory, also be things that end users are messing with. People may
have settings in their `/etc/sysctl.d` to tweak things. They may have
messed with the kernel command line in their bootloader to tweak this
setting. I'm not saying it's impossible to change, but anyone who
changes this has to be prepared for people to yell.
> > The kernel docs are certainly pretty ambiguous here. "kernel.rst"
> > certainly mentions perf, but it also says that "nmi_watchdog" is only
> > for x86 which is certainly not true. "lockup-watchdogs.rst" doesn't
> > say anything about perf and just seems to indicate that "nmi_watchdog"
> > is another name for the hardlockup detector. "kernel-parameters.txt"
> > makes some mention of tweaking which perf event would be used, but
> > otherwise makes it sound like the nmi_watchdog is just another name
> > for the hardlockup detector.
> >
> > My vote would be two files:
> >
> > * "nmi_watchdog" - simply a synonym for "hardlockup detector". 1
> > enabled, 0 if not (like today)
> >
> > * "hardlock_detector" - could be "buddy", "perf", or "arch"
> >
> > ...if the "hardlockup_detector" file doesn't exist and "nmi_watchdog"
> > is enabled, you could probably guess that a perf event was in use. If
> > the "hardlockup_detector" file exists then a perf event is only in use
> > if the value of that file is "perf". It doesn't feel terribly hard to
> > do this and it doesn't break anyone who was assuming they could turn
> > on the hardlockup detector with "/proc/nmi_watchdog".
>
> It is not hard but:
> 1) it means whenever perf wants to determine the NMI watchdog is
> present it needs to read two files rather than the existing 1, which
> has some runtime cost;
> 2) the name nmi_watchdog for controlling the behavior of the buddy
> watchdog isn't intention revealing as the buddy mechanisms whole point
> is to avoid the NMI;
The buddy mechanism is to avoid the use of a perf counter, not an NMI
(non maskable interrupt). "NMI" != "perf counter". Many types of
interrupts can be non-maskable (or pseudo-non-maskable). The buddy
lockup detector still uses an NMI (indirectly). Specifically, a
non-maskable IPI (NMI-backed IPI) is used to interrupt the other CPU.
You keep saying that perf wants to determine if the "NMI watchdog" is
present. That's not what it wants to know. It wants to know if the
hard lockup detector (which likely uses NMIs in some way) happens to
use up a perf event. If an NMI-backed hardlockup detector (presumably
like the one on PowerPC) wasn't using a perf counter then perf
wouldn't care.
> 3) old perf tools with the buddy watchdog have the wrong behavior
> (they've regressed).
>
> It is also not hard to have the watchdog files named
> <mechanism>_watchdog, such as buddy_watchdog, nmi_watchdog,
> arch_watchdog and have the contents be 0 or 1 as appropriate. Such a
> scheme would have less overhead for perf, make the name more intention
> revealing and not alter old perf tools. I'm really not sure what
> problem we're trying to fix by not adopting this approach. I was
> surprised the buddy watchdog merged but using the nmi_watchdog file.
The buddy watchdog was pretty much following the conventions that were
already in the code: that the hardlockup detector (whether backed by
perf or not) was essentially called the "nmi watchdog". There were a
number of people that were involved in reviews and I don't believe
suggesting creating a whole different mechanism for enabling /
disabling the buddy watchdog was never suggested.
-Doug
Powered by blists - more mailing lists