linux-kernel - Re: [RFC PATCH V1] watchdog: Add boot-time selection for hard lockup detector

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAD=FV=U3ic707dLuUc+NfxtWF6-ZyRdE0OY2VA6TgvgWKCHUzg@mail.gmail.com>
Date: Tue, 7 Oct 2025 12:54:08 -0700
From: Doug Anderson <dianders@...omium.org>
To: Ian Rogers <irogers@...gle.com>
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>, Jinchao Wang <wangjinchao600@...il.com>, 
	Namhyung Kim <namhyung@...nel.org>, Peter Zijlstra <peterz@...radead.org>, 
	Will Deacon <will@...nel.org>, Yunhui Cui <cuiyunhui@...edance.com>, akpm@...ux-foundation.org, 
	catalin.marinas@....com, maddy@...ux.ibm.com, mpe@...erman.id.au, 
	npiggin@...il.com, christophe.leroy@...roup.eu, tglx@...utronix.de, 
	mingo@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com, hpa@...or.com, 
	acme@...nel.org, mark.rutland@....com, alexander.shishkin@...ux.intel.com, 
	jolsa@...nel.org, adrian.hunter@...el.com, kan.liang@...ux.intel.com, 
	kees@...nel.org, masahiroy@...nel.org, aliceryhl@...gle.com, ojeda@...nel.org, 
	thomas.weissschuh@...utronix.de, xur@...gle.com, ruanjinjie@...wei.com, 
	gshan@...hat.com, maz@...nel.org, suzuki.poulose@....com, 
	zhanjie9@...ilicon.com, yangyicong@...ilicon.com, gautam@...ux.ibm.com, 
	arnd@...db.de, zhao.xichao@...o.com, rppt@...nel.org, lihuafei1@...wei.com, 
	coxu@...hat.com, jpoimboe@...nel.org, yaozhenguo1@...il.com, 
	luogengkun@...weicloud.com, max.kellermann@...os.com, tj@...nel.org, 
	yury.norov@...il.com, thorsten.blum@...ux.dev, x86@...nel.org, 
	linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org, 
	linuxppc-dev@...ts.ozlabs.org, linux-perf-users@...r.kernel.org
Subject: Re: [RFC PATCH V1] watchdog: Add boot-time selection for hard lockup detector

Hi,

On Mon, Oct 6, 2025 at 6:00 PM Ian Rogers <irogers@...gle.com> wrote:
>
> On Mon, Oct 6, 2025 at 4:32 PM Doug Anderson <dianders@...omium.org> wrote:
> >
> > Hi,
> >
> > On Mon, Oct 6, 2025 at 2:30 PM Ian Rogers <irogers@...gle.com> wrote:
> > >
> > > On Tue, Sep 16, 2025 at 11:14 PM Jinchao Wang <wangjinchao600@...il.com> wrote:
> > > >
> > > > On Tue, Sep 16, 2025 at 10:35:46PM -0700, Namhyung Kim wrote:
> > > > > Hello,
> > > > >
> > > > > On Tue, Sep 16, 2025 at 10:13:12PM -0700, Ian Rogers wrote:
> > > > > > On Tue, Sep 16, 2025 at 6:47 PM Jinchao Wang <wangjinchao600@...il.com> wrote:
> > > > > > >
> > > > > > > On Tue, Sep 16, 2025 at 05:03:48PM -0700, Ian Rogers wrote:
> > > > > > > > On Tue, Sep 16, 2025 at 7:51 AM Jinchao Wang <wangjinchao600@...il.com> wrote:
> > > > > > > > >
> > > > > > > > > Currently, the hard lockup detector is selected at compile time via
> > > > > > > > > Kconfig, which requires a kernel rebuild to switch implementations.
> > > > > > > > > This is inflexible, especially on systems where a perf event may not
> > > > > > > > > be available or may be needed for other tasks.
> > > > > > > > >
> > > > > > > > > This commit refactors the hard lockup detector to replace a rigid
> > > > > > > > > compile-time choice with a flexible build-time and boot-time solution.
> > > > > > > > > The patch supports building the kernel with either detector
> > > > > > > > > independently, or with both. When both are built, a new boot parameter
> > > > > > > > > `hardlockup_detector="perf|buddy"` allows the selection at boot time.
> > > > > > > > > This is a more robust and user-friendly design.
> > > > > > > > >
> > > > > > > > > This patch is a follow-up to the discussion on the kernel mailing list
> > > > > > > > > regarding the preference and future of the hard lockup detectors. It
> > > > > > > > > implements a flexible solution that addresses the community's need to
> > > > > > > > > select an appropriate detector at boot time.
> > > > > > > > >
> > > > > > > > > The core changes are:
> > > > > > > > > - The `perf` and `buddy` watchdog implementations are separated into
> > > > > > > > >   distinct functions (e.g., `watchdog_perf_hardlockup_enable`).
> > > > > > > > > - Global function pointers are introduced (`watchdog_hardlockup_enable_ptr`)
> > > > > > > > >   to serve as a single API for the entire feature.
> > > > > > > > > - A new `hardlockup_detector=` boot parameter is added to allow the
> > > > > > > > >   user to select the desired detector at boot time.
> > > > > > > > > - The Kconfig options are simplified by removing the complex
> > > > > > > > >   `HARDLOCKUP_DETECTOR_PREFER_BUDDY` and allowing both detectors to be
> > > > > > > > >   built without mutual exclusion.
> > > > > > > > > - The weak stubs are updated to call the new function pointers,
> > > > > > > > >   centralizing the watchdog logic.
> > > > > > > >
> > > > > > > > What is the impact on  /proc/sys/kernel/nmi_watchdog ? Is that
> > > > > > > > enabling and disabling whatever the boot time choice was? I'm not sure
> > > > > > > > why this has to be a boot time option given the ability to configure
> > > > > > > > via /proc/sys/kernel/nmi_watchdog.
> > > > > > > The new hardlockup_detector boot parameter and the existing
> > > > > > > /proc/sys/kernel/nmi_watchdog file serve different purposes.
> > > > > > >
> > > > > > > The boot parameter selects the type of hard lockup detector (perf or buddy).
> > > > > > > This choice is made once at boot.
> > > > > > >
> > > > > > >  /proc/sys/kernel/nmi_watchdog, on the other hand, is only a simple on/off
> > > > > > > switch for the currently selected detector. It does not change the detector's
> > > > > > > type.
> > > > > >
> > > > > > So the name "nmi_watchdog" for the buddy watchdog is wrong for fairly
> > > > > > obvious naming reasons but also because we can't differentiate when a
> > > > > > perf event has been taken or not - this impacts perf that is choosing
> > > > > > not to group events in metrics because of it, reducing the metric's
> > > > > > accuracy. We need an equivalent "buddy_watchdog" file to the
> > > > > > "nmi_watchdog" file. If we have such a file then if I did "echo 1 >
> > > > > > /proc/sys/kernel/nmi_watchdog" I'd expect the buddy watchdog to be
> > > > > > disabled and the perf event one to be enabled. Similarly, if I did
> > > > > > "echo 1 > /proc/sys/kernel/buddy_watchdog" then I would expect the
> > > > > > perf event watchdog to be disabled and the buddy one enabled. If I did
> > > > > >  "echo 0 > /proc/sys/kernel/nmi_watchdog; echo 0 >
> > > > > > /proc/sys/kernel/buddy_watchdog" then I'd expect neither to be
> > > > > > enabled. I don't see why choosing the type of watchdog implementation
> > > > > > at boot time is particularly desirable. It seems sensible to default
> > > > > > normal people to using the buddy watchdog (more perf events, power...)
> > > > > > and  CONFIG_DEBUG_KERNEL type people to using the perf event one. As
> > > > > > the "nmi_watchdog" file may be assumed to control the buddy watchdog,
> > > > > > perhaps a compatibility option (where the "nmi_watchdog" file controls
> > > > > > the buddy watchdog) is needed so that user code has time to migrate.
> > > > >
> > > > > Sounds good to me.  For perf tools, it'd be great if we can have a run-
> > > > > time check which watchdog is selected.
> > > > Considering backward compatibility, I prefer to keep
> > > > /proc/sys/kernel/nmi_watchdog and introduce a new file called
> > > > /proc/sys/kernel/hardlockup_detector_type, which only shows the default string
> > > > or the boot parameter.
> > >
> > > Is there code using the buddy watchdog that cares about the
> > > /proc/sys/kernel/nmi_watchdog file? My assumption is that everything
> > > except Android is using the NMI watchdog, so a new
> > > /proc/sys/kernel/buddy_watchdog file doesn't impact them.
> >
> > Buddy watchdog has been out there for a few years. At Google, I know
> > it's used by everything except Android. AKA I believe it is used in
> > Google's servers and also in ChromeOS. Both of those (presumably)
> > enable the buddy watchdog via calling it "nmi_watchdog". It's possible
> > that some Android phones are using the buddy watchdog too but I'm not
> > aware of it. I don't believe Pixel is using it, though that could
> > change in the future.
>
> I thought what motivated the buddy watchdog was patches implementing
> this approach on Android for ARM that lacked an NMI based hard lockup
> detector?

It's probably not worth a full rehash of the history, but suffice to
say that the buddy lockup detector did originally come out of Android
but isn't currently used there as far as I can tell.


> Anyway, while the buddy watchdog is in use by Google servers the
> nmi_watchdog file has an actively detrimental effect on that.
> Specifically the nmi_watchdog file having the value of "1" disables
> certain event groups for certain metrics in the perf tool as it is
> assumed there are too few performance counters due to the NMI watchdog
> stealing one. We want groups of events so that events are scheduled
> together and metrics are more accurate, we don't want groups that fail
> to schedule because of a lack of counters.

Fair enough, but that code could also change.


> > IMO at this point "nmi watchdog" is simply a synonym for the
> > hardlockup detector. That was what it looked like in the code before I
> > started messing around and adding the buddy lockup detector and it's
> > how it is now. While it's unfortunate that there are two names for the
> > same thing, I don't personally think that should change at this point.
> > FWIW, even the "buddy" watchdog relies on NMIs to actually get stack
> > crawls on stuck cores, so NMI still means something even there.
>
> I think this is misguided. Currently the only use of nmi_watchdog I'm
> aware of is by perf where the buddy watchdog's use of it causes issues
> (as mentioned above).
>
> > If we want to tell between the perf detector or the buddy detector we
> > should add a separate file for it.
>
> We could with perf then having to read from two files rather than one.
> Presumably the lack of presence of one file will be sufficient to also
> avoid a kernel version check.

This was what I was assuming


> > > On Android
> > > writing to /proc/sys/kernel/nmi_watchdog would switch from updating
> > > the buddy watchdog enable/disable to the NMI watchdog enable/disable,
> > > but it is a straightforward patch to make anything doing this update
> > > the buddy_watchdog file instead.
> >
> > Straightforward, but you've got to go find everyone that you break by
> > doing this. That's not something I want responsibility for. If you
> > want to convince others this is something worthwhile and you've got
> > someone signed up to deal with the fallout (if any) then I won't
> > object, but it's not something I'd be in support of.
>
> Stuff like this happens, check out this thread:
> https://lore.kernel.org/lkml/2025020304-chip-trench-4e56@gregkh/
> Imo we shouldn't design in using an actively misleading file name and
> incurring extra overhead in perf. Having two files nmi_watchdog and
> buddy_watchdog is fine as the latter case currently isn't in
> mainstream distro use and people shouldn't care. It also maintains and
> correct's perf's behavior when the buddy and not nmi watchdog is in
> use.
>
> > > If we have to keep "nmi_watchdog" then we should deprecate it and
> > > create an equivalent file with a better name (ie without NMI in it).
> > > It'll be moderately annoying in perf to determine whether the NMI
> > > watchdog is enabled by having to read two files.
> >
> > Again, up to you if you want to try to do this, but I'm not really in
> > support of it. It doesn't seem terribly hard to make a new file that
> > says which hardlockup detector is in use. If that file exists then
> > read it. If not then you fallback to what you have today.
>
> I don't mind a file that also says what kind of lockup detector is in
> use. I'd like the meaning of nmi_watchdog to keep meaning the kernel
> stole a perf counter as this is the behavior long assumed by perf
> which I think is the majority or only user of the file. I think the
> buddy watchdog being controlled by this file is less than intention
> revealing.

I'm more than happy to be outvoted, but IMO nothing about the name
"nmi_watchdog" says to me that a perf counter was stolen. It just says
that there's a watchdog that NMIs are part of its work.

...and, indeed, "nmi_watchdog" doesn't use perf on PPC, right? As far
as I can tell, from reading `arch/powerpc/kernel/watchdog.c` checks
`watchdog_enabled & WATCHDOG_HARDLOCKUP_ENABLED`. ...and before I did
commit df95d3085caa ("watchdog/hardlockup: rename some "NMI watchdog"
constants/function") it was checking `watchdog_enabled &
NMI_WATCHDOG_ENABLED`. That was at least since 2019, I think...

...and you can see "PPC_WATCHDOG" depends on
"HARDLOCKUP_DETECTOR_ARCH", not "HARDLOCKUP_DETECTOR_PERF" so it's not
perf-backed.

I think this is just the critical question: does/should "nmi_watchdog"
mean perf-backed detector or is "nmi_watchdog" just a synonym for
"hard lockup detector". In your mind it's the former and in my mind
it's the latter. The correct way forward depends on which way we want
to jump, right? In either case some external code is going to need to
change...

The kernel docs are certainly pretty ambiguous here. "kernel.rst"
certainly mentions perf, but it also says that "nmi_watchdog" is only
for x86 which is certainly not true. "lockup-watchdogs.rst" doesn't
say anything about perf and just seems to indicate that "nmi_watchdog"
is another name for the hardlockup detector.  "kernel-parameters.txt"
makes some mention of tweaking which perf event would be used, but
otherwise makes it sound like the nmi_watchdog is just another name
for the hardlockup detector.

My vote would be two files:

* "nmi_watchdog" - simply a synonym for "hardlockup detector". 1
enabled, 0 if not (like today)

* "hardlock_detector" - could be "buddy", "perf", or "arch"

...if the "hardlockup_detector" file doesn't exist and "nmi_watchdog"
is enabled, you could probably guess that a perf event was in use. If
the "hardlockup_detector" file exists then a perf event is only in use
if the value of that file is "perf". It doesn't feel terribly hard to
do this and it doesn't break anyone who was assuming they could turn
on the hardlockup detector with "/proc/nmi_watchdog".

-Doug