linux-kernel - Re: [RFC PATCH V1] watchdog: Add boot-time selection for hard lockup detector

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAP-5=fW64xHEW+4dKU_voNv7E67nUOFm27FFBuhtFii52NiQUQ@mail.gmail.com>
Date: Mon, 6 Oct 2025 18:00:09 -0700
From: Ian Rogers <irogers@...gle.com>
To: Doug Anderson <dianders@...omium.org>, Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Cc: Jinchao Wang <wangjinchao600@...il.com>, Namhyung Kim <namhyung@...nel.org>, 
	Peter Zijlstra <peterz@...radead.org>, Will Deacon <will@...nel.org>, 
	Yunhui Cui <cuiyunhui@...edance.com>, akpm@...ux-foundation.org, 
	catalin.marinas@....com, maddy@...ux.ibm.com, mpe@...erman.id.au, 
	npiggin@...il.com, christophe.leroy@...roup.eu, tglx@...utronix.de, 
	mingo@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com, hpa@...or.com, 
	acme@...nel.org, mark.rutland@....com, alexander.shishkin@...ux.intel.com, 
	jolsa@...nel.org, adrian.hunter@...el.com, kan.liang@...ux.intel.com, 
	kees@...nel.org, masahiroy@...nel.org, aliceryhl@...gle.com, ojeda@...nel.org, 
	thomas.weissschuh@...utronix.de, xur@...gle.com, ruanjinjie@...wei.com, 
	gshan@...hat.com, maz@...nel.org, suzuki.poulose@....com, 
	zhanjie9@...ilicon.com, yangyicong@...ilicon.com, gautam@...ux.ibm.com, 
	arnd@...db.de, zhao.xichao@...o.com, rppt@...nel.org, lihuafei1@...wei.com, 
	coxu@...hat.com, jpoimboe@...nel.org, yaozhenguo1@...il.com, 
	luogengkun@...weicloud.com, max.kellermann@...os.com, tj@...nel.org, 
	yury.norov@...il.com, thorsten.blum@...ux.dev, x86@...nel.org, 
	linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org, 
	linuxppc-dev@...ts.ozlabs.org, linux-perf-users@...r.kernel.org
Subject: Re: [RFC PATCH V1] watchdog: Add boot-time selection for hard lockup detector

On Mon, Oct 6, 2025 at 4:32 PM Doug Anderson <dianders@...omium.org> wrote:
>
> Hi,
>
> On Mon, Oct 6, 2025 at 2:30 PM Ian Rogers <irogers@...gle.com> wrote:
> >
> > On Tue, Sep 16, 2025 at 11:14 PM Jinchao Wang <wangjinchao600@...il.com> wrote:
> > >
> > > On Tue, Sep 16, 2025 at 10:35:46PM -0700, Namhyung Kim wrote:
> > > > Hello,
> > > >
> > > > On Tue, Sep 16, 2025 at 10:13:12PM -0700, Ian Rogers wrote:
> > > > > On Tue, Sep 16, 2025 at 6:47 PM Jinchao Wang <wangjinchao600@...il.com> wrote:
> > > > > >
> > > > > > On Tue, Sep 16, 2025 at 05:03:48PM -0700, Ian Rogers wrote:
> > > > > > > On Tue, Sep 16, 2025 at 7:51 AM Jinchao Wang <wangjinchao600@...il.com> wrote:
> > > > > > > >
> > > > > > > > Currently, the hard lockup detector is selected at compile time via
> > > > > > > > Kconfig, which requires a kernel rebuild to switch implementations.
> > > > > > > > This is inflexible, especially on systems where a perf event may not
> > > > > > > > be available or may be needed for other tasks.
> > > > > > > >
> > > > > > > > This commit refactors the hard lockup detector to replace a rigid
> > > > > > > > compile-time choice with a flexible build-time and boot-time solution.
> > > > > > > > The patch supports building the kernel with either detector
> > > > > > > > independently, or with both. When both are built, a new boot parameter
> > > > > > > > `hardlockup_detector="perf|buddy"` allows the selection at boot time.
> > > > > > > > This is a more robust and user-friendly design.
> > > > > > > >
> > > > > > > > This patch is a follow-up to the discussion on the kernel mailing list
> > > > > > > > regarding the preference and future of the hard lockup detectors. It
> > > > > > > > implements a flexible solution that addresses the community's need to
> > > > > > > > select an appropriate detector at boot time.
> > > > > > > >
> > > > > > > > The core changes are:
> > > > > > > > - The `perf` and `buddy` watchdog implementations are separated into
> > > > > > > >   distinct functions (e.g., `watchdog_perf_hardlockup_enable`).
> > > > > > > > - Global function pointers are introduced (`watchdog_hardlockup_enable_ptr`)
> > > > > > > >   to serve as a single API for the entire feature.
> > > > > > > > - A new `hardlockup_detector=` boot parameter is added to allow the
> > > > > > > >   user to select the desired detector at boot time.
> > > > > > > > - The Kconfig options are simplified by removing the complex
> > > > > > > >   `HARDLOCKUP_DETECTOR_PREFER_BUDDY` and allowing both detectors to be
> > > > > > > >   built without mutual exclusion.
> > > > > > > > - The weak stubs are updated to call the new function pointers,
> > > > > > > >   centralizing the watchdog logic.
> > > > > > >
> > > > > > > What is the impact on  /proc/sys/kernel/nmi_watchdog ? Is that
> > > > > > > enabling and disabling whatever the boot time choice was? I'm not sure
> > > > > > > why this has to be a boot time option given the ability to configure
> > > > > > > via /proc/sys/kernel/nmi_watchdog.
> > > > > > The new hardlockup_detector boot parameter and the existing
> > > > > > /proc/sys/kernel/nmi_watchdog file serve different purposes.
> > > > > >
> > > > > > The boot parameter selects the type of hard lockup detector (perf or buddy).
> > > > > > This choice is made once at boot.
> > > > > >
> > > > > >  /proc/sys/kernel/nmi_watchdog, on the other hand, is only a simple on/off
> > > > > > switch for the currently selected detector. It does not change the detector's
> > > > > > type.
> > > > >
> > > > > So the name "nmi_watchdog" for the buddy watchdog is wrong for fairly
> > > > > obvious naming reasons but also because we can't differentiate when a
> > > > > perf event has been taken or not - this impacts perf that is choosing
> > > > > not to group events in metrics because of it, reducing the metric's
> > > > > accuracy. We need an equivalent "buddy_watchdog" file to the
> > > > > "nmi_watchdog" file. If we have such a file then if I did "echo 1 >
> > > > > /proc/sys/kernel/nmi_watchdog" I'd expect the buddy watchdog to be
> > > > > disabled and the perf event one to be enabled. Similarly, if I did
> > > > > "echo 1 > /proc/sys/kernel/buddy_watchdog" then I would expect the
> > > > > perf event watchdog to be disabled and the buddy one enabled. If I did
> > > > >  "echo 0 > /proc/sys/kernel/nmi_watchdog; echo 0 >
> > > > > /proc/sys/kernel/buddy_watchdog" then I'd expect neither to be
> > > > > enabled. I don't see why choosing the type of watchdog implementation
> > > > > at boot time is particularly desirable. It seems sensible to default
> > > > > normal people to using the buddy watchdog (more perf events, power...)
> > > > > and  CONFIG_DEBUG_KERNEL type people to using the perf event one. As
> > > > > the "nmi_watchdog" file may be assumed to control the buddy watchdog,
> > > > > perhaps a compatibility option (where the "nmi_watchdog" file controls
> > > > > the buddy watchdog) is needed so that user code has time to migrate.
> > > >
> > > > Sounds good to me.  For perf tools, it'd be great if we can have a run-
> > > > time check which watchdog is selected.
> > > Considering backward compatibility, I prefer to keep
> > > /proc/sys/kernel/nmi_watchdog and introduce a new file called
> > > /proc/sys/kernel/hardlockup_detector_type, which only shows the default string
> > > or the boot parameter.
> >
> > Is there code using the buddy watchdog that cares about the
> > /proc/sys/kernel/nmi_watchdog file? My assumption is that everything
> > except Android is using the NMI watchdog, so a new
> > /proc/sys/kernel/buddy_watchdog file doesn't impact them.
>
> Buddy watchdog has been out there for a few years. At Google, I know
> it's used by everything except Android. AKA I believe it is used in
> Google's servers and also in ChromeOS. Both of those (presumably)
> enable the buddy watchdog via calling it "nmi_watchdog". It's possible
> that some Android phones are using the buddy watchdog too but I'm not
> aware of it. I don't believe Pixel is using it, though that could
> change in the future.

I thought what motivated the buddy watchdog was patches implementing
this approach on Android for ARM that lacked an NMI based hard lockup
detector?
Anyway, while the buddy watchdog is in use by Google servers the
nmi_watchdog file has an actively detrimental effect on that.
Specifically the nmi_watchdog file having the value of "1" disables
certain event groups for certain metrics in the perf tool as it is
assumed there are too few performance counters due to the NMI watchdog
stealing one. We want groups of events so that events are scheduled
together and metrics are more accurate, we don't want groups that fail
to schedule because of a lack of counters.

> IMO at this point "nmi watchdog" is simply a synonym for the
> hardlockup detector. That was what it looked like in the code before I
> started messing around and adding the buddy lockup detector and it's
> how it is now. While it's unfortunate that there are two names for the
> same thing, I don't personally think that should change at this point.
> FWIW, even the "buddy" watchdog relies on NMIs to actually get stack
> crawls on stuck cores, so NMI still means something even there.

I think this is misguided. Currently the only use of nmi_watchdog I'm
aware of is by perf where the buddy watchdog's use of it causes issues
(as mentioned above).

> If we want to tell between the perf detector or the buddy detector we
> should add a separate file for it.

We could with perf then having to read from two files rather than one.
Presumably the lack of presence of one file will be sufficient to also
avoid a kernel version check.

> > On Android
> > writing to /proc/sys/kernel/nmi_watchdog would switch from updating
> > the buddy watchdog enable/disable to the NMI watchdog enable/disable,
> > but it is a straightforward patch to make anything doing this update
> > the buddy_watchdog file instead.
>
> Straightforward, but you've got to go find everyone that you break by
> doing this. That's not something I want responsibility for. If you
> want to convince others this is something worthwhile and you've got
> someone signed up to deal with the fallout (if any) then I won't
> object, but it's not something I'd be in support of.

Stuff like this happens, check out this thread:
https://lore.kernel.org/lkml/2025020304-chip-trench-4e56@gregkh/
Imo we shouldn't design in using an actively misleading file name and
incurring extra overhead in perf. Having two files nmi_watchdog and
buddy_watchdog is fine as the latter case currently isn't in
mainstream distro use and people shouldn't care. It also maintains and
correct's perf's behavior when the buddy and not nmi watchdog is in
use.

> > If we have to keep "nmi_watchdog" then we should deprecate it and
> > create an equivalent file with a better name (ie without NMI in it).
> > It'll be moderately annoying in perf to determine whether the NMI
> > watchdog is enabled by having to read two files.
>
> Again, up to you if you want to try to do this, but I'm not really in
> support of it. It doesn't seem terribly hard to make a new file that
> says which hardlockup detector is in use. If that file exists then
> read it. If not then you fallback to what you have today.

I don't mind a file that also says what kind of lockup detector is in
use. I'd like the meaning of nmi_watchdog to keep meaning the kernel
stole a perf counter as this is the behavior long assumed by perf
which I think is the majority or only user of the file. I think the
buddy watchdog being controlled by this file is less than intention
revealing.

Thanks,
Ian

> -Doug