linux-kernel - Re: [RFC PATCH V1] watchdog: Add boot-time selection for hard lockup detector

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAD=FV=VNmjTVxcxgTQqjE7CTkK2NVGbRxFJSwv=yOHU8gj-urQ@mail.gmail.com>
Date: Mon, 6 Oct 2025 16:24:44 -0700
From: Doug Anderson <dianders@...omium.org>
To: Ian Rogers <irogers@...gle.com>
Cc: Jinchao Wang <wangjinchao600@...il.com>, Namhyung Kim <namhyung@...nel.org>, 
	Peter Zijlstra <peterz@...radead.org>, Will Deacon <will@...nel.org>, 
	Yunhui Cui <cuiyunhui@...edance.com>, akpm@...ux-foundation.org, 
	catalin.marinas@....com, maddy@...ux.ibm.com, mpe@...erman.id.au, 
	npiggin@...il.com, christophe.leroy@...roup.eu, tglx@...utronix.de, 
	mingo@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com, hpa@...or.com, 
	acme@...nel.org, mark.rutland@....com, alexander.shishkin@...ux.intel.com, 
	jolsa@...nel.org, adrian.hunter@...el.com, kan.liang@...ux.intel.com, 
	kees@...nel.org, masahiroy@...nel.org, aliceryhl@...gle.com, ojeda@...nel.org, 
	thomas.weissschuh@...utronix.de, xur@...gle.com, ruanjinjie@...wei.com, 
	gshan@...hat.com, maz@...nel.org, suzuki.poulose@....com, 
	zhanjie9@...ilicon.com, yangyicong@...ilicon.com, gautam@...ux.ibm.com, 
	arnd@...db.de, zhao.xichao@...o.com, rppt@...nel.org, lihuafei1@...wei.com, 
	coxu@...hat.com, jpoimboe@...nel.org, yaozhenguo1@...il.com, 
	luogengkun@...weicloud.com, max.kellermann@...os.com, tj@...nel.org, 
	yury.norov@...il.com, thorsten.blum@...ux.dev, x86@...nel.org, 
	linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org, 
	linuxppc-dev@...ts.ozlabs.org, linux-perf-users@...r.kernel.org
Subject: Re: [RFC PATCH V1] watchdog: Add boot-time selection for hard lockup detector

Hi,

On Mon, Oct 6, 2025 at 2:30 PM Ian Rogers <irogers@...gle.com> wrote:
>
> On Tue, Sep 16, 2025 at 11:14 PM Jinchao Wang <wangjinchao600@...il.com> wrote:
> >
> > On Tue, Sep 16, 2025 at 10:35:46PM -0700, Namhyung Kim wrote:
> > > Hello,
> > >
> > > On Tue, Sep 16, 2025 at 10:13:12PM -0700, Ian Rogers wrote:
> > > > On Tue, Sep 16, 2025 at 6:47 PM Jinchao Wang <wangjinchao600@...il.com> wrote:
> > > > >
> > > > > On Tue, Sep 16, 2025 at 05:03:48PM -0700, Ian Rogers wrote:
> > > > > > On Tue, Sep 16, 2025 at 7:51 AM Jinchao Wang <wangjinchao600@...il.com> wrote:
> > > > > > >
> > > > > > > Currently, the hard lockup detector is selected at compile time via
> > > > > > > Kconfig, which requires a kernel rebuild to switch implementations.
> > > > > > > This is inflexible, especially on systems where a perf event may not
> > > > > > > be available or may be needed for other tasks.
> > > > > > >
> > > > > > > This commit refactors the hard lockup detector to replace a rigid
> > > > > > > compile-time choice with a flexible build-time and boot-time solution.
> > > > > > > The patch supports building the kernel with either detector
> > > > > > > independently, or with both. When both are built, a new boot parameter
> > > > > > > `hardlockup_detector="perf|buddy"` allows the selection at boot time.
> > > > > > > This is a more robust and user-friendly design.
> > > > > > >
> > > > > > > This patch is a follow-up to the discussion on the kernel mailing list
> > > > > > > regarding the preference and future of the hard lockup detectors. It
> > > > > > > implements a flexible solution that addresses the community's need to
> > > > > > > select an appropriate detector at boot time.
> > > > > > >
> > > > > > > The core changes are:
> > > > > > > - The `perf` and `buddy` watchdog implementations are separated into
> > > > > > >   distinct functions (e.g., `watchdog_perf_hardlockup_enable`).
> > > > > > > - Global function pointers are introduced (`watchdog_hardlockup_enable_ptr`)
> > > > > > >   to serve as a single API for the entire feature.
> > > > > > > - A new `hardlockup_detector=` boot parameter is added to allow the
> > > > > > >   user to select the desired detector at boot time.
> > > > > > > - The Kconfig options are simplified by removing the complex
> > > > > > >   `HARDLOCKUP_DETECTOR_PREFER_BUDDY` and allowing both detectors to be
> > > > > > >   built without mutual exclusion.
> > > > > > > - The weak stubs are updated to call the new function pointers,
> > > > > > >   centralizing the watchdog logic.
> > > > > >
> > > > > > What is the impact on  /proc/sys/kernel/nmi_watchdog ? Is that
> > > > > > enabling and disabling whatever the boot time choice was? I'm not sure
> > > > > > why this has to be a boot time option given the ability to configure
> > > > > > via /proc/sys/kernel/nmi_watchdog.
> > > > > The new hardlockup_detector boot parameter and the existing
> > > > > /proc/sys/kernel/nmi_watchdog file serve different purposes.
> > > > >
> > > > > The boot parameter selects the type of hard lockup detector (perf or buddy).
> > > > > This choice is made once at boot.
> > > > >
> > > > >  /proc/sys/kernel/nmi_watchdog, on the other hand, is only a simple on/off
> > > > > switch for the currently selected detector. It does not change the detector's
> > > > > type.
> > > >
> > > > So the name "nmi_watchdog" for the buddy watchdog is wrong for fairly
> > > > obvious naming reasons but also because we can't differentiate when a
> > > > perf event has been taken or not - this impacts perf that is choosing
> > > > not to group events in metrics because of it, reducing the metric's
> > > > accuracy. We need an equivalent "buddy_watchdog" file to the
> > > > "nmi_watchdog" file. If we have such a file then if I did "echo 1 >
> > > > /proc/sys/kernel/nmi_watchdog" I'd expect the buddy watchdog to be
> > > > disabled and the perf event one to be enabled. Similarly, if I did
> > > > "echo 1 > /proc/sys/kernel/buddy_watchdog" then I would expect the
> > > > perf event watchdog to be disabled and the buddy one enabled. If I did
> > > >  "echo 0 > /proc/sys/kernel/nmi_watchdog; echo 0 >
> > > > /proc/sys/kernel/buddy_watchdog" then I'd expect neither to be
> > > > enabled. I don't see why choosing the type of watchdog implementation
> > > > at boot time is particularly desirable. It seems sensible to default
> > > > normal people to using the buddy watchdog (more perf events, power...)
> > > > and  CONFIG_DEBUG_KERNEL type people to using the perf event one. As
> > > > the "nmi_watchdog" file may be assumed to control the buddy watchdog,
> > > > perhaps a compatibility option (where the "nmi_watchdog" file controls
> > > > the buddy watchdog) is needed so that user code has time to migrate.
> > >
> > > Sounds good to me.  For perf tools, it'd be great if we can have a run-
> > > time check which watchdog is selected.
> > Considering backward compatibility, I prefer to keep
> > /proc/sys/kernel/nmi_watchdog and introduce a new file called
> > /proc/sys/kernel/hardlockup_detector_type, which only shows the default string
> > or the boot parameter.
>
> Is there code using the buddy watchdog that cares about the
> /proc/sys/kernel/nmi_watchdog file? My assumption is that everything
> except Android is using the NMI watchdog, so a new
> /proc/sys/kernel/buddy_watchdog file doesn't impact them.

Buddy watchdog has been out there for a few years. At Google, I know
it's used by everything except Android. AKA I believe it is used in
Google's servers and also in ChromeOS. Both of those (presumably)
enable the buddy watchdog via calling it "nmi_watchdog". It's possible
that some Android phones are using the buddy watchdog too but I'm not
aware of it. I don't believe Pixel is using it, though that could
change in the future.

IMO at this point "nmi watchdog" is simply a synonym for the
hardlockup detector. That was what it looked like in the code before I
started messing around and adding the buddy lockup detector and it's
how it is now. While it's unfortunate that there are two names for the
same thing, I don't personally think that should change at this point.
FWIW, even the "buddy" watchdog relies on NMIs to actually get stack
crawls on stuck cores, so NMI still means something even there.

If we want to tell between the perf detector or the buddy detector we
should add a separate file for it.

> On Android
> writing to /proc/sys/kernel/nmi_watchdog would switch from updating
> the buddy watchdog enable/disable to the NMI watchdog enable/disable,
> but it is a straightforward patch to make anything doing this update
> the buddy_watchdog file instead.

Straightforward, but you've got to go find everyone that you break by
doing this. That's not something I want responsibility for. If you
want to convince others this is something worthwhile and you've got
someone signed up to deal with the fallout (if any) then I won't
object, but it's not something I'd be in support of.


> If we have to keep "nmi_watchdog" then we should deprecate it and
> create an equivalent file with a better name (ie without NMI in it).
> It'll be moderately annoying in perf to determine whether the NMI
> watchdog is enabled by having to read two files.

Again, up to you if you want to try to do this, but I'm not really in
support of it. It doesn't seem terribly hard to make a new file that
says which hardlockup detector is in use. If that file exists then
read it. If not then you fallback to what you have today.

-Doug