Message-ID: <CAD=FV=WWUiCi6bZCs_gseFpDDWNkuJMoL6XCftEo6W7q6jRCkg@mail.gmail.com>
Date: Thu, 20 Mar 2025 09:06:37 -0700
From: Doug Anderson <dianders@...omium.org>
To: Ian Rogers <irogers@...gle.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...nel.org>,
linux-perf-users <linux-perf-users@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>,
"Liang, Kan" <kan.liang@...ux.intel.com>, Arnaldo Carvalho de Melo <acme@...nel.org>,
Adrian Hunter <adrian.hunter@...el.com>, Namhyung Kim <namhyung@...nel.org>,
Stephane Eranian <eranian@...gle.com>, Jiri Olsa <jolsa@...nel.org>,
Mark Rutland <mark.rutland@....com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>, Leo Yan <leo.yan@....com>,
James Clark <james.clark@...aro.org>, Will Deacon <will@...nel.org>,
Thomas Richter <tmricht@...ux.ibm.com>, Vince Weaver <vincent.weaver@...ne.edu>,
Petr Mladek <pmladek@...e.com>
Subject: Re: Remove the "perf" hard lock up detector (watchdog) from the kernel?
Hi,

+Petr who helped a bunch with getting the buddy watchdog integrated.

On Mon, Mar 17, 2025 at 2:26 PM Ian Rogers <irogers@...gle.com> wrote:
>
> Hi,
>
> The kernel tree has two hard lockup detectors. The perf one uses a
> perf counter to generate NMI interrupts and detect a lack of forward
> progress, whereas the buddy approach uses the soft lockup hrtimer to
> check the next CPU is progressing. Doug Anderson
> <dianders@...omium.org> recently questioned:
>
> https://lore.kernel.org/all/CAD=FV=WfB6inJPuwfhbw4mtFBYpr+3ot2J+SJAZ3pT3t4fW7cw@mail.gmail.com/
> ...but I'd also have to ask: is there a reason you're using the "perf"
> hard-lockup detector instead of the buddy one? In my mind, the "buddy"
> watchdog is better in almost all ways (I believe it's lower power,
> doesn't waste a "perf" controller, and doesn't suffer from frequency
> issues). It's even crossed my mind whether the "perf" lockup detector
> should be deprecated. ;-)
>
> In the perf tool there are warnings associated with the NMI watchdog.
> The metric code also has a flag on metrics where events aren't grouped
> when the NMI watchdog is enabled. For example:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json?h=perf-tools-next#n1916
>
> The warning and breaking of groups is currently inaccurate for the
> buddy hard lockup detector as /proc/sys/kernel/nmi_watchdog is still
> present to enable or disable the buddy detector. That is the perf tool
> is currently warning and breaking event groups stating the NMI
> watchdog is a problem but the kernel is configured to use the buddy
> watchdog.
>
> I'm unaware of a way to determine if the buddy or "perf" counter based
> approach is in use and to correct perf's behavior. A patch adding such
> an ability (say a new file in /proc/sys/kernel), and perhaps new
> abilities to switch watchdog at runtime, seem less desirable than just
> deleting the "perf" counter based hard lock up detector. The perf tool
> could make the NMI warnings and breaking of event groups conditional
> on the running kernel version then.
>
> Are there objections to just deleting the "perf" hard lock up detector
> (watchdog) from the kernel tree? Are there reasons to keep it around
> but just not default?
In the cover letter of the patches to land the buddy hardlockup
detector [1] I talked about some of the pros and cons of the buddy vs
the perf hardlockup detector. Pasting them here:
Overall, pros (+) and cons (-) of the buddy system compared to an
arch-specific hardlockup detector (which might be implemented using
perf):

+ The buddy system is usable on systems that don't have an
  arch-specific hardlockup detector, like arm32 and arm64 (though it's
  being worked on for arm64 [5]).
+ The buddy system may free up scarce hardware resources.
+ If a CPU totally goes out to lunch (can't process NMIs) the buddy
  system could still detect the problem (though it would be unlikely
  to be able to get a stack trace).
+ The buddy system uses the same timer function to pet the hardlockup
  detector on the running CPU as it uses to detect hardlockups on
  other CPUs. Compared to other hardlockup detectors, this means it
  generates fewer interrupts and thus is likely better able to let
  CPUs stay idle longer.
- If all CPUs are hard locked up at the same time the buddy system
  can't detect it.
- If we don't have SMP we can't use the buddy system.
- The buddy system needs an arch-specific mechanism (possibly NMI
  backtrace) to get info about the locked up CPU.
I'd expect that non-SMP systems are quite rare these days (and do they
really have NMI-enabled perf?). I'd also expect that most systems that
have NMI-enabled perf can also handle an NMI-enabled backtrace (arm64
was in that state for a while, but it was a "simple matter of
software" to fix it). That means that the only real downside of the
buddy detector is that it can't detect when all CPUs are locked up at
the same time. Off-list someone pointed out that could possibly happen
with certain classes of bugs where all CPUs could end up trying to
grab the same spinlock...
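
To make that trade-off concrete, here is a toy model of the buddy
scheme. It is only a sketch and not kernel code: ToyCpu,
run_buddy_round and simulate are made-up names, and one "round" stands
in for one period of the timer. Each CPU that is still running its
timer bumps its own counter and then checks whether the next CPU's
counter has moved since the previous check.

```python
from dataclasses import dataclass


@dataclass
class ToyCpu:
    """Hypothetical stand-in for one CPU's watchdog state (not a kernel struct)."""
    locked_up: bool = False          # True if this CPU no longer runs its timer
    timer_ticks: int = 0             # bumped each time this CPU's timer fires
    last_seen_buddy_ticks: int = -1  # buddy's tick count at our previous check


def run_buddy_round(cpus):
    """Run one timer period; return indices of CPUs reported as locked up."""
    reported = []
    for i, cpu in enumerate(cpus):
        if cpu.locked_up:
            continue  # a locked-up CPU neither pets itself nor checks its buddy
        cpu.timer_ticks += 1  # "pet" our own watchdog
        buddy_idx = (i + 1) % len(cpus)
        buddy = cpus[buddy_idx]
        # No progress since the last check means the buddy looks stuck.
        if buddy.timer_ticks == cpu.last_seen_buddy_ticks:
            reported.append(buddy_idx)
        cpu.last_seen_buddy_ticks = buddy.timer_ticks
    return reported


def simulate(num_cpus, locked, rounds=5):
    """Return the set of CPUs reported over several rounds."""
    cpus = [ToyCpu(locked_up=(i in locked)) for i in range(num_cpus)]
    seen = set()
    for _ in range(rounds):
        seen.update(run_buddy_round(cpus))
    return seen
```

In this model simulate(4, {2}) reports CPU 2, while
simulate(4, {0, 1, 2, 3}) reports nothing because no CPU is left to do
the checking, which is the one case listed as a real downside above.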
That all being said, though:
* Currently, switching between the two hardlockup detectors requires
recompiling the kernel since both can't coexist. From what I know of
the pros/cons, I'd think most people would want the "buddy"
detector. That probably at least means it should be the default.
* I have no idea if it's worth keeping the perf hardlockup detector
code around just for the small number of use cases where it could
catch a bug that the buddy one couldn't. My gut says it's not worth
keeping the "perf" hardlockup detector around.
* NOTE that we still need to support more than one type of
"hardlockup" detector because, from what I understand, the
arch-specific powerpc hardlockup detector _is_ superior to the buddy
lockup detector. See "HARDLOCKUP_DETECTOR_ARCH" in the code.
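
Since the choice is made at build time, one rough way to see which
detector a given kernel was built with is to look at its build
configuration. The sketch below assumes the config text is available
(for example from /proc/config.gz, which only exists when
CONFIG_IKCONFIG_PROC is enabled) and assumes the Kconfig symbols are
named CONFIG_HARDLOCKUP_DETECTOR_BUDDY and
CONFIG_HARDLOCKUP_DETECTOR_PERF alongside the HARDLOCKUP_DETECTOR_ARCH
mentioned above. The function name is made up, and this is not a
general answer for tools that can't read the config.

```python
def compiled_hardlockup_detector(config_text):
    """Return 'buddy', 'perf', 'arch', or None for the text of a kernel config.

    The symbol names checked here are assumptions about the Kconfig naming.
    """
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # "# CONFIG_FOO is not set" lines start with '#', so they are skipped.
        if line.startswith("CONFIG_") and line.endswith("=y"):
            enabled.add(line[:-2])

    candidates = (
        ("buddy", "CONFIG_HARDLOCKUP_DETECTOR_BUDDY"),
        ("perf", "CONFIG_HARDLOCKUP_DETECTOR_PERF"),
        ("arch", "CONFIG_HARDLOCKUP_DETECTOR_ARCH"),
    )
    for name, symbol in candidates:
        if symbol in enabled:
            return name
    return None
```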
-Doug
[1] https://lore.kernel.org/r/20230519101840.v5.14.I6bf789d21d0c3d75d382e7e51a804a7a51315f2c@changeid/