Message-ID: <d54fe26d-0f11-e422-d5f3-841c663b9d6f@infradead.org>
Date: Fri, 21 Apr 2023 16:59:27 -0700
From: Randy Dunlap <rdunlap@...radead.org>
To: Douglas Anderson <dianders@...omium.org>,
Petr Mladek <pmladek@...e.com>,
Andrew Morton <akpm@...ux-foundation.org>
Cc: Lecopzer Chen <lecopzer.chen@...iatek.com>,
Daniel Thompson <daniel.thompson@...aro.org>,
Stephen Boyd <swboyd@...omium.org>,
Chen-Yu Tsai <wens@...e.org>,
linux-arm-kernel@...ts.infradead.org,
kgdb-bugreport@...ts.sourceforge.net,
Marc Zyngier <maz@...nel.org>,
linux-perf-users@...r.kernel.org,
Mark Rutland <mark.rutland@....com>,
Masayoshi Mizuma <msys.mizuma@...il.com>,
Will Deacon <will@...nel.org>, ito-yuichi@...itsu.com,
Sumit Garg <sumit.garg@...aro.org>,
Catalin Marinas <catalin.marinas@....com>,
Colin Cross <ccross@...roid.com>,
Matthias Kaehlcke <mka@...omium.org>,
Guenter Roeck <groeck@...omium.org>,
Tzung-Bi Shih <tzungbi@...omium.org>,
Alexander Potapenko <glider@...gle.com>,
AngeloGioacchino Del Regno
<angelogioacchino.delregno@...labora.com>,
Dan Williams <dan.j.williams@...el.com>,
Geert Uytterhoeven <geert+renesas@...der.be>,
Ingo Molnar <mingo@...nel.org>,
John Ogness <john.ogness@...utronix.de>,
Josh Poimboeuf <jpoimboe@...nel.org>,
Juergen Gross <jgross@...e.com>,
Kees Cook <keescook@...omium.org>,
Laurent Dufour <ldufour@...ux.ibm.com>,
Liam Howlett <liam.howlett@...cle.com>,
Marco Elver <elver@...gle.com>,
Matthias Brugger <matthias.bgg@...il.com>,
Michael Ellerman <mpe@...erman.id.au>,
Miguel Ojeda <ojeda@...nel.org>,
Nathan Chancellor <nathan@...nel.org>,
Nick Desaulniers <ndesaulniers@...gle.com>,
"Paul E. McKenney" <paulmck@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Rasmus Villemoes <linux@...musvillemoes.dk>,
Sami Tolvanen <samitolvanen@...gle.com>,
Stefano Stabellini <sstabellini@...nel.org>,
Vlastimil Babka <vbabka@...e.cz>,
Zhaoyang Huang <zhaoyang.huang@...soc.com>,
Zhen Lei <thunder.leizhen@...wei.com>,
linux-kernel@...r.kernel.org, linux-mediatek@...ts.infradead.org
Subject: Re: [PATCH] hardlockup: detect hard lockups using secondary (buddy) cpus
Hi--
On 4/21/23 15:53, Douglas Anderson wrote:
> From: Colin Cross <ccross@...roid.com>
>
> Implement a hardlockup detector that can be enabled on SMP systems
> that don't have an arch provided one or one implemented atop perf by
Is that one or more?
> using interrupts on other cpus. Each cpu will use its softlockup
> hrtimer to check that the next cpu is processing hrtimer interrupts by
> verifying that a counter is increasing.
>
> NOTE: unlike the other hard lockup detectors, the buddy one can't
> easily provide a backtrace on the CPU that locked up. It relies on
> some other mechanism in the system to get information about the locked
> up CPUs. This could be support for NMI backtraces like [1], it could
> be a mechanism for printing the PC of locked CPUs like [2], or it
> could be something else.
>
> This style of hardlockup detector originated in some downstream
> Android trees and has been rebased on / carried in ChromeOS trees for
> quite a long time for use on arm and arm64 boards. Historically on
> these boards we've leveraged mechanism [2] to get information about
> hung CPUs, but we could move to [1].
>
> NOTE: the buddy system is not really useful to enable on any
> architectures that have a better mechanism. On arm64 folks have been
> trying to get a better mechanism for years and there has even been
> recent posts of patches adding support [3]. However, nothing about the
> buddy system is tied to arm64 and several archs (even arm32, where it
> was originally developed) could find it useful.
>
> [1] https://lore.kernel.org/r/20230419225604.21204-1-dianders@chromium.org
> [2] https://issuetracker.google.com/172213129
> [3] https://lore.kernel.org/linux-arm-kernel/20220903093415.15850-1-lecopzer.chen@mediatek.com/
>
> Signed-off-by: Colin Cross <ccross@...roid.com>
> Signed-off-by: Matthias Kaehlcke <mka@...omium.org>
> Signed-off-by: Guenter Roeck <groeck@...omium.org>
> Signed-off-by: Tzung-Bi Shih <tzungbi@...omium.org>
> Signed-off-by: Douglas Anderson <dianders@...omium.org>
> ---
> This patch has been rebased in ChromeOS kernel trees many times, and
> each time someone had to do work on it they added their
> Signed-off-by. I've included those here. I've also left the author as
> Colin Cross since the core code is still his.
>
> I'll also note that the CC list is pretty giant, but that's what
> get_maintainers came up with (plus a few other folks I thought would
> be interested). As far as I can tell, there's no true MAINTAINER
> listed for the existing watchdog code. Assuming people don't hate
> this, maybe it would go through Andrew Morton's tree?
>
>  include/linux/nmi.h         |  18 ++++-
>  kernel/Makefile             |   1 +
>  kernel/watchdog.c           |  24 ++++--
>  kernel/watchdog_buddy_cpu.c | 141 ++++++++++++++++++++++++++++++++++++
>  lib/Kconfig.debug           |  19 ++++-
>  5 files changed, 192 insertions(+), 11 deletions(-)
>  create mode 100644 kernel/watchdog_buddy_cpu.c
>
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 39d1d93164bd..9eb86bc9f5ee 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1036,6 +1036,9 @@ config HARDLOCKUP_DETECTOR_PERF
> config HARDLOCKUP_CHECK_TIMESTAMP
> bool
>
> +config HARDLOCKUP_DETECTOR_CORE
> + bool
> +
> #
> # arch/ can define HAVE_HARDLOCKUP_DETECTOR_ARCH to provide their own hard
> # lockup detector rather than the perf based detector.
> @@ -1045,6 +1048,7 @@ config HARDLOCKUP_DETECTOR
> depends on DEBUG_KERNEL && !S390
> depends on HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_ARCH
> select LOCKUP_DETECTOR
> + select HARDLOCKUP_DETECTOR_CORE
> select HARDLOCKUP_DETECTOR_PERF if HAVE_HARDLOCKUP_DETECTOR_PERF
> help
> Say Y here to enable the kernel to act as a watchdog to detect
> @@ -1055,9 +1059,22 @@ config HARDLOCKUP_DETECTOR
> chance to run. The current stack trace is displayed upon detection
> and the system will stay locked up.
>
> +config HARDLOCKUP_DETECTOR_BUDDY_CPU
> + bool "Buddy CPU hardlockup detector"
> + depends on DEBUG_KERNEL && SMP
> + depends on !HARDLOCKUP_DETECTOR && !HAVE_NMI_WATCHDOG
> + depends on !S390
> + select HARDLOCKUP_DETECTOR_CORE
> + select SOFTLOCKUP_DETECTOR
> + help
> + Say Y here to enable a hardlockup detector where CPUs check
> + each other for lockup. Each cpu uses its softlockup hrtimer
Preferably "CPU" (uppercase) here
> + to check that the next cpu is processing hrtimer interrupts by
and "CPU" here as well
> + verifying that a counter is increasing.
> +
> config BOOTPARAM_HARDLOCKUP_PANIC
> bool "Panic (Reboot) On Hard Lockups"
> - depends on HARDLOCKUP_DETECTOR
> + depends on HARDLOCKUP_DETECTOR_CORE
> help
> Say Y here to enable the kernel to panic on "hard lockups",
> which are bugs that cause the kernel to loop in kernel
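
For anyone skimming the thread, the check described in the commit message
(each CPU verifying that the next CPU's timer counter keeps increasing) can
be sketched in plain userspace C roughly as below. This is only an
illustration under stated assumptions, not the code from
kernel/watchdog_buddy_cpu.c: NR_SIM_CPUS, tick_count, tick_seen, buddy_of(),
timer_tick() and check_buddy() are all hypothetical names, the "CPUs" are
just array slots, and CPU hotplug and the real hrtimer plumbing are left out.

```c
#include <stdbool.h>

#define NR_SIM_CPUS 4	/* hypothetical: fixed number of simulated CPUs */

/* Bumped from each simulated CPU's own timer tick. */
static unsigned long tick_count[NR_SIM_CPUS];
/* Value of each CPU's counter as last observed by the CPU checking it. */
static unsigned long tick_seen[NR_SIM_CPUS];

/* The buddy is simply the next CPU, wrapping around at the end. */
static unsigned int buddy_of(unsigned int cpu)
{
	return (cpu + 1) % NR_SIM_CPUS;
}

/* Stands in for the per-CPU timer interrupt: proves this CPU is alive. */
static void timer_tick(unsigned int cpu)
{
	tick_count[cpu]++;
}

/*
 * Run by 'cpu' from its own tick. Returns true if the buddy's counter
 * has not advanced since the previous check, i.e. the buddy looks stuck.
 */
static bool check_buddy(unsigned int cpu)
{
	unsigned int buddy = buddy_of(cpu);
	unsigned long now = tick_count[buddy];

	if (now == tick_seen[buddy])
		return true;

	tick_seen[buddy] = now;
	return false;
}
```

In this sketch a CPU that stops calling timer_tick() is reported by the CPU
just before it in the ring on that CPU's next check_buddy() call. Note the
check only says *that* the buddy is stuck; as the commit message points out,
getting a backtrace from the stuck CPU needs some other mechanism.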
--
~Randy