Message-ID: <CAD=FV=V0tzeqCrFUrytbe0OByYkC23i61H+jdgZRXfMKbShMcA@mail.gmail.com>
Date:   Mon, 1 May 2023 07:04:46 -0700
From:   Doug Anderson <dianders@...omium.org>
To:     Randy Dunlap <rdunlap@...radead.org>
Cc:     Petr Mladek <pmladek@...e.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Andi Kleen <ak@...ux.intel.com>,
        Mark Rutland <mark.rutland@....com>,
        linux-arm-kernel@...ts.infradead.org,
        Stephane Eranian <eranian@...gle.com>,
        Sumit Garg <sumit.garg@...aro.org>,
        Chen-Yu Tsai <wens@...e.org>, linux-perf-users@...r.kernel.org,
        Marc Zyngier <maz@...nel.org>,
        Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will@...nel.org>,
        Lecopzer Chen <lecopzer.chen@...iatek.com>,
        Daniel Thompson <daniel.thompson@...aro.org>,
        kgdb-bugreport@...ts.sourceforge.net, ito-yuichi@...itsu.com,
        ravi.v.shankar@...el.com, Masayoshi Mizuma <msys.mizuma@...il.com>,
        ricardo.neri@...el.com, Ian Rogers <irogers@...gle.com>,
        Stephen Boyd <swboyd@...omium.org>,
        Colin Cross <ccross@...roid.com>,
        Matthias Kaehlcke <mka@...omium.org>,
        Guenter Roeck <groeck@...omium.org>,
        Tzung-Bi Shih <tzungbi@...omium.org>,
        Alexander Potapenko <glider@...gle.com>,
        AngeloGioacchino Del Regno 
        <angelogioacchino.delregno@...labora.com>,
        David Gow <davidgow@...gle.com>,
        Geert Uytterhoeven <geert+renesas@...der.be>,
        Ingo Molnar <mingo@...nel.org>,
        Juergen Gross <jgross@...e.com>,
        Kees Cook <keescook@...omium.org>,
        Laurent Dufour <ldufour@...ux.ibm.com>,
        Liam Howlett <liam.howlett@...cle.com>,
        Masahiro Yamada <masahiroy@...nel.org>,
        Matthias Brugger <matthias.bgg@...il.com>,
        Michael Ellerman <mpe@...erman.id.au>,
        Miguel Ojeda <ojeda@...nel.org>,
        Nick Desaulniers <ndesaulniers@...gle.com>,
        "Paul E. McKenney" <paulmck@...nel.org>,
        Rasmus Villemoes <linux@...musvillemoes.dk>,
        Sami Tolvanen <samitolvanen@...gle.com>,
        Stefano Stabellini <sstabellini@...nel.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        Zhaoyang Huang <zhaoyang.huang@...soc.com>,
        Zhen Lei <thunder.leizhen@...wei.com>,
        linux-kernel@...r.kernel.org, linux-mediatek@...ts.infradead.org
Subject: Re: [PATCH v2] hardlockup: detect hard lockups using secondary
 (buddy) CPUs

Hi,


On Fri, Apr 28, 2023 at 5:36 PM Randy Dunlap <rdunlap@...radead.org> wrote:
>
> Hi--
>
> On 4/28/23 16:37, Douglas Anderson wrote:
> > From: Colin Cross <ccross@...roid.com>
> >
> > Implement a hardlockup detector that doesn't need any extra
> > arch-specific support code to detect lockups. Instead of using
> > something arch-specific we will use the buddy system, where each CPU
> > watches out for another one. Specifically, each CPU will use its
> > softlockup hrtimer to check that the next CPU is processing hrtimer
> > interrupts by verifying that a counter is increasing.
> >
> > NOTE: unlike the other hard lockup detectors, the buddy one can't
> > easily show what's happening on the CPU that locked up just by doing a
> > simple backtrace. It relies on some other mechanism in the system to
> > get information about the locked up CPUs. This could be support for
> > NMI backtraces like [1], it could be a mechanism for printing the PC
> > of locked CPUs at panic time like [2] / [3], or it could be something
> > else. Even though that means we still rely on arch-specific code, this
> > arch-specific code seems to often be implemented even on architectures
> > that don't have a hardlockup detector.
> >
> > This style of hardlockup detector originated in some downstream
> > Android trees and has been rebased on / carried in ChromeOS trees for
> > quite a long time for use on arm and arm64 boards. Historically on
> > these boards we've leveraged mechanism [2] / [3] to get information
> > about hung CPUs, but we could move to [1].
> >
> > Although the original motivation for the buddy system was for use on
> > systems without an arch-specific hardlockup detector, it can still be
> > useful to use even on systems that _do_ have an arch-specific
> > hardlockup detector. On x86, for instance, there is a 24-part patch
> > series [4] in progress switching the arch-specific hard lockup
> > detector from a scarce perf counter to a less-scarce hardware
> > resource. Potentially the buddy system could be a simpler alternative
> > to free up the perf counter but still get hard lockup detection.
> >
> > Overall, pros (+) and cons (-) of the buddy system compared to an
> > arch-specific hardlockup detector:
> > + Usable on systems that don't have an arch-specific hardlockup
> >   detector, like arm32 and arm64 (though it's being worked on for
> >   arm64 [5]).
> > + May free up scarce hardware resources.
> > + If a CPU totally goes out to lunch (can't process NMIs) the buddy
> >   system could still detect the problem (though it would be unlikely
> >   to be able to get a stack trace).
> > - If all CPUs are hard locked up at the same time the buddy system
> >   can't detect it.
> > - If we don't have SMP we can't use the buddy system.
> > - The buddy system needs an arch-specific mechanism (possibly NMI
> >   backtrace) to get info about the locked up CPU.
> >
> > [1] https://lore.kernel.org/r/20230419225604.21204-1-dianders@chromium.org
> > [2] https://issuetracker.google.com/172213129
> > [3] https://docs.kernel.org/trace/coresight/coresight-cpu-debug.html
> > [4] https://lore.kernel.org/lkml/20230301234753.28582-1-ricardo.neri-calderon@linux.intel.com/
> > [5] https://lore.kernel.org/linux-arm-kernel/20220903093415.15850-1-lecopzer.chen@mediatek.com/
> >
> > Signed-off-by: Colin Cross <ccross@...roid.com>
> > Signed-off-by: Matthias Kaehlcke <mka@...omium.org>
> > Signed-off-by: Guenter Roeck <groeck@...omium.org>
> > Signed-off-by: Tzung-Bi Shih <tzungbi@...omium.org>
> > Signed-off-by: Douglas Anderson <dianders@...omium.org>
> > ---
> > This patch has been rebased in ChromeOS kernel trees many times, and
> > each time someone had to do work on it they added their
> > Signed-off-by. I've included those here. I've also left the author as
> > Colin Cross since the core code is still his.
> >
> > I'll also note that the CC list is pretty giant, but that's what
> > get_maintainer.pl came up with (plus a few other folks I thought would
> > be interested). As far as I can tell, there's no true MAINTAINER
> > listed for the existing watchdog code. Assuming people don't hate
> > this, maybe it would go through Andrew Morton's tree?
> >
> > Changes in v2:
> > - cpu => CPU.
> > - Reworked description and Kconfig based on v1 discussion.
>
> or at least some of the comments from v1. :(

Oh no! My email program confused me and I thought all of your cpu=>CPU
stuff was in the patch description, not in the Kconfig. I'll whip up a
quick v3.

-Doug
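
For readers skimming the quoted description: the buddy check it
describes (each CPU verifies that the next CPU's hrtimer interrupt
counter keeps increasing) can be sketched in plain userspace C as
below. This is only a toy model under stated assumptions, not the code
from the patch. NR_CPUS, the array names, and the helpers buddy_of(),
timer_tick() and buddy_is_stuck() are all made up for illustration,
and the sketch ignores CPU hotplug, per-CPU data, thresholds and
locking, which a kernel implementation would have to handle.

```c
#include <stdbool.h>

#define NR_CPUS 4 /* hypothetical CPU count for this toy model */

/* Per-CPU count of timer interrupts handled (illustrative name). */
static unsigned long hrtimer_interrupts[NR_CPUS];
/* Value of each CPU's counter the last time its watcher looked. */
static unsigned long hrtimer_interrupts_saved[NR_CPUS];

/* The buddy of a CPU is the next CPU, wrapping around at the end. */
static unsigned int buddy_of(unsigned int cpu)
{
	return (cpu + 1) % NR_CPUS;
}

/* Stand-in for a CPU's own timer interrupt: prove it is still alive. */
static void timer_tick(unsigned int cpu)
{
	hrtimer_interrupts[cpu]++;
}

/*
 * Run by 'cpu' to check its buddy. Returns true if the buddy's counter
 * has not moved since the previous check, i.e. the buddy appears to
 * have stopped taking timer interrupts.
 */
static bool buddy_is_stuck(unsigned int cpu)
{
	unsigned int buddy = buddy_of(cpu);
	unsigned long now = hrtimer_interrupts[buddy];

	if (now == hrtimer_interrupts_saved[buddy])
		return true;

	hrtimer_interrupts_saved[buddy] = now;
	return false;
}
```

In this model a watcher only learns that its buddy stopped ticking. It
has no view of what the buddy is executing, which matches the NOTE in
the description about needing some other mechanism to get information
out of the locked-up CPU.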
