Message-ID: <aa383f43-b57d-47f7-9b54-1169956586cb@redhat.com>
Date: Tue, 5 Mar 2024 17:31:46 +0100
From: Jocelyn Falempe <jfalempe@...hat.com>
To: John Ogness <john.ogness@...utronix.de>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Daniel Vetter <daniel@...ll.ch>, Andrew Morton <akpm@...ux-foundation.org>,
"Peter Zijlstra (Intel)" <peterz@...radead.org>,
Josh Poimboeuf <jpoimboe@...nel.org>, Arnd Bergmann <arnd@...db.de>,
Kefeng Wang <wangkefeng.wang@...wei.com>, Lukas Wunner <lukas@...ner.de>,
Uros Bizjak <ubizjak@...il.com>, "Guilherme G. Piccoli"
<gpiccoli@...lia.com>, Petr Mladek <pmladek@...e.com>,
Daniel Thompson <daniel.thompson@...aro.org>,
Douglas Anderson <dianders@...omium.org>
Cc: "dri-devel@...ts.freedesktop.org" <dri-devel@...ts.freedesktop.org>,
David Airlie <airlied@...hat.com>, Thomas Zimmermann <tzimmermann@...e.de>
Subject: Re: [RFC] How to test panic handlers, without crashing the kernel
On 04/03/2024 22:12, John Ogness wrote:
> [Added printk maintainer and kdb folks]
>
> Hi Jocelyn,
>
> On 2024-03-01, Jocelyn Falempe <jfalempe@...hat.com> wrote:
>> While writing a panic handler for drm devices [1], I needed a way to
>> test it without crashing the machine.
>> So from debugfs, I called
>> atomic_notifier_call_chain(&panic_notifier_list, ...), but it has the
>> side effect of calling all other panic notifiers registered.
>>
>> So Sima suggested to move that to the generic panic code, and test all
>> panic notifiers with a dedicated debugfs interface.
>>
>> I can move that code to kernel/, but before doing that, I would like to
>> know if you think that's the right way to test the panic code.
>
> One major event that happens before the panic notifiers is
> panic_other_cpus_shutdown(). This can cause special situations because
> CPUs can be stopped while holding resources (such as raw spin
> locks). And these are the situations that make it so tricky to have safe
> and reliable notifiers. If triggered from debugfs, these situations will
> never occur.
>
> My concern is that the tests via debugfs will always succeed, but in the
> real world panic notifiers are failing/hanging/exploding. IMHO useful
> panic testing requires real panic'ing.
Yes, but for the drm panic, it's still useful to check that the output
is working (i.e. that the color format and the framebuffer address
are good). Also, I've reworked the debugfs patch so that I don't have to
call all panic notifiers. It's now per device, so you can trigger the
drm_panic handler on a specific GPU.
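
To illustrate the difference between the two approaches, here is a minimal
userspace sketch. It is not kernel code and not the actual patch: every
sim_* name is hypothetical and only mimics the shape of a notifier chain.
Walking the whole chain runs every registered handler, while a per-device
trigger runs only the one you asked for.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for a notifier block (hypothetical, NOT the kernel API). */
struct sim_notifier {
	int (*call)(struct sim_notifier *nb, unsigned long event, void *data);
	struct sim_notifier *next;
	int hits;		/* number of times this handler has run */
};

struct sim_chain {
	struct sim_notifier *head;
};

/* Add a handler at the head of the chain. */
static void sim_register(struct sim_chain *chain, struct sim_notifier *nb)
{
	assert(chain && nb);
	nb->next = chain->head;
	chain->head = nb;
}

/* Walk the whole chain: every registered handler runs. Returns how many ran. */
static int sim_call_chain(struct sim_chain *chain, unsigned long event, void *data)
{
	int ran = 0;

	for (struct sim_notifier *nb = chain->head; nb; nb = nb->next) {
		nb->call(nb, event, data);
		ran++;
	}
	return ran;
}

/* Per-device trigger: run exactly one handler and leave the others alone. */
static int sim_trigger_one(struct sim_notifier *nb)
{
	return nb->call(nb, 0, NULL);
}

/* Example handler that only records that it was invoked. */
static int sim_count_hit(struct sim_notifier *nb, unsigned long event, void *data)
{
	(void)event;
	(void)data;
	nb->hits++;
	return 0;
}
```

With two handlers registered, sim_call_chain() bumps both hit counters,
whereas sim_trigger_one() on the "drm" handler bumps only that one. Neither
path reproduces a real panic context (stopped CPUs, held locks), which is
the limitation John points out above.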
>
> For my printk panic tests I trigger unknown NMIs while booting with
> "unknown_nmi_panic". Particularly with Qemu this is quite easy and
> amazingly effective at catching problems. In fact, a recent printk
> series [0] fixed seven issues that were found through this method of
> panic testing.
Thanks for this tip. I used to test with "echo c > /proc/sysrq-trigger"
in the guest, but that's more permissive. I'm now testing with
"virsh inject-nmi", and drm_panic is still working.
>
>> The second question is how to simulate a panic context in a
>> non-destructive way, so we can test the panic notifiers in CI, without
>> crashing the machine.
>
> I'm wondering if a "fake panic" can be implemented that quiesces all the
> other CPUs via NMI (similar to kdb) and then calls the panic
> notifiers. And finally releases everything back to normal. That might
> produce a fairly realistic panic situation and should be fairly
> non-destructive (depending on what the notifiers do and how long they
> take).
>
>> The worst case for a panic notifier, is when the panic occurs in NMI
>> context, but I don't know how to simulate that. The goal would be to
>> find early if a panic notifier tries to sleep, or do other things that
>> are not allowed in a panic context.
>
> Maybe with a new boot argument "unknown_nmi_fake_panic" that triggers
> the fake panic instead?
>
> John Ogness
>
> [0] https://lore.kernel.org/lkml/20240207134103.1357162-1-john.ogness@linutronix.de
>
Best regards,
--
Jocelyn