Message-ID: <YoShZVYNAdvvjb7z@alley>
Date: Wed, 18 May 2022 09:33:57 +0200
From: Petr Mladek <pmladek@...e.com>
To: "Guilherme G. Piccoli" <gpiccoli@...lia.com>
Cc: Evan Green <evgreen@...omium.org>, David Gow <davidgow@...gle.com>,
Julius Werner <jwerner@...omium.org>,
Scott Branden <scott.branden@...adcom.com>,
bcm-kernel-feedback-list@...adcom.com,
Sebastian Reichel <sre@...nel.org>,
Linux PM <linux-pm@...r.kernel.org>,
Florian Fainelli <f.fainelli@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>, bhe@...hat.com,
kexec@...ts.infradead.org, LKML <linux-kernel@...r.kernel.org>,
linuxppc-dev@...ts.ozlabs.org, linux-alpha@...r.kernel.org,
linux-arm Mailing List <linux-arm-kernel@...ts.infradead.org>,
linux-edac@...r.kernel.org, linux-hyperv@...r.kernel.org,
linux-leds@...r.kernel.org, linux-mips@...r.kernel.org,
linux-parisc@...r.kernel.org, linux-remoteproc@...r.kernel.org,
linux-s390@...r.kernel.org, linux-tegra@...r.kernel.org,
linux-um@...ts.infradead.org, linux-xtensa@...ux-xtensa.org,
netdev@...r.kernel.org, openipmi-developer@...ts.sourceforge.net,
rcu@...r.kernel.org, sparclinux@...r.kernel.org,
xen-devel@...ts.xenproject.org, x86@...nel.org,
kernel-dev@...lia.com, kernel@...ccoli.net, halves@...onical.com,
fabiomirmar@...il.com, alejandro.j.jimenez@...cle.com,
Andy Shevchenko <andriy.shevchenko@...ux.intel.com>,
Arnd Bergmann <arnd@...db.de>, Borislav Petkov <bp@...en8.de>,
Jonathan Corbet <corbet@....net>, d.hatayama@...fujitsu.com,
dave.hansen@...ux.intel.com, dyoung@...hat.com,
feng.tang@...el.com,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
mikelley@...rosoft.com, hidehiro.kawai.ez@...achi.com,
jgross@...e.com, john.ogness@...utronix.de,
Kees Cook <keescook@...omium.org>, luto@...nel.org,
mhiramat@...nel.org, mingo@...hat.com, paulmck@...nel.org,
peterz@...radead.org, rostedt@...dmis.org,
senozhatsky@...omium.org, Alan Stern <stern@...land.harvard.edu>,
Thomas Gleixner <tglx@...utronix.de>, vgoyal@...hat.com,
vkuznets@...hat.com, Will Deacon <will@...nel.org>,
Alexander Gordeev <agordeev@...ux.ibm.com>,
Andrea Parri <parri.andrea@...il.com>,
Ard Biesheuvel <ardb@...nel.org>,
Benjamin Herrenschmidt <benh@...nel.crashing.org>,
Brian Norris <computersforpeace@...il.com>,
Christian Borntraeger <borntraeger@...ux.ibm.com>,
Christophe JAILLET <christophe.jaillet@...adoo.fr>,
"David S. Miller" <davem@...emloft.net>,
Dexuan Cui <decui@...rosoft.com>,
Doug Berger <opendmb@...il.com>,
Haiyang Zhang <haiyangz@...rosoft.com>,
Hari Bathini <hbathini@...ux.ibm.com>,
Heiko Carstens <hca@...ux.ibm.com>,
Justin Chen <justinpopo6@...il.com>,
"K. Y. Srinivasan" <kys@...rosoft.com>,
Lee Jones <lee.jones@...aro.org>,
Markus Mayer <mmayer@...adcom.com>,
Michael Ellerman <mpe@...erman.id.au>,
Mihai Carabas <mihai.carabas@...cle.com>,
Nicholas Piggin <npiggin@...il.com>,
Paul Mackerras <paulus@...ba.org>, Pavel Machek <pavel@....cz>,
Shile Zhang <shile.zhang@...ux.alibaba.com>,
Stephen Hemminger <sthemmin@...rosoft.com>,
Sven Schnelle <svens@...ux.ibm.com>,
Thomas Bogendoerfer <tsbogend@...ha.franken.de>,
Tianyu Lan <Tianyu.Lan@...rosoft.com>,
Vasily Gorbik <gor@...ux.ibm.com>,
Wang ShaoBo <bobo.shaobowang@...wei.com>,
Wei Liu <wei.liu@...nel.org>,
zhenwei pi <pizhenwei@...edance.com>,
Stephen Boyd <swboyd@...omium.org>
Subject: Re: [PATCH 19/30] panic: Add the panic hypervisor notifier list
On Tue 2022-05-17 13:37:58, Guilherme G. Piccoli wrote:
> On 17/05/2022 10:28, Petr Mladek wrote:
> > [...]
> >>> Disagree here. I'm looping Google maintainers, so they can comment.
> >>> (CCed Evan, David, Julius)
> >>>
> >>> This notifier is clearly a hypervisor notification mechanism. I've fixed
> >>> a locking stuff there (in previous patch), I feel it's low-risk but even
> >>> if it's mid-risk, the class of such callback remains a perfect fit with
> >>> the hypervisor list IMHO.
> >>
> >> This logs a panic to our "eventlog", a tiny logging area in SPI flash
> >> for critical and power-related events. In some cases this ends up
> >> being the only clue we get in a Chromebook feedback report that a
> >> panic occurred, so from my perspective moving it to the front of the
> >> line seems like a good idea.
> >
> > IMHO, this would really better fit into the pre-reboot notifier list:
> >
> > + the callback stores the log so it is similar to kmsg_dump()
> > or console_flush_on_panic()
> >
> > + the callback should be proceed after "info" notifiers
> > that might add some other useful information.
> >
> > Honestly, I am not sure what exactly hypervisor callbacks do. But I
> > think that they do not try to extract the kernel log because they
> > would need to handle the internal format.
> >
>
> I guess the main point in your response is : "I am not sure what exactly
> hypervisor callbacks do". We need to be sure about the semantics of such
> list, and agree on that.
>
> So, my opinion about this first list, that we call "hypervisor list",
> is: it contains callbacks that
>
> (1) should run early, preferably before kdump (or even if kdump isn't
> set, should run ASAP);
>
> (2) these callbacks perform some communication with an abstraction that
> runs "below" the kernel, like a firmware or hypervisor. Classic example:
> pvpanic, that communicates with VMM (usually qemu) and allow such VMM to
> snapshot the full guest memory, for example.
>
> (3) Should be low-risk. What defines risk is the level of reliability of
> subsequent operations - if the callback have 50% of chance of "bricking"
> the system totally and prevent kdump / kmsg_dump() / reboot , this is
> high risk one for example.
>
> Some good fits IMO: pvpanic, sstate_panic_event() [sparc], fadump in
> powerpc, etc.
>
> So, this is a good case for the Google notifier as well - it's not
> collecting data like the dmesg (hence your second bullet seems to not
> apply here, info notifiers won't add info to be collected by gsmi). It
> is a firmware/hypervisor/whatever-gsmi-is notification mechanism, that
> tells such "lower" abstraction a panic occurred. It seems low risk and
> we want it to run ASAP, if possible.
I see. I somehow assumed that it was about the kernel log because
Evan wrote:

  "This logs a panic to our "eventlog", a tiny logging area in SPI flash
  for critical and power-related events. In some cases this ends up [...]"
Anyway, I would distinguish it the following way:

  + If the notifier preserves the kernel log then it should ideally
    be treated like kmsg_dump().

  + If the notifier saves other debugging data then it fits better
    into the "hypervisor" notifier list.
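The distinction above can be written down as a small decision function. This is only a userspace sketch to make the rule explicit: the names `notifier_payload`, `panic_list` and `classify_notifier()` are made up for illustration and are not part of the kernel or of this patch set.

```c
/* Illustrative userspace model only; none of these names exist in the kernel. */

/* What a panic notifier saves or reports (hypothetical categories). */
enum notifier_payload {
	PAYLOAD_KERNEL_LOG,	/* preserves the kernel log */
	PAYLOAD_OTHER_DEBUG,	/* saves other debugging data, e.g. an event record */
};

/* Where the notifier would be placed under the proposed split. */
enum panic_list {
	LIST_HYPERVISOR,	/* the "hypervisor" notifier list */
	LIST_PRE_REBOOT,	/* handled like kmsg_dump() */
};

/* Apply the rule of thumb: kernel log goes with kmsg_dump(), the rest does not. */
enum panic_list classify_notifier(enum notifier_payload payload)
{
	if (payload == PAYLOAD_KERNEL_LOG)
		return LIST_PRE_REBOOT;
	return LIST_HYPERVISOR;
}
```

Under this reading, a callback that records a panic event rather than the kernel log, as described for gsmi, would map to `LIST_HYPERVISOR`.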
Regarding reliability: from my POV, any panic notifier enabled
in a generic kernel should be more than 99.9% reliable.
Otherwise, it should not be in the notifier list at all.

An exception would be a platform-specific notifier that is
called only on one specific platform, provided that the developers
maintaining this platform agree on it.

The value "99.9%" is arbitrary. I am not sure whether it is realistic
even for other code, for example, console_flush_on_panic()
or emergency_restart(). I just want to point out that the bar
should be rather high. Otherwise we would be back in the situation
where people want to disable particular notifiers.
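The reliability rule and its exception can be expressed as a simple predicate. Again, this is a hypothetical userspace sketch: `struct notifier_info`, its fields and `may_stay_in_list()` are invented for illustration, and the threshold just mirrors the arbitrary value mentioned above.

```c
#include <stdbool.h>

/* Illustrative only: mirrors the arbitrary threshold from the text. */
#define MIN_RELIABILITY 0.999

/* Hypothetical description of a panic notifier; not a kernel structure. */
struct notifier_info {
	double reliability;	/* estimated chance of completing safely, 0.0..1.0 */
	bool platform_specific;	/* called only on one specific platform */
	bool maintainers_agree;	/* that platform's maintainers accept the risk */
};

/* Return true if the notifier would be acceptable in the notifier list. */
bool may_stay_in_list(const struct notifier_info *n)
{
	if (n->reliability > MIN_RELIABILITY)
		return true;
	/* Exception: platform-specific notifier with maintainer agreement. */
	return n->platform_specific && n->maintainers_agree;
}
```

A notifier below the threshold would pass only when both exception conditions hold.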
Best Regards,
Petr