Message-ID: <20220129080027.GC17613@MiWiFi-R3L-srv>
Date: Sat, 29 Jan 2022 16:00:27 +0800
From: Baoquan He <bhe@...hat.com>
To: Petr Mladek <pmladek@...e.com>
Cc: "Michael Kelley (LINUX)" <mikelley@...rosoft.com>,
"Guilherme G. Piccoli" <gpiccoli@...lia.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"kernel@...ccoli.net" <kernel@...ccoli.net>,
"senozhatsky@...omium.org" <senozhatsky@...omium.org>,
"rostedt@...dmis.org" <rostedt@...dmis.org>,
"john.ogness@...utronix.de" <john.ogness@...utronix.de>,
"feng.tang@...el.com" <feng.tang@...el.com>,
"kexec@...ts.infradead.org" <kexec@...ts.infradead.org>,
"dyoung@...hat.com" <dyoung@...hat.com>,
"keescook@...omium.org" <keescook@...omium.org>,
"anton@...msg.org" <anton@...msg.org>,
"ccross@...roid.com" <ccross@...roid.com>,
"tony.luck@...el.com" <tony.luck@...el.com>
Subject: Re: [PATCH V3] panic: Move panic_print before kmsg dumpers
On 01/26/22 at 12:51pm, Petr Mladek wrote:
> On Mon 2022-01-24 16:57:17, Michael Kelley (LINUX) wrote:
> > From: Baoquan He <bhe@...hat.com> Sent: Friday, January 21, 2022 8:34 PM
> > >
> > > On 01/21/22 at 03:00pm, Michael Kelley (LINUX) wrote:
> > > > From: Baoquan He <bhe@...hat.com> Sent: Thursday, January 20, 2022 6:31 PM
> > > > >
> > > > > On 01/20/22 at 06:36pm, Guilherme G. Piccoli wrote:
> > > > > > Hi Baoquan, some comments inline below:
> > > > > >
> > > > > > On 20/01/2022 05:51, Baoquan He wrote:
> >
> > [snip]
> >
> > > > > > Do you think it should be necessary?
> > > > > > > How about if we allow users to just "panic_print" with or without the
> > > > > > > "crash_kexec_post_notifiers", then we pursue Petr's suggestion of
> > > > > > > refactoring the panic notifiers? So, after this future refactor, we
> > > > > > > might have much clearer code.
> > > > >
> > > > > > I haven't read Petr's reply in the other panic notifier filter thread. As for
> > > > > > the panic notifier, its use is only enforced on the HyperV platform; except for
> > > > > > that, users need to explicitly add "crash_kexec_post_notifiers=1" to enable
> > > > > > it. And we got a bug report on the HyperV issue. In our internal discussion,
> > > > > > we strongly suggested that the HyperV devs change the default enablement and
> > > > > > instead leave it to the user to decide.
> > > > >
> > > >
> > > > Regarding Hyper-V: Invoking the Hyper-V notifier prior to running the
> > > > kdump kernel is necessary for correctness. During initial boot of the
> > > > main kernel, the Hyper-V and VMbus code in Linux sets up several guest
> > > > physical memory pages that are shared with Hyper-V, and that Hyper-V
> > > > may write to. A VMbus connection is also established. Before kexec'ing
> > > > into the kdump kernel, the sharing of these pages must be rescinded
> > > > and the VMbus connection must be terminated. If this isn't done, the
> > > > kdump kernel will see strange memory overwrites if these shared guest
> > > > physical memory pages get used for something else.
> >
> > In the Azure cloud, collecting data before crash dumps is a motivation
> > as well for setting crash_kexec_post_notifiers to true. That way as
> > cloud operator we can see broad failure trends, and in specific cases
> > customers often expect the cloud operator to be able to provide info
> > about a problem even if they have taken a kdump. Where did you
> > envision adding a comment in the code to help clarify these intentions?
> >
> > I looked at the code again, and should revise my previous comments
> > somewhat. The Hyper-V resets that I described indeed must be done
> > prior to kexec'ing the kdump kernel. Most such resets are actually
> > done via __crash_kexec() -> machine_crash_shutdown(), not via the
> > panic notifier. However, the Hyper-V panic notifier must terminate the
> > VMbus connection, because that must be done even if kdump is not
> > being invoked. See commit 74347a99e73.
> >
> > Most of the hangs seen in getting into the kdump kernel on Hyper-V/Azure
> > were probably due to the machine_crash_shutdown() path, and not due
> > to running the panic notifiers prior to kexec'ing the kdump kernel. The
> > exception is terminating the VMbus connection, which had problems that
> > are hopefully now fixed because of adding a timeout.
>
> My understanding is that we could split the actions into three groups:
>
> 1. Actions that have to be done before kexec'ing the kdump kernel, like
> resetting physical memory shared with Hyper-V.
>
> These operation(s) are needed only for kexec and can be done
> in kexec.
>
>
> 2. Notify Hyper-V so that, for example, Azure cloud, could collect
> data before crash dump.
>
> It is nice to have.
>
> It should be configurable if it is not completely safe. I mean
> that there should be a way to disable it when it might increase
> the risk that kexec'ing kdump kernel might fail.
>
>
> 3. Some actions are needed only when panic() ends up in the
> infinite loop.
>
> For example, unloading vmbus channel. At least the commit
> 74347a99e73ae00b8385f ("x86/Hyper-V: Unload vmbus channel in
> hv panic callback") says that it is done in kdump path
> out of box.
>
> All these operations are needed and used only when the kernel is
> running under Hyper-V.
>
> My main intention is to understand if we need 2 or 3 notifier lists
> or the current one is enough.
>
> The 3 notifier lists would be:
>
> + always do (even before kdump)
> + optionally do before or after kdump
> + needed only when kdump is not called
Totally agree with the above suggestions for Hyper-V. The cleanup described
above seems necessary. Stuffing them all into the panic_notifiers package is
not appropriate.
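
To make the three-list split concrete, here is a minimal userspace sketch of
the ordering it implies. It is only a model and not kernel code: struct
panic_model, model_panic(), model_ran() and the phase labels are hypothetical
names invented for illustration, and the "optional" list is assumed to be
gated by a flag that plays the role of crash_kexec_post_notifiers. Since
kexec'ing into the kdump kernel does not return, "after kdump" is modelled
as "not run at all" when a crash kernel is loaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_LOG 8

/* Hypothetical model of one panic; none of these names exist in the kernel. */
struct panic_model {
    bool kdump_loaded;     /* a crash kernel has been loaded */
    bool post_notifiers;   /* stands in for crash_kexec_post_notifiers */
    const char *log[MAX_LOG];
    size_t nlog;
};

/* Append one executed phase to the log. */
static void record(struct panic_model *m, const char *what)
{
    assert(m->nlog < MAX_LOG);
    m->log[m->nlog++] = what;
}

/* Return true if the named phase was executed. */
static bool model_ran(const struct panic_model *m, const char *what)
{
    for (size_t i = 0; i < m->nlog; i++)
        if (strcmp(m->log[i], what) == 0)
            return true;
    return false;
}

/* Walk the three proposed lists in the order discussed in this thread. */
static void model_panic(struct panic_model *m)
{
    /* List 1: always do, even before kdump. */
    record(m, "always");

    /* Default kdump path: jump to the crash kernel before list 2. */
    if (m->kdump_loaded && !m->post_notifiers) {
        record(m, "kdump");
        return;            /* the real kexec never returns */
    }

    /* List 2: optional, runs before kdump only when explicitly enabled. */
    record(m, "optional");

    if (m->kdump_loaded) {
        record(m, "kdump");
        return;
    }

    /* List 3: needed only when kdump is not called. */
    record(m, "no-kdump");
}
```

With kdump_loaded set and post_notifiers clear, the model records "always"
and then "kdump"; setting post_notifiers inserts "optional" between them;
with no crash kernel loaded, all three lists run and "kdump" never appears.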