Message-ID: <20150409082125.GE25434@pd.tnic>
Date: Thu, 9 Apr 2015 10:21:25 +0200
From: Borislav Petkov <bp@...en8.de>
To: Ingo Molnar <mingo@...nel.org>
Cc: Naoya Horiguchi <n-horiguchi@...jp.nec.com>,
Tony Luck <tony.luck@...el.com>,
Prarit Bhargava <prarit@...hat.com>,
Vivek Goyal <vgoyal@...hat.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Junichi Nomura <j-nomura@...jp.nec.com>,
Kiyoshi Ueda <k-ueda@...jp.nec.com>
Subject: Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
On Thu, Apr 09, 2015 at 10:00:30AM +0200, Ingo Molnar wrote:
> So the thing is, when we boot up the second kernel there will be a
> window where the old handler isn't valid (because the new kernel has
> its own pagetables, etc.) and the new handler is not installed yet.
>
> If an MCE hits that window, it's bad luck. (unless the bootup sequence
> is rearchitected significantly to allow cross-kernel inheritance of
> MCE handlers.)
>
> So I think we can ignore _that_ race.
Yah, that's the "tough luck" race.
> The more significant question is: what happens when an MCE arrives
> while the kdump is proceeding - as kdumps can take a long time to
> finish when there's a lot of RAM.
We say that the dump might be unreliable.
> But ... since the 'shootdown' is analogous to a CPU hotplug CPU-down
> sequence, I suppose that the existing MCE code should already properly
> handle the case where an MCE arrives on a (supposedly) dead CPU,
> right? In that case installing a separate MCE handler looks like the
> wrong thing.
Hmm, so mce_start() does look only at the online CPUs. So if the crash
path does maintain those masks correctly...
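For reference, a grossly simplified sketch of the rendezvous in
mce_start() from memory - the real thing in
arch/x86/kernel/cpu/mcheck/mce.c has the timeout and no_way_out
handling on top - the point being that the count of CPUs we wait for
comes straight from num_online_cpus():

	/*
	 * Sketch only, not the real code: every CPU entering the MCE
	 * handler bumps mce_callin and then spins until all *online*
	 * CPUs have called in. A CPU the crash path has properly
	 * marked offline is simply not waited for.
	 */
	static int mce_start_sketch(void)
	{
		int cpus = num_online_cpus();
		int order = atomic_inc_return(&mce_callin);

		while (atomic_read(&mce_callin) != cpus)
			cpu_relax();

		return order;	/* first caller becomes the Monarch */
	}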
> So I don't like this principle either: 'our current code is a mess
> that might not work, add new one'.
Well, we can try to simplify it in the sense that those assumptions -
mcelog and the other MCE-consuming crap, the notifier chain - are
tested for their presence before we use them...
I'd be open to this if we had a way to test this kdump scenario. For
now, not even qemu can do that.
> Looks like that's the real problem. How about the kdump crash dumper
> sets it back to 'ignore' again when we crash, and also double check
> how we handle various corner cases?
I think I even suggested that at some point. Or was it to increase the
tolerance level? So Naoya, what was wrong with this again? I forgot.
Because this would be the simplest: simply set the tolerance level to 3
and dump away...
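Something like this, if we want it in the kernel rather than fiddling
with the kdump kernel's command line (IIRC mce=3 would do the same from
the boot parameter side). Purely a sketch - the hook name below is made
up, only mca_cfg.tolerant is the real knob:

	/*
	 * Sketch: hypothetical hook, called from the crash shutdown
	 * path before we start writing out the dump.
	 */
	static void mce_prepare_for_kdump(void)
	{
		/* tolerant == 3: never panic, just log what comes in */
		mca_cfg.tolerant = 3;
	}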
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.