Message-ID: <5398da4c-5286-4e1b-924c-6df91f932427@intel.com>
Date: Tue, 17 Oct 2023 23:00:01 +0800
From: Zhiquan Li <zhiquan1.li@...el.com>
To: Borislav Petkov <bp@...en8.de>, "Luck, Tony" <tony.luck@...el.com>
CC: "x86@...nel.org" <x86@...nel.org>,
"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"patches@...ts.linux.dev" <patches@...ts.linux.dev>,
"mingo@...nel.org" <mingo@...nel.org>,
"naoya.horiguchi@....com" <naoya.horiguchi@....com>
Subject: Re: [PATCH v3] x86/mce: Set PG_hwpoison page flag to avoid the
capture kernel panic
On 2023/10/17 19:18, Borislav Petkov wrote:
> ... for the simple reason that the kernel cannot allow itself to do any
> unnecessary work but panic immediately so that it can stop the
> propagation of bad data.
>
> Now, it's a whole different story whether that's the right thing to do
> and whether the data has already propagated so that the panic is moot.
>
> The whole point I'm trying to make is that the machine panics because
> the error severity dictates it to do so. And there's no opportunity to
> queue recovery work because it simply cannot in that case. So the commit
> message should simply state that we're marking the page as poison for
> the kexec'ed kernel's sake and not because of anything else.
>
Wonderful! Thanks for your detailed explanation, Boris!
I think I now see why you emphasized "can't make the kernel survive"
before. In such a scenario, considering recovery doesn't make sense at
all: even if there is an opportunity to recover, the kernel shouldn't
take it. The only choice is to panic ASAP.
>> If kexec is enabled, check for memory errors and mark the
>> page as poisoned so that the kexec'd kernel can avoid accessing
>> the page.
> Yap, yours makes sense.
Tony, your commit message made me realize how verbose my commit message
is. May I simplify the whole commit message as follows for the next
version?
---start---
Memory errors don't happen very often, especially fatal ones. However,
in large-scale scenarios, such as data centers, they might still
happen. When there is a fatal machine check, Linux calls mce_panic()
without checking to see if bad data at some memory address was reported
in the machine check banks.
If kexec is enabled, check for memory errors and mark the page as
poisoned so that the kexec'ed kernel can avoid accessing the page.
---end---
It already covers the scenario, root cause and solution, and focuses on
the kernel. No need to talk about anything else.
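For illustration only, the "solution" paragraph of the message above
could be modelled in userspace C roughly as follows. This is a sketch
under assumptions, not the patch itself: struct mce_model, struct
page_model and mark_poison_before_panic() are hypothetical stand-ins
for the kernel's own structures and helpers, and the page lookup is
reduced to an array index. The only thing taken from the real
hardware interface is the ADDRV bit (bit 58) of MCi_STATUS.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT        12
#define MCI_STATUS_ADDRV  (1ULL << 58)  /* MCi_STATUS: address field is valid */

/* Hypothetical stand-ins for the kernel's struct mce and struct page. */
struct mce_model  { uint64_t status; uint64_t addr; };
struct page_model { bool hwpoison; };

/*
 * Model of the step proposed for the panic path: if a crash kernel is
 * loaded and the fatal record carries a valid address, flag that page
 * as poisoned. Returns true when a page was marked.
 */
static bool mark_poison_before_panic(const struct mce_model *final,
                                     bool crash_kernel_loaded,
                                     struct page_model *pages, size_t npages)
{
    if (!crash_kernel_loaded || !final)
        return false;               /* nothing will run after the panic */
    if (!(final->status & MCI_STATUS_ADDRV))
        return false;               /* no memory address was reported */

    uint64_t pfn = final->addr >> PAGE_SHIFT;
    if (pfn >= npages)
        return false;               /* address is not backed by a known page */

    pages[pfn].hwpoison = true;     /* the kexec'ed kernel can now skip it */
    return true;
}
```

Nothing here attempts recovery; the function only records which page
is bad before the (modelled) panic, which is all the capture kernel
needs.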
Thanks to both of you for the great insights.
Best Regards,
Zhiquan