linux-kernel - Re: [PATCH RESEND v2] x86/mce: Set PG_hwpoison page flag to avoid the capture kernel panic


Date:   Tue, 10 Oct 2023 08:56:38 +0800
From:   Zhiquan Li <zhiquan1.li@...el.com>
To:     Ingo Molnar <mingo@...nel.org>
CC:     <x86@...nel.org>, <linux-edac@...r.kernel.org>,
        <linux-kernel@...r.kernel.org>, <patches@...ts.linux.dev>,
        <bp@...en8.de>, <tony.luck@...el.com>, <naoya.horiguchi@....com>
Subject: Re: [PATCH RESEND v2] x86/mce: Set PG_hwpoison page flag to avoid the
 capture kernel panic

On 2023/10/3 03:06, Ingo Molnar wrote:
> The English in this commit is *atrocious*, both in the changelog and in
> the comments - how on Earth did 'Posion' typo and half a dozen other
> typos and bad grammar survive ~3 iterations and a Reviewed-by tag?? The
> version below fixes up the worst, but I suspect that's not the only
> problem with this patch...

Many thanks for your attention and the fix-ups, Ingo.

I'd like to provide more background on this patch.

Memory errors don't happen very often, especially fatal ones.  However,
in large-scale deployments such as data centers, they still happen.  For
some fatal MCE cases, the kernel calls mce_panic() to terminate the
production kernel directly rather than trying to survive via
memory_failure() handling.  Unfortunately, the capture kernel will then
panic for the same reason if it touches the faulty memory again.  As a
result, only an incomplete vmcore is left for sustaining engineers,
which makes it very hard for them to work out what happened.

We considered three solutions and finally chose the last one.

1. When the capture kernel boots, rescan the MCE banks to check for
   fatal errors and set the PG_hwpoison flag for each error page.
   We can foresee that this solution is heavy.  It needs to find the
   struct page of each error page in old memory and set the flag.  It
   looks like we would have to reinvent the wheel, so we gave it up.

2. Replace copy_to_iter() in __copy_oldmem_page() with
   _copy_mc_to_iter(), which is a #MC-safe version.
   This solution is lightweight but has the following drawbacks:

   1) Such issues are quite rare; we don't want to use a #MC-safe copy
      just to accommodate them, especially if the problem can be dealt
      with by MCE handling rather than by touching the kdump code.

   2) The #MC-safe copy is conditional: whether it can recover from the
      #MC depends on whether MCE handling reaches fixup_exception() in
      do_machine_check().  However, in the fatal error case, it might
      invoke mce_panic() and crash the capture kernel before the error
      is fixed up.

3. The solution in this patch overcomes all of the above drawbacks.  It
   sets the flag just before the production kernel calls panic(), which
   does not introduce additional overhead in the capture kernel or
   conflict with other hwpoison-related code in the production kernel.
   Furthermore, it leverages existing mechanisms as much as possible to
   fix the issue, and the code changes are also lightweight.
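
To illustrate the idea behind option 3, here is a minimal userspace
sketch in C.  It is not the patch itself: struct model_page,
MODEL_PG_HWPOISON and both functions are hypothetical stand-ins for the
kernel's struct page, PG_hwpoison and the panic/dump paths, and it
assumes that the dump side skips any page carrying the flag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_PAGE_SIZE   16u        /* tiny pages keep the model readable */
#define MODEL_NR_PAGES    4u
#define MODEL_PG_HWPOISON (1u << 0)  /* stand-in for the real PG_hwpoison bit */

/* Hypothetical stand-in for struct page: only the flags word is modelled. */
struct model_page {
    uint32_t flags;
};

/* "Old memory" as seen by the capture side: page metadata plus contents. */
struct model_memory {
    struct model_page pages[MODEL_NR_PAGES];
    uint8_t data[MODEL_NR_PAGES][MODEL_PAGE_SIZE];
};

/*
 * Production side: mark the failing page right before panicking.
 * The real kernel would go on to call panic(); the model just returns.
 */
void model_mark_poison_before_panic(struct model_memory *mem, size_t bad_pfn)
{
    if (bad_pfn < MODEL_NR_PAGES)
        mem->pages[bad_pfn].flags |= MODEL_PG_HWPOISON;
}

/*
 * Capture side: copy every page that is not flagged and zero-fill the
 * slots of flagged pages, so the bad page is never read.
 * Returns the number of pages actually copied.
 */
size_t model_dump(const struct model_memory *mem,
                  uint8_t out[MODEL_NR_PAGES][MODEL_PAGE_SIZE])
{
    size_t copied = 0;

    for (size_t pfn = 0; pfn < MODEL_NR_PAGES; pfn++) {
        if (mem->pages[pfn].flags & MODEL_PG_HWPOISON) {
            memset(out[pfn], 0, MODEL_PAGE_SIZE);
            continue;
        }
        memcpy(out[pfn], mem->data[pfn], MODEL_PAGE_SIZE);
        copied++;
    }
    return copied;
}
```

In this model the only extra work on the production side is setting one
bit, and the capture side needs no #MC-safe copy because it never
touches the flagged page.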

Verifying the fix is not difficult.  The issue can be reproduced with
the "copyout" test case in ras-tools
(https://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git).
It injects a fatal memory error in kernel space via the APEI EINJ
interface (hardware platform support is needed), and then touches the
error page to trigger the issue.  The patch has been validated with
this tool.

Any ideas are welcome!

Best Regards,
Zhiquan