Message-ID: <20221025054559.GA4093658@hori.linux.bs1.fc.nec.co.jp>
Date: Tue, 25 Oct 2022 05:46:01 +0000
From: HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@....com>
To: Tony Luck <tony.luck@...el.com>
CC: Andrew Morton <akpm@...ux-foundation.org>,
Miaohe Lin <linmiaohe@...wei.com>,
Matthew Wilcox <willy@...radead.org>,
Shuai Xue <xueshuai@...ux.alibaba.com>,
Dan Williams <dan.j.williams@...el.com>,
Michael Ellerman <mpe@...erman.id.au>,
Nicholas Piggin <npiggin@...il.com>,
Christophe Leroy <christophe.leroy@...roup.eu>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linuxppc-dev@...ts.ozlabs.org" <linuxppc-dev@...ts.ozlabs.org>
Subject: Re: [PATCH v3 1/2] mm, hwpoison: Try to recover from copy-on write faults

On Fri, Oct 21, 2022 at 01:01:19PM -0700, Tony Luck wrote:
> If the kernel is copying a page as the result of a copy-on-write
> fault and runs into an uncorrectable error, Linux will crash because
> it does not have recovery code for this case where poison is consumed
> by the kernel.
>
> It is easy to set up a test case. Just inject an error into a private
> page, fork(2), and have the child process write to the page.
>
> I wrapped that neatly into a test at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
>
> Just enable ACPI error injection and run:
>
> # ./einj_mem-uc -f copy-on-write
>
> Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
> on architectures where that is available (currently x86 and powerpc).
> When an error is detected during the page copy, return VM_FAULT_HWPOISON
> to the caller of wp_page_copy(). This propagates up the call stack. Both
> x86 and powerpc have code in their fault handlers to deal with this code
> by sending a SIGBUS to the application.
>
> Note that this patch avoids a system crash and signals the process that
> triggered the copy-on-write action. It does not take any action for the
> memory error that is still in the shared page. To handle that, a call to
> memory_failure() is needed. But this cannot be done from wp_page_copy()
> because mmap_lock is held there. Perhaps the architecture fault handlers
> can deal with this loose end in a subsequent patch?
>
> On Intel/x86 this loose end will often be handled automatically because
> the memory controller provides an additional notification of the h/w
> poison in memory; the handler for this will call memory_failure(). This
> isn't a 100% solution. If there are multiple errors, not all may be
> logged in this way.
>
> Reviewed-by: Dan Williams <dan.j.williams@...el.com>
> Signed-off-by: Tony Luck <tony.luck@...el.com>

Thank you for the update. Looks good to me.

Reviewed-by: Naoya Horiguchi <naoya.horiguchi@....com>