linux-kernel - Re: [PATCH v11 1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20240223121701.00004bcf@Huawei.com>
Date: Fri, 23 Feb 2024 12:17:01 +0000
From: Jonathan Cameron <Jonathan.Cameron@...wei.com>
To: Dan Williams <dan.j.williams@...el.com>
CC: Shuai Xue <xueshuai@...ux.alibaba.com>, Borislav Petkov <bp@...en8.de>,
	Ira Weiny <ira.weiny@...el.com>, "Luck, Tony" <tony.luck@...el.com>,
	"james.morse@....com" <james.morse@....com>, <rafael@...nel.org>,
	<wangkefeng.wang@...wei.com>, <tanxiaofei@...wei.com>,
	<mawupeng1@...wei.com>, <linmiaohe@...wei.com>, <naoya.horiguchi@....com>,
	<gregkh@...uxfoundation.org>, <will@...nel.org>, <jarkko@...nel.org>,
	<linux-acpi@...r.kernel.org>, <linux-mm@...ck.org>,
	<linux-kernel@...r.kernel.org>, <akpm@...ux-foundation.org>,
	<linux-edac@...r.kernel.org>, <x86@...nel.org>, <justin.he@....com>,
	<ardb@...nel.org>, <ying.huang@...el.com>, <ashish.kalra@....com>,
	<baolin.wang@...ux.alibaba.com>, <tglx@...utronix.de>, <mingo@...hat.com>,
	<dave.hansen@...ux.intel.com>, <lenb@...nel.org>, <hpa@...or.com>,
	<robert.moore@...el.com>, <lvying6@...wei.com>, <xiexiuqi@...wei.com>,
	<zhuo.song@...ux.alibaba.com>
Subject: Re: [PATCH v11 1/3] ACPI: APEI: send SIGBUS to current task if
 synchronous memory error not recovered

On Fri, 23 Feb 2024 12:08:13 +0000
Jonathan Cameron <Jonathan.Cameron@...wei.com> wrote:

> On Thu, 22 Feb 2024 21:26:43 -0800
> Dan Williams <dan.j.williams@...el.com> wrote:
> 
> > Shuai Xue wrote:  
> > > 
> > > 
> > > On 2024/2/19 17:25, Borislav Petkov wrote:    
> > > > On Sun, Feb 04, 2024 at 04:01:42PM +0800, Shuai Xue wrote:    
> > > >> Synchronous error was detected as a result of user-space process accessing
> > > >> a 2-bit uncorrected error. The CPU will take a synchronous error exception
> > > >> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
> > > >> memory_failure() work which poisons the related page, unmaps the page, and
> > > >> then sends a SIGBUS to the process, so that a system wide panic can be
> > > >> avoided.
> > > >>
> > > >> However, no memory_failure() work will be queued when abnormal synchronous
> > > >> errors occur. These errors can include situations such as invalid PA,
> > > >> unexpected severity, no memory failure config support, invalid GUID
> > > >> section, etc. In such case, the user-space process will trigger SEA again.
> > > >> This loop can potentially exceed the platform firmware threshold or even
> > > >> trigger a kernel hard lockup, leading to a system reboot.
> > > >>
> > > >> Fix it by performing a force kill if no memory_failure() work is queued
> > > >> for synchronous errors.
> > > >>
> > > >> Signed-off-by: Shuai Xue <xueshuai@...ux.alibaba.com>
> > > >> ---
> > > >>  drivers/acpi/apei/ghes.c | 9 +++++++++
> > > >>  1 file changed, 9 insertions(+)
> > > >>
> > > >> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > > >> index 7b7c605166e0..0892550732d4 100644
> > > >> --- a/drivers/acpi/apei/ghes.c
> > > >> +++ b/drivers/acpi/apei/ghes.c
> > > >> @@ -806,6 +806,15 @@ static bool ghes_do_proc(struct ghes *ghes,
> > > >>  		}
> > > >>  	}
> > > >>  
> > > >> +	/*
> > > >> +	 * If no memory failure work is queued for abnormal synchronous
> > > >> +	 * errors, do a force kill.
> > > >> +	 */
> > > >> +	if (sync && !queued) {
> > > >> +		pr_err("Sending SIGBUS to current task due to memory error not recovered");
> > > >> +		force_sig(SIGBUS);
> > > >> +	}    
> > > > 
> > > > Except that there are a bunch of CXL GUIDs being handled there too and
> > > > this will sigbus those processes now automatically.    
> > > 
> > > Before the CXL GUIDs added, @Tony confirmed that the HEST notifications are always
> > > asynchronous on x86 platform, so only Synchronous External Abort (SEA) on ARM is
> > > delivered as a synchronous notification.
> > > 
> > > Will the CXL component trigger synchronous events for which we need to terminate the
> > > current process by sending sigbus to process?    
> > 
> > None of the CXL component errors should be handled as synchronous
> > events. They are either asynchronous protocol errors, or effectively
> > equivalent to CPER_SEC_PLATFORM_MEM notifications.  
> 
> Not a good example, CPER_SEC_PLATFORM_MEM is sometimes signaled via SEA.
> 

Premature send.:(

One example I can point at is how we do signaling of memory
errors detected by the host into a VM on arm64.
https://elixir.bootlin.com/qemu/latest/source/hw/acpi/ghes.c#L391
CPER_SEC_PLATFORM_MEM via ARM Synchronous External Abort (SEA).

Right now we've only used async in QEMU for proposed CXL error
CPER records signalling but your reference to them being similar
to CPER_SEC_PLATFORM_MEM is valid so 'maybe' they will be
synchronous in some physical systems as it's one viable way to
provide rich information for synchronous reception of poison.
For the VM case my assumption today is we don't care about providing the
VM with rich data, so CPER_SEC_PLATFORM_MEM is fine as a path for
errors whether from CXL CPER records or not.

Jonathan