lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d83545f0-af15-10bc-0f5d-9b531b54b9dd@arm.com>
Date:   Thu, 30 Nov 2023 17:43:38 +0000
From:   James Morse <james.morse@....com>
To:     Borislav Petkov <bp@...en8.de>,
        Shuai Xue <xueshuai@...ux.alibaba.com>
Cc:     rafael@...nel.org, wangkefeng.wang@...wei.com,
        tanxiaofei@...wei.com, mawupeng1@...wei.com, tony.luck@...el.com,
        linmiaohe@...wei.com, naoya.horiguchi@....com,
        gregkh@...uxfoundation.org, will@...nel.org, jarkko@...nel.org,
        linux-acpi@...r.kernel.org, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, akpm@...ux-foundation.org,
        linux-edac@...r.kernel.org, acpica-devel@...ts.linuxfoundation.org,
        stable@...r.kernel.org, x86@...nel.org, justin.he@....com,
        ardb@...nel.org, ying.huang@...el.com, ashish.kalra@....com,
        baolin.wang@...ux.alibaba.com, tglx@...utronix.de,
        mingo@...hat.com, dave.hansen@...ux.intel.com, lenb@...nel.org,
        hpa@...or.com, robert.moore@...el.com, lvying6@...wei.com,
        xiexiuqi@...wei.com, zhuo.song@...ux.alibaba.com
Subject: Re: [PATCH v9 0/2] ACPI: APEI: handle synchronous errors in task work
 with proper si_code

Hi Boris,

On 30/11/2023 14:40, Borislav Petkov wrote:
> FTR, this is starting to make sense, thanks for explaining.
> 
> Replying only to this one for now:
> 
> On Thu, Nov 30, 2023 at 10:58:53AM +0800, Shuai Xue wrote:
>> To reproduce this problem:
>>
>> 	# STEP1: enable early kill mode
>> 	#sysctl -w vm.memory_failure_early_kill=1
>> 	vm.memory_failure_early_kill = 1
>>
>> 	# STEP2: inject an UCE error and consume it to trigger a synchronous error
> 
> So this is for ARM folks to deal with, BUT:
> 
> A consumed uncorrectable error on x86 means panic. On some hw like on
> AMD, that error doesn't even get seen by the OS but the hw does
> something called syncflood to prevent further error propagation. So
> there's no any action required - the hw does that.
> 
> But I'd like to hear from ARM folks whether consuming an uncorrectable
> error even lets software run. Dunno.

I think we mean different things by 'consume' here.

I'd assume Shuai's test is poisoning a cache-line. When the CPU tries to access that
cache-line it will get an 'external abort' signal back from the memory system. Shuai - is
this what you mean by 'consume' - the CPU received external abort from the poisoned cache
line?

It's then up to the CPU whether it can put the world back in order to take this as
synchronous-external-abort or asynchronous-external-abort, which for arm64 are two
different interrupt/exception types.
The synchronous exceptions can't be masked, but the asynchronous one can.
If by the time the asynchronous-external-abort interrupt/exception has been unmasked, the
CPU has used the poisoned value in some calculation (which is what we usually mean by
consume) which has resulted in a memory access - it will report the error as 'uncontained'
because the error has been silently propagated. APEI should always report those a 'fatal',
and there is little point getting the OS involved at this point. Also in this category are
things like 'tag ram corruption', where you can no longer trust anything about memory.

Everything in this thread is about synchronous errors where this can't happen. The CPU
stops and does takes an interrupt/exception instead.


Thanks,

James

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ