Message-ID: <b384d621-6b2d-7aab-adbf-7045f23f4af9@linux.alibaba.com>
Date:   Wed, 10 May 2023 10:17:18 +0800
From:   Shuai Xue <xueshuai@...ux.alibaba.com>
To:     Yazen Ghannam <yazen.ghannam@....com>, bp@...en8.de,
        tony.luck@...el.com
Cc:     tglx@...utronix.de, mingo@...hat.com, dave.hansen@...ux.intel.com,
        x86@...nel.org, hpa@...or.com, baolin.wang@...ux.alibaba.com,
        linux-edac@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] x86/mce/amd: init mce severity to handle deferred memory
 failure



On 2023/5/9 22:25, Yazen Ghannam wrote:
> On 4/25/23 8:18 AM, Shuai Xue wrote:
>> When a deferred UE error is detected, e.g. by the background patrol scrubber,
>> it will be handled in the APIC interrupt handler amd_deferred_error_interrupt().
>> The handler will collect MCA banks, init the mce struct and process it by
>> notifying the registered MCE decode chain.
>>
>> The uc_decode_notifier, one of the notifiers on the MCE decode chain, will
>> process memory failure, but only for MCE_AO_SEVERITY and MCE_DEFERRED_SEVERITY.
>> However, APIC interrupt handler does not init mce severity and the
>> uninitialized severity is 0 (MCE_NO_SEVERITY).
>>
>> To handle the deferred memory failure case, init mce severity when logging
>> MCA banks.
>>
>> Signed-off-by: Shuai Xue <xueshuai@...ux.alibaba.com>
>>
> 
> Hi Shuai Xue,
> 
> I think this patch is fair to do. But it won't have the intended effect
> in practice.
> 
> The value in MCA_ADDR for DRAM ECC errors will be a memory controller
> "normalized address". This is not a system physical address that the OS
> can use to take action.
> 
> The mce_usable_address() function needs to be updated to handle this.
> I'll send a patchset this week to do so. Afterwards, the
> uc_decode_notifier will not attempt to handle these errors.

From the experience of other platforms (e.g. ARM64 RAS and Intel MCA),
uc_decode_notifier should handle these errors to hard offline the corrupted
page. If the corrupted page is a free buddy page, we can isolate it and avoid
using the page in the future.
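
For reference, the gap the patch addresses can be modelled in user space roughly
as below. This is only a sketch, not the kernel code: struct mce_model,
notifier_would_handle() and log_deferred_error() are hypothetical names, and
apart from MCE_NO_SEVERITY being 0 the enum values are placeholders.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Illustrative severity levels. Only MCE_NO_SEVERITY == 0 is taken from the
 * patch description; the other values are placeholders for this sketch.
 */
enum severity_level {
	MCE_NO_SEVERITY = 0,
	MCE_DEFERRED_SEVERITY,
	MCE_AO_SEVERITY,
};

/* Hypothetical, trimmed-down stand-in for struct mce. */
struct mce_model {
	unsigned long long addr;
	int severity;
};

/*
 * Models the severity gate described above: only AO and deferred errors
 * are passed on to memory failure handling.
 */
static bool notifier_would_handle(const struct mce_model *m)
{
	assert(m != NULL);
	return m->severity == MCE_AO_SEVERITY ||
	       m->severity == MCE_DEFERRED_SEVERITY;
}

/*
 * Models logging a deferred error. With init_severity == false the record
 * stays zero-initialised, so severity remains MCE_NO_SEVERITY.
 */
static struct mce_model log_deferred_error(unsigned long long addr,
					   bool init_severity)
{
	struct mce_model m = { 0 };

	m.addr = addr;
	if (init_severity)
		m.severity = MCE_DEFERRED_SEVERITY;
	return m;
}
```

In this model, a record logged without severity initialisation is skipped by
the gate, while one logged with MCE_DEFERRED_SEVERITY passes it.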

In my test case, the error is detected by the patrol scrubber in the memory
controller. The scrubber may lack a system address space perspective and only
report a "normalized address". But we can decode the "normalized address" to a
system address with EDAC (umc_normaddr_to_sysaddr), right?

(I am not quite familiar with AMD RAS, please correct me if I am wrong)
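
The flow I have in mind could be sketched as below, assuming a translation step
is available before the page is offlined. Everything here is hypothetical:
norm_to_sys_fn stands in for umc_normaddr_to_sysaddr (whose real arguments are
not reproduced), toy_translate() is a placeholder that does not reflect real
address decoding, and 4 KiB pages are assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/*
 * Hypothetical translator type: returns 0 on success and fills *sys_addr.
 * It only stands in for the EDAC translation mentioned above.
 */
typedef int (*norm_to_sys_fn)(uint64_t norm_addr, uint64_t *sys_addr);

/*
 * Turn the address reported for a deferred error into a page frame number.
 * Returns false when no usable system address can be obtained, in which
 * case the caller should not attempt to offline any page.
 */
static bool deferred_error_to_pfn(uint64_t mca_addr, norm_to_sys_fn translate,
				  uint64_t *pfn)
{
	uint64_t sys_addr;

	assert(pfn != NULL);
	if (translate == NULL || translate(mca_addr, &sys_addr) != 0)
		return false;

	*pfn = sys_addr >> MODEL_PAGE_SHIFT;
	return true;
}

/*
 * Toy translator for demonstration only: adds a fixed base. Real decoding
 * depends on the memory configuration and is not modelled here.
 */
static int toy_translate(uint64_t norm_addr, uint64_t *sys_addr)
{
	*sys_addr = norm_addr + 0x100000000ULL;
	return 0;
}
```

Without a working translator the helper reports failure, which matches the
point that a normalized address alone is not something the OS can act on.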

> 
> Thanks,
> Yazen

Thank you.

Best Regards,
Shuai
