linux-kernel - Re: [PATCH] x86/mce/amd: init mce severity to handle deferred memory failure

Message-ID: <34fa94f5-359f-f3e7-92ae-fcdc06ff19b8@amd.com>
Date:   Wed, 10 May 2023 09:59:36 -0400
From:   Yazen Ghannam <yazen.ghannam@....com>
To:     Shuai Xue <xueshuai@...ux.alibaba.com>, bp@...en8.de,
        tony.luck@...el.com
Cc:     yazen.ghannam@....com, tglx@...utronix.de, mingo@...hat.com,
        dave.hansen@...ux.intel.com, x86@...nel.org, hpa@...or.com,
        baolin.wang@...ux.alibaba.com, linux-edac@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] x86/mce/amd: init mce severity to handle deferred memory
 failure

On 5/9/23 10:17 PM, Shuai Xue wrote:
> 
> 
> On 2023/5/9 22:25, Yazen Ghannam wrote:
>> On 4/25/23 8:18 AM, Shuai Xue wrote:
>>> When a deferred UE error is detected, e.g. by the background patrol
>>> scrubber, it will be handled in the APIC interrupt handler
>>> amd_deferred_error_interrupt(). The handler will collect MCA banks, init
>>> the mce struct and process it by notifying the registered MCE decode chain.
>>>
>>> The uc_decode_notifier, one of the notifiers on the MCE decode chain,
>>> processes memory failures, but only for MCE_AO_SEVERITY and
>>> MCE_DEFERRED_SEVERITY. However, the APIC interrupt handler does not init
>>> the mce severity, so the uninitialized severity is 0 (MCE_NO_SEVERITY).
>>>
>>> To handle the deferred memory failure case, init mce severity when logging
>>> MCA banks.
>>>
>>> Signed-off-by: Shuai Xue <xueshuai@...ux.alibaba.com>
>>>
>>
>> Hi Shuai Xue,
>>
>> I think this patch is fair to do. But it won't have the intended effect
>> in practice.
>>
>> The value in MCA_ADDR for DRAM ECC errors will be a memory controller
>> "normalized address". This is not a system physical address that the OS
>> can use to take action.
>>
>> The mce_usable_address() function needs to be updated to handle this.
>> I'll send a patchset this week to do so. Afterwards, the
>> uc_decode_notifier will not attempt to handle these errors.
> 
> From experience with other platforms (e.g. ARM64 RAS and Intel MCA),
> uc_decode_notifier should handle these errors to hard offline the corrupted
> page. If the corrupted page is a free buddy page, we can isolate it and avoid
> using the page in the future.
> 
> In my test case, the error is detected by the patrol scrubber in the memory
> controller. The scrubber may lack a system address space perspective, and
> only report a "normalized address". But we can decode the "normalized
> address" to a system address by EDAC (umc_normaddr_to_sysaddr), right?
> 
> (I am not quite familiar with AMD RAS, please correct me if I am wrong)
>

Yes, that's correct.

The address translation requires some updates that are still in-review.
Afterwards, we can investigate ways to use the translated address. It
may require some rework in the MCE notifier chain or, more simply,
calling memory_failure() from the EDAC module itself.

Thanks,
Yazen
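
The two conditions discussed in this thread (the severity filter in
uc_decode_notifier and the need for a usable system physical address) can be
illustrated with a small user-space model. This is only a sketch, not kernel
code: the struct, the enum values other than 0 for "no severity", the
addr_is_sysaddr flag and the function name are all hypothetical stand-ins
chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-ins for the kernel's severity levels. Only
 * "no severity == 0" is stated in the thread; the other values are
 * arbitrary and exist just to make the model compile.
 */
enum model_severity {
	MODEL_NO_SEVERITY = 0,
	MODEL_DEFERRED_SEVERITY,
	MODEL_AO_SEVERITY,
};

/* Simplified model of an MCE record (not the kernel's struct mce). */
struct model_mce {
	uint64_t addr;          /* address reported with the error */
	int severity;           /* stays 0 if nobody initializes it */
	bool addr_is_sysaddr;   /* assumption: false for a normalized address */
};

/*
 * Models the decision described in the thread: act on the error only if
 * the address is a usable system physical address and the severity is
 * AO or deferred.
 */
bool model_should_offline_page(const struct model_mce *m)
{
	if (m == NULL || !m->addr_is_sysaddr)
		return false;

	return m->severity == MODEL_AO_SEVERITY ||
	       m->severity == MODEL_DEFERRED_SEVERITY;
}
```

In this model, a record whose severity was left at 0 is skipped, which is the
case the patch addresses. A record carrying a normalized address is skipped
regardless of severity, which is why initializing the severity alone would not
have the intended effect until the address can be translated.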