linux-kernel - Re: [PATCH 2/2] RAS: Introduce the FRU Memory Poison Manager


Message-ID: <4107582e-03e7-4edf-8c50-6bf693f2d18e@amd.com>
Date: Wed, 14 Feb 2024 11:01:36 -0500
From: Yazen Ghannam <yazen.ghannam@....com>
To: Borislav Petkov <bp@...en8.de>
Cc: yazen.ghannam@....com, tony.luck@...el.com, linux-edac@...r.kernel.org,
 linux-kernel@...r.kernel.org, avadhut.naik@....com, john.allen@....com,
 muralidhara.mk@....com, naveenkrishna.chatradhi@....com,
 sathyapriya.k@....com
Subject: Re: [PATCH 2/2] RAS: Introduce the FRU Memory Poison Manager

On 2/14/2024 10:49 AM, Borislav Petkov wrote:
> On Wed, Feb 14, 2024 at 09:21:45AM -0500, Yazen Ghannam wrote:
>> Do you mean this should be left out of the commit message?
> 
> Yes, the text should talk only about what the patch does. What can and
> will and won't happen in the future doesn't matter.
>

Got it.
  
> IOW, here's what I have now:
> 
> RAS: Introduce a FRU memory poison manager
> 
> Memory errors are an expected occurrence on systems with high memory
> density. Generally, errors within a small number of unique physical
> locations are acceptable, based on manufacturer and/or admin policy.
> During run time, memory with errors may be retired so it is no longer
> used by the system. This is done in mm through page poisoning, and the
> effect will remain until the system is restarted.
> 
> If a memory location is consistently faulty, then the same run time
> error handling may occur in the next reboot cycle, leading to
> terminating jobs due to that already known bad memory. This could be
> prevented if information from the previous boot was not lost.
> 
> Some add-in cards with driver-managed memory have on-board persistent
> storage. Their driver saves memory error information to the persistent
> storage during run time. The information is then be restored after

"then be" -> "then"

> reset, and known bad memory will be retired before the hardware is used.
> A running log of bad memory locations is kept across multiple resets.
> 
> A similar solution is desirable for CPUs. However, this solution should
> leverage industry-standard components as much as possible, rather than
> a bespoke platform driver.
> 
> Two components are needed: a record format and a persistent storage
> interface.
> 
> Implement a new module to manage the record formats on persistent
> storage. Use the requirements for an AMD MI300-based system to start.
> Vendor- and platform-specific details can be abstracted later as needed.
> 
>    [ bp: Massage commit message. ]
> 
> Signed-off-by: Yazen Ghannam <yazen.ghannam@....com>
> Signed-off-by: Borislav Petkov (AMD) <bp@...en8.de>
> Link: https://lore.kernel.org/r/20240214033516.1344948-3-yazen.ghannam@amd.com
> 

Otherwise, looks good.

Thanks,
Yazen