Message-ID: <4107582e-03e7-4edf-8c50-6bf693f2d18e@amd.com>
Date: Wed, 14 Feb 2024 11:01:36 -0500
From: Yazen Ghannam <yazen.ghannam@....com>
To: Borislav Petkov <bp@...en8.de>
Cc: yazen.ghannam@....com, tony.luck@...el.com, linux-edac@...r.kernel.org,
linux-kernel@...r.kernel.org, avadhut.naik@....com, john.allen@....com,
muralidhara.mk@....com, naveenkrishna.chatradhi@....com,
sathyapriya.k@....com
Subject: Re: [PATCH 2/2] RAS: Introduce the FRU Memory Poison Manager

On 2/14/2024 10:49 AM, Borislav Petkov wrote:
> On Wed, Feb 14, 2024 at 09:21:45AM -0500, Yazen Ghannam wrote:
>> Do you mean this should be left out of the commit message?
>
> Yes, the text should talk only about what the patch does. What can and
> will and won't happen in the future doesn't matter.
>
Got it.
> IOW, here's what I have now:
>
> RAS: Introduce a FRU memory poison manager
>
> Memory errors are an expected occurrence on systems with high memory
> density. Generally, errors within a small number of unique physical
> locations are acceptable, based on manufacturer and/or admin policy.
> During run time, memory with errors may be retired so it is no longer
> used by the system. This is done in mm through page poisoning, and the
> effect will remain until the system is restarted.
>
> If a memory location is consistently faulty, then the same run time
> error handling may occur in the next reboot cycle, leading to
> terminating jobs due to that already known bad memory. This could be
> prevented if information from the previous boot was not lost.
>
> Some add-in cards with driver-managed memory have on-board persistent
> storage. Their driver saves memory error information to the persistent
> storage during run time. The information is then be restored after
"then be" -> "then"
> reset, and known bad memory will be retired before the hardware is used.
> A running log of bad memory locations is kept across multiple resets.
>
> A similar solution is desirable for CPUs. However, this solution should
> leverage industry-standard components as much as possible, rather than
> a bespoke platform driver.
>
> Two components are needed: a record format and a persistent storage
> interface.
>
> Implement a new module to manage the record formats on persistent
> storage. Use the requirements for an AMD MI300-based system to start.
> Vendor- and platform-specific details can be abstracted later as needed.
>
> [ bp: Massage commit message. ]
>
> Signed-off-by: Yazen Ghannam <yazen.ghannam@....com>
> Signed-off-by: Borislav Petkov (AMD) <bp@...en8.de>
> Link: https://lore.kernel.org/r/20240214033516.1344948-3-yazen.ghannam@amd.com
>
Otherwise, looks good.

Thanks,
Yazen