Message-ID: <YZKFDgtaBtvD6NIz@zn.tnic>
Date: Mon, 15 Nov 2021 17:04:30 +0100
From: Borislav Petkov <bp@...en8.de>
To: "Chatradhi, Naveen Krishna" <nchatrad@....com>
Cc: linux-edac@...r.kernel.org, x86@...nel.org,
linux-kernel@...r.kernel.org, mingo@...hat.com, mchehab@...nel.org,
yazen.ghannam@....com, Muralidhara M K <muralimk@....com>
Subject: Re: [PATCH v6 5/5] EDAC/amd64: Enumerate memory on Aldebaran GPU nodes

On Mon, Nov 15, 2021 at 08:54:55PM +0530, Chatradhi, Naveen Krishna wrote:
> The errors are not specific to GPUs, the errors are originating from HBM2e
> memory chips on the GPU.
>
> As a first step, I'm trying to leverage the existing EDAC interfaces to
> report the HBM errors.

Report them how? How do the HBM chips fit in the EDAC sysfs hierarchy?
Does it even work with the current hierarchy or does EDAC need more
major restructuring?

You can send me an example from sysfs on such a system, privately is
fine too.

> Page retirement and storing the bad pages info on a persistent storage can
> be the next steps.

If you're thinking about plugging this into memory_failure(), then this
has nothing to do with EDAC.

All EDAC can give you is error count numbers in sysfs.
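
For illustration, this is roughly what reading those counts looks like
from userspace. It is only a sketch, assuming the usual layout under
/sys/devices/system/edac/mc with per-controller ce_count and ue_count
files; the function name read_edac_counts is made up for this example,
and whether HBM on a GPU node would show up there at all is exactly the
open question.

```python
from pathlib import Path

# Assumed default location of the EDAC memory-controller sysfs tree.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_MC_ROOT):
    """Return {mc_name: {"ce_count": int|None, "ue_count": int|None}}.

    Hypothetical helper: walks every mc<N> directory under `root` and
    reads the correctable/uncorrectable error counters.
    """
    counts = {}
    root = Path(root)
    if not root.is_dir():
        # EDAC driver not loaded, or no memory controllers registered.
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                # Counter file missing or unreadable: record as unknown.
                entry[name] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for mc_name, mc_counts in read_edac_counts().items():
        print(mc_name, mc_counts)
```

Note that this yields counters only. Nothing here retires a page or
records a bad address, which is why page retirement is a separate
matter from EDAC.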

So I'd like to see where this is going first, and whether it is even
worth adding to EDAC.

Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette