linux-kernel - Re: [PATCH v6 5/5] EDAC/amd64: Enumerate memory on Aldebaran GPU nodes

Message-ID: <YZKFDgtaBtvD6NIz@zn.tnic>
Date:   Mon, 15 Nov 2021 17:04:30 +0100
From:   Borislav Petkov <bp@...en8.de>
To:     "Chatradhi, Naveen Krishna" <nchatrad@....com>
Cc:     linux-edac@...r.kernel.org, x86@...nel.org,
        linux-kernel@...r.kernel.org, mingo@...hat.com, mchehab@...nel.org,
        yazen.ghannam@....com, Muralidhara M K <muralimk@....com>
Subject: Re: [PATCH v6 5/5] EDAC/amd64: Enumerate memory on Aldebaran GPU
 nodes

On Mon, Nov 15, 2021 at 08:54:55PM +0530, Chatradhi, Naveen Krishna wrote:
> The errors are not specific to GPUs, the errors are originating from HBM2e
> memory chips on the GPU.
> 
> As a first step, I'm trying to leverage the existing EDAC interfaces to
> report the HBM errors.

Report them how? How do the HBM chips fit in the EDAC sysfs hierarchy?
Does it even work with the current hierarchy or does EDAC need more
major restructuring?

You can send me an example from sysfs on such a system, privately is
fine too.

> Page retirement and storing the bad pages info on a persistent storage can
> be the next steps.

If you're thinking about plugging this into memory_failure(), then this
has nothing to do with EDAC.

All EDAC can give you is error count numbers in sysfs.
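
For illustration only (this is not part of the patch set or the thread): a
minimal sketch of reading those counters from user space. It assumes the
standard EDAC layout, where each memory controller appears as an mcN
directory under /sys/devices/system/edac/mc with ce_count (corrected
errors) and ue_count (uncorrected errors) files. The function name
read_edac_counts and its root parameter are hypothetical, and how HBM
on a GPU node would show up in this tree is exactly the open question
above.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Collect per-controller error counters from an EDAC sysfs tree.

    Returns a dict such as {"mc0": {"ce_count": 3, "ue_count": 0}}.
    Controllers with no readable counters are left out.
    """
    counts = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc_dir / name).read_text().strip())
            except (OSError, ValueError):
                # Counter file missing or unreadable: skip it.
                continue
        if entry:
            counts[mc_dir.name] = entry
    return counts


if __name__ == "__main__":
    for mc, entry in read_edac_counts().items():
        print(mc, entry)
```

On a machine without a loaded EDAC driver the directory is absent and the
function simply returns an empty dict.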

So I'd like to see where this is going first, and whether it is even
worth adding it to EDAC.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette