Date:   Mon, 15 Nov 2021 17:04:30 +0100
From:   Borislav Petkov <bp@...en8.de>
To:     "Chatradhi, Naveen Krishna" <nchatrad@....com>
Cc:     linux-edac@...r.kernel.org, x86@...nel.org,
        linux-kernel@...r.kernel.org, mingo@...hat.com, mchehab@...nel.org,
        yazen.ghannam@....com, Muralidhara M K <muralimk@....com>
Subject: Re: [PATCH v6 5/5] EDAC/amd64: Enumerate memory on Aldebaran GPU nodes

On Mon, Nov 15, 2021 at 08:54:55PM +0530, Chatradhi, Naveen Krishna wrote:
> The errors are not specific to GPUs; they originate from the HBM2e
> memory chips on the GPU.
> 
> As a first step, I'm trying to leverage the existing EDAC interfaces to
> report the HBM errors.

Report them how? How do the HBM chips fit in the EDAC sysfs hierarchy?
Does it even work with the current hierarchy or does EDAC need more
major restructuring?

You can send me an example from sysfs on such a system, privately is
fine too.

> Page retirement and storing the bad-page info in persistent storage can
> be the next steps.

If you're thinking about plugging this into memory_failure(), then this
has nothing to do with EDAC.

All EDAC can give you is error count numbers in sysfs.
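
As a rough illustration of what "error count numbers in sysfs" looks like
from userspace, here is a minimal sketch that reads the per-memory-controller
corrected and uncorrected error counters. It assumes the usual EDAC layout
under /sys/devices/system/edac/mc (mcN/ce_count and mcN/ue_count); the
function name read_edac_counts and the overridable root parameter are
made up for this example, and whether HBM on a GPU node would show up in
this layout at all is exactly the open question here.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {"mc0": {"ce_count": int, "ue_count": int}, ...}.

    Hypothetical helper: it only reads the counters EDAC exposes and
    does nothing else (no page retirement, no persistence).
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        # No EDAC driver loaded, or a non-standard sysfs mount point.
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for mc, entry in read_edac_counts().items():
        print(mc, entry)
```

Anything beyond these counters, such as acting on a bad page, would have to
live outside this interface.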

So I'd like to see where this is going first, and whether it is even
worth adding to EDAC.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
