linux-kernel - Re: [PATCHv3 2/2] drm/amdgpu: Register MCE notifier for Aldebaran RAS

Message-ID: <YU8GGSrQSbAZPz4z@zn.tnic>
Date:   Sat, 25 Sep 2021 13:20:57 +0200
From:   Borislav Petkov <bp@...en8.de>
To:     Yazen Ghannam <yazen.ghannam@....com>
Cc:     "Joshi, Mukul" <Mukul.Joshi@....com>,
        "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
        "x86@...nel.org" <x86@...nel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "mchehab@...nel.org" <mchehab@...nel.org>,
        "amd-gfx@...ts.freedesktop.org" <amd-gfx@...ts.freedesktop.org>
Subject: Re: [PATCHv3 2/2] drm/amdgpu: Register MCE notifier for Aldebaran RAS

On Fri, Sep 24, 2021 at 07:46:10PM +0000, Yazen Ghannam wrote:
> I agree with you in general. But this device isn't really a GPU. And
> users of this device seem to want to count *every* error, at least for
> now.

Aha, so something accelerator-y where they do general purpose computation.

So what's the big picture here: they count all the errors and when they
reach a certain amount, they decide to replace the GPUs just in case?

Or wait until they become uncorrectable? But then it doesn't matter
because we will handle it properly by excluding the VRAM range from
further use.

Or do they wanna see *when* they had the correctable errors so that they
can restart the computation, just in case?

Dunno, it would be a lot more helpful if we had some RAS strategy for
those things...

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette