linux-kernel - Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CADnq5_NQonmqtFDpfsWygGzA2kv-W-daDSkxkY2ALf9a1eby9g@mail.gmail.com>
Date:   Thu, 13 May 2021 10:32:45 -0400
From:   Alex Deucher <alexdeucher@...il.com>
To:     Borislav Petkov <bp@...en8.de>
Cc:     "Joshi, Mukul" <Mukul.Joshi@....com>, x86-ml <x86@...nel.org>,
        "Kasiviswanathan, Harish" <Harish.Kasiviswanathan@....com>,
        lkml <linux-kernel@...r.kernel.org>,
        "amd-gfx@...ts.freedesktop.org" <amd-gfx@...ts.freedesktop.org>
Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

On Thu, May 13, 2021 at 10:30 AM Borislav Petkov <bp@...en8.de> wrote:
>
> On Thu, May 13, 2021 at 10:17:47AM -0400, Alex Deucher wrote:
> > The bad pages are stored in an EEPROM on the board and the next time
> > the driver loads it reads the EEPROM so that it can reserve the bad
> > pages at init time so they don't get used again.
>
> And that works automagically on the next boot? Because that sounds like
> the right thing to do.

Yes, or driver reload, suspend/resume, etc.

>
> So practically, what happens to a GPU in such a case where the VRAM
> starts going bad? It might get exhausted eventually and the driver will
> say something along the lines of:
>
>   "VRAM bad pages: 80%, consider replacing the GPU. It is operating
>   currently with degrated performance."
>
> or so?

Right.  The sys admin can query the bad page count and decide when to
retire the card.

>
> Yap, from a RAS perspective, that makes good sense as you're prolonging
> the life of the component while still remains operational as good as it
> can and the only user interaction you need is she/he replacing it.
>
> Sounds good.

Yes.  That's the idea.

Alex


>
> Thx.
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette