Message-ID: <CADnq5_NQonmqtFDpfsWygGzA2kv-W-daDSkxkY2ALf9a1eby9g@mail.gmail.com>
Date: Thu, 13 May 2021 10:32:45 -0400
From: Alex Deucher <alexdeucher@...il.com>
To: Borislav Petkov <bp@...en8.de>
Cc: "Joshi, Mukul" <Mukul.Joshi@....com>, x86-ml <x86@...nel.org>,
"Kasiviswanathan, Harish" <Harish.Kasiviswanathan@....com>,
lkml <linux-kernel@...r.kernel.org>,
"amd-gfx@...ts.freedesktop.org" <amd-gfx@...ts.freedesktop.org>
Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
On Thu, May 13, 2021 at 10:30 AM Borislav Petkov <bp@...en8.de> wrote:
>
> On Thu, May 13, 2021 at 10:17:47AM -0400, Alex Deucher wrote:
> > The bad pages are stored in an EEPROM on the board and the next time
> > the driver loads it reads the EEPROM so that it can reserve the bad
> > pages at init time so they don't get used again.
>
> And that works automagically on the next boot? Because that sounds like
> the right thing to do.
Yes, or driver reload, suspend/resume, etc.
>
> So practically, what happens to a GPU in such a case where the VRAM
> starts going bad? It might get exhausted eventually and the driver will
> say something along the lines of:
>
> "VRAM bad pages: 80%, consider replacing the GPU. It is operating
> currently with degraded performance."
>
> or so?
Right. The sys admin can query the bad page count and decide when to
retire the card.
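
For illustration, here is a minimal sketch of how an admin script might
tally the bad pages. It assumes the count is exposed through the amdgpu RAS
sysfs file gpu_vram_bad_pages, with one "pfn : size : flag" line per page;
the path, the card index and the line format are assumptions here, so check
them against the kernel you are running. The helper names are made up for
the example.

```python
from collections import Counter
from pathlib import Path

# Assumed sysfs location; the card index and file name may differ per system.
BAD_PAGES_PATH = Path("/sys/class/drm/card0/device/ras/gpu_vram_bad_pages")


def count_bad_pages(text):
    """Count entries per flag, assuming lines of the form 'pfn : size : flag'."""
    counts = Counter()
    for line in text.splitlines():
        fields = [field.strip() for field in line.split(":")]
        if len(fields) != 3 or not fields[0].startswith("0x"):
            continue  # skip blank or unrecognized lines
        counts[fields[2]] += 1
    return counts


def read_bad_page_counts(path=BAD_PAGES_PATH):
    """Read the (assumed) sysfs file and return per-flag counts."""
    return count_bad_pages(path.read_text())


if __name__ == "__main__":
    try:
        counts = read_bad_page_counts()
    except OSError as err:
        print(f"could not read {BAD_PAGES_PATH}: {err}")
    else:
        print(f"bad pages: {sum(counts.values())} {dict(counts)}")
```

The parsing is kept separate from the file read so it can be checked against
a captured sample without the hardware present. Any retirement threshold is
left to site policy.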
>
> Yap, from a RAS perspective, that makes good sense as you're prolonging
> the life of the component while it still remains as operational as it
> can, and the only user interaction you need is she/he replacing it.
>
> Sounds good.
Yes. That's the idea.
Alex
>
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette