linux-kernel - RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <DM4PR12MB5263161FCF60F73D66EADFDEEED99@DM4PR12MB5263.namprd12.prod.outlook.com>
Date:   Mon, 13 Sep 2021 01:31:31 +0000
From:   "Joshi, Mukul" <Mukul.Joshi@....com>
To:     "Joshi, Mukul" <Mukul.Joshi@....com>,
        "Ghannam, Yazen" <Yazen.Ghannam@....com>
CC:     x86-ml <x86@...nel.org>,
        "Kasiviswanathan, Harish" <Harish.Kasiviswanathan@....com>,
        lkml <linux-kernel@...r.kernel.org>,
        "amd-gfx@...ts.freedesktop.org" <amd-gfx@...ts.freedesktop.org>,
        Borislav Petkov <bp@...en8.de>,
        Alex Deucher <alexdeucher@...il.com>
Subject: RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

[AMD Official Use Only]



> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@...ts.freedesktop.org> On Behalf Of Joshi,
> Mukul
> Sent: Thursday, July 29, 2021 8:00 PM
> To: Ghannam, Yazen <Yazen.Ghannam@....com>
> Cc: x86-ml <x86@...nel.org>; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@....com>; lkml <linux-kernel@...r.kernel.org>;
> amd-gfx@...ts.freedesktop.org; Borislav Petkov <bp@...en8.de>; Alex Deucher
> <alexdeucher@...il.com>
> Subject: RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
> 
> [CAUTION: External Email]
> 
> [AMD Official Use Only]
> 
> 
> 
> > -----Original Message-----
> > From: Ghannam, Yazen <Yazen.Ghannam@....com>
> > Sent: Thursday, June 3, 2021 5:13 PM
> > To: Joshi, Mukul <Mukul.Joshi@....com>
> > Cc: Borislav Petkov <bp@...en8.de>; Alex Deucher
> > <alexdeucher@...il.com>; x86-ml <x86@...nel.org>; Kasiviswanathan,
> > Harish <Harish.Kasiviswanathan@....com>; lkml
> > <linux-kernel@...r.kernel.org>; amd-gfx@...ts.freedesktop.org
> > Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for
> > Aldebaran
> >
> > On Thu, May 27, 2021 at 03:54:27PM -0400, Joshi, Mukul wrote:
> > ...
> > > > Is that the same deferred interrupt which calls
> > > > amd_deferred_error_interrupt() ?
> > >
> > > Sorry picking this up after sometime. I thought I had replied to this email.
> > > Yes it is the same deferred interrupt which calls
> > amd_deferred_error_interrupt().
> > >
> >
> > Mukul,
> >
> > Do you expect that the driver will need to mark pages with high
> > correctable error counts as bad? I think the hardware folks may want
> > the GPU memory errors to be handled more aggressively than CPU memory
> > errors. The specific threshold may change from product to product, so
> > it may make sense to hardcode this in the driver.
> >
> 
> Sorry I missed this email completely. Just saw it so responding now.
> 
> At the moment, we don't have a requirement to mark a page "bad" if there is a
> high correctable error counts.
> Our previous GPU ASICs which support RAS, also do not have such a feature.
> But you make a good point. It might be worthwhile to go and ask the hardware
> folks about it.
> 
> > We have similar functionality in the Correctable Errors Collector. But
> > enterprise users may prefer a direct approach done in the driver
> > (based on the hardware experts' guidance) instead of configuring the kernel at
> runtime.
> >
> > So I think having a separate priority may make sense if some special
> > functionality, or combination of behaviors, is needed which don't fall
> > under any exisiting things. In this case, "special functionality"
> > could be that the GPU memory needs to be handled differently than CPU
> memory.
> >
> > Another thing is that this behavior is similar to the NFIT behavior,
> > i.e. there's a memory error on an external device that needs to be
> > handled by the device's driver. So maybe we can rename MCE_PRIO_NFIT
> > to be generic
> > (MCE_PRIO_EXTERNAL?) and use that? Multiple notifiers with the same
> > priority is okay, right?
> >
> With respect to MCE priority, I was thinking of using the MCE_PRIO_EDAC
> instead of creating a new priority as the code in the GPU driver is doing error
> detection and handling the uncorrectable errors.
> Not sure if that aligns with the definition of EDAC device in the kernel.
> 
> What do you think?
> 
> Regards,
> Mukul
> 

After talking to Yazen, MCE_PRIO_UC might be a better choice for the MCE priority as we are dealing
only with uncorrectable errors.
I will be sending out a v2 patch with changes to use the MCE_PRIO_UC and drop the MCE_PRIO_ACCEL and see what the feedback is.

Thanks,
Mukul

> > Thanks,
> > Yazen
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@...ts.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.fre
> edesktop.org%2Fmailman%2Flistinfo%2Famd-
> gfx&amp;data=04%7C01%7Cmukul.joshi%40amd.com%7C7d32897fddef448ab0
> aa08d952ecf41f%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C6376
> 31999953383488%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJ
> QIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=YWZz9
> OYTMOhBl4183kV5ZYj01yw0xwNj%2BjTdXejFKH8%3D&amp;reserved=0