Date:   Thu, 29 Jul 2021 23:59:43 +0000
From:   "Joshi, Mukul" <Mukul.Joshi@....com>
To:     "Ghannam, Yazen" <Yazen.Ghannam@....com>
CC:     Borislav Petkov <bp@...en8.de>,
        Alex Deucher <alexdeucher@...il.com>, x86-ml <x86@...nel.org>,
        "Kasiviswanathan, Harish" <Harish.Kasiviswanathan@....com>,
        lkml <linux-kernel@...r.kernel.org>,
        "amd-gfx@...ts.freedesktop.org" <amd-gfx@...ts.freedesktop.org>
Subject: RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

[AMD Official Use Only]



> -----Original Message-----
> From: Ghannam, Yazen <Yazen.Ghannam@....com>
> Sent: Thursday, June 3, 2021 5:13 PM
> To: Joshi, Mukul <Mukul.Joshi@....com>
> Cc: Borislav Petkov <bp@...en8.de>; Alex Deucher <alexdeucher@...il.com>;
> x86-ml <x86@...nel.org>; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@....com>; lkml <linux-kernel@...r.kernel.org>;
> amd-gfx@...ts.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
> 
> On Thu, May 27, 2021 at 03:54:27PM -0400, Joshi, Mukul wrote:
> ...
> > > Is that the same deferred interrupt which calls
> > > amd_deferred_error_interrupt() ?
> >
> > Sorry, picking this up after some time. I thought I had replied to this email.
> > Yes, it is the same deferred interrupt which calls
> > amd_deferred_error_interrupt().
> >
> 
> Mukul,
> 
> Do you expect that the driver will need to mark pages with high correctable
> error counts as bad? I think the hardware folks may want the GPU memory
> errors to be handled more aggressively than CPU memory errors. The specific
> threshold may change from product to product, so it may make sense to
> hardcode this in the driver.
> 

Sorry, I missed this email completely. I just saw it, so I am responding now.

At the moment, we don't have a requirement to mark a page "bad" if it has a high correctable error count.
Our previous GPU ASICs that support RAS do not have such a feature either.
But you make a good point. It might be worthwhile to ask the hardware folks about it.

> We have similar functionality in the Correctable Errors Collector. But enterprise
> users may prefer a direct approach done in the driver (based on the hardware
> experts' guidance) instead of configuring the kernel at runtime.
> 
> So I think having a separate priority may make sense if some special
> functionality, or combination of behaviors, is needed which doesn't fall under
> any existing things. In this case, "special functionality" could be that the GPU
> memory needs to be handled differently than CPU memory.
> 
> Another thing is that this behavior is similar to the NFIT behavior, i.e. there's a
> memory error on an external device that needs to be handled by the device's
> driver. So maybe we can rename MCE_PRIO_NFIT to be generic
> (MCE_PRIO_EXTERNAL?) and use that? Multiple notifiers with the same priority
> is okay, right?
> 
With respect to MCE priority, I was thinking of using MCE_PRIO_EDAC instead of creating a new priority, since the code in the GPU driver does error detection and handles the uncorrectable errors.
I'm not sure whether that aligns with the definition of an EDAC device in the kernel.
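
To make the priority question concrete, here is a rough userspace sketch of how a priority-ordered notifier chain behaves when two notifiers share a priority. It is only an illustration under stated assumptions: every name in it (demo_prio, demo_notifier, demo_register, demo_call_chain, DEMO_GPU_BASE) is hypothetical and is not the kernel's MCE decode chain API, and the numeric priority values are made up.

```c
#include <stddef.h>

/* Hypothetical priorities for illustration only; not the kernel's enum. */
enum demo_prio {
    DEMO_PRIO_LOWEST = 0,
    DEMO_PRIO_EDAC = 1,
    DEMO_PRIO_EXTERNAL = 2
};

/* Assumed start of a GPU-owned address range (made-up value). */
#define DEMO_GPU_BASE 0x100000000UL

struct demo_notifier {
    int priority;
    int (*call)(unsigned long addr);   /* nonzero return means "handled" */
    struct demo_notifier *next;
};

/*
 * Insert a notifier so the chain stays sorted by descending priority.
 * A notifier with the same priority as an existing one is placed after it,
 * so sharing a priority is allowed and registration order breaks the tie.
 */
static void demo_register(struct demo_notifier **head, struct demo_notifier *nb)
{
    while (*head && (*head)->priority >= nb->priority)
        head = &(*head)->next;
    nb->next = *head;
    *head = nb;
}

/*
 * Walk the chain from highest to lowest priority and stop at the first
 * notifier that reports the error as handled. Returns how many notifiers
 * were called.
 */
static int demo_call_chain(struct demo_notifier *head, unsigned long addr)
{
    int calls = 0;

    for (; head; head = head->next) {
        calls++;
        if (head->call(addr))
            break;
    }
    return calls;
}

/* A notifier that only observes the error and never claims it. */
static int demo_log_only(unsigned long addr)
{
    (void)addr;
    return 0;
}

/* A GPU-style notifier that claims errors inside its own address range. */
static int demo_gpu_handler(unsigned long addr)
{
    return addr >= DEMO_GPU_BASE;
}
```

In this sketch, registering a GPU handler at DEMO_PRIO_EDAC next to an existing EDAC-priority notifier simply queues it behind that notifier, while anything at a higher priority still runs first.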

What do you think?

Regards,
Mukul

> Thanks,
> Yazen
