Date:   Thu, 13 May 2021 10:17:47 -0400
From:   Alex Deucher <alexdeucher@...il.com>
To:     Borislav Petkov <bp@...en8.de>
Cc:     "Joshi, Mukul" <Mukul.Joshi@....com>, x86-ml <x86@...nel.org>,
        "Kasiviswanathan, Harish" <Harish.Kasiviswanathan@....com>,
        lkml <linux-kernel@...r.kernel.org>,
        "amd-gfx@...ts.freedesktop.org" <amd-gfx@...ts.freedesktop.org>
Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

On Thu, May 13, 2021 at 9:26 AM Borislav Petkov <bp@...en8.de> wrote:
>
> On Thu, May 13, 2021 at 03:20:36AM +0000, Joshi, Mukul wrote:
> > Exporting smca_get_bank_type() works fine when CONFIG_X86_MCE_AMD is defined.
> > I would need to put #ifdef CONFIG_X86_MCE_AMD in my code to compile the amdgpu
> > driver when CONFIG_X86_MCE_AMD is not defined.
> > I can avoid all that by using is_smca_umc_v2().
> > I think it would be cleaner with using is_smca_umc_v2().
>
> See how smca_get_long_name() is exported and export that function the
> same way.
>
> To save you some energy: is_smca_umc_v2() is not going to happen.
>
> > You can think of the GPU device as an EDAC device here. It is mainly
> > interested in handling uncorrectable errors.
>
> An EDAC "device", as you call it, is not interested in handling UEs. If
> anything, it counts them.
>
> > It is a deferred interrupt that generates an MCE.
>
> Is that the same deferred interrupt which calls amd_deferred_error_interrupt() ?
>
> > When an uncorrectable error is detected on the GPU UMC, all we are
> > doing is determining the physical address where the error occurred and
> > then "retiring" the page that address belongs to.
>
> What page is that? Normal DRAM page or a page in some special GPU memory?
>

GPU memory.

> > By retiring, we mean we reserve the page so that it is not available
> > for allocations to any applications.
>
> We do that for normal DRAM memory pages by poisoning them. I hope you
> don't mean that.
>
> Looking at
>
> amdgpu_ras_add_bad_pages
> |-> amdgpu_vram_mgr_reserve_range
>
> that's some VRAM thing so I'm guessing special memory on the GPU.
>

Yes.

> If so, what happens with all those "retired" pages when you reboot?
> They're getting used again and potentially trigger the same UEs and the
> same retiring happens?

The bad pages are recorded in an EEPROM on the board. The next time
the driver loads, it reads the EEPROM and reserves those pages at
init time, so they don't get used again.
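
To make the flow above concrete, here is a minimal userspace sketch of
the idea, not the actual amdgpu code. Every name in it (bad_page_store,
vram_mgr, retire_page, reserve_bad_pages_at_init) is hypothetical, the
EEPROM is modelled as a plain in-memory table, and the 4 KiB page size
and 1024-page VRAM are assumptions chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12   /* assumed 4 KiB pages */
#define SKETCH_VRAM_PAGES 1024 /* assumed VRAM size, in pages */
#define SKETCH_MAX_RECORDS 64  /* assumed capacity of the record table */

/* Hypothetical stand-in for the bad page table kept in the board EEPROM. */
struct bad_page_store {
    uint64_t pfn[SKETCH_MAX_RECORDS];
    size_t count;
};

/* Hypothetical VRAM manager state: one "reserved" flag per page frame. */
struct vram_mgr {
    bool reserved[SKETCH_VRAM_PAGES];
};

/*
 * Runtime path: an uncorrectable error was reported at addr.
 * Reserve the containing page and persist its frame number.
 * Returns true only if a new record was written to the store.
 */
static bool retire_page(struct bad_page_store *store, struct vram_mgr *mgr,
                        uint64_t addr)
{
    assert(store != NULL && mgr != NULL);

    uint64_t pfn = addr >> SKETCH_PAGE_SHIFT;
    if (pfn >= SKETCH_VRAM_PAGES)
        return false;               /* address outside modelled VRAM */

    mgr->reserved[pfn] = true;      /* keep it away from allocations now */

    for (size_t i = 0; i < store->count; i++)
        if (store->pfn[i] == pfn)
            return false;           /* already recorded earlier */

    if (store->count == SKETCH_MAX_RECORDS)
        return false;               /* record table is full */

    store->pfn[store->count++] = pfn;
    return true;
}

/*
 * Init path: replay the persisted records so previously retired pages
 * are reserved before any allocation can use them.
 * Returns the number of pages newly reserved.
 */
static size_t reserve_bad_pages_at_init(const struct bad_page_store *store,
                                        struct vram_mgr *mgr)
{
    assert(store != NULL && mgr != NULL);

    size_t reserved = 0;
    for (size_t i = 0; i < store->count; i++) {
        uint64_t pfn = store->pfn[i];
        if (pfn < SKETCH_VRAM_PAGES && !mgr->reserved[pfn]) {
            mgr->reserved[pfn] = true;
            reserved++;
        }
    }
    return reserved;
}
```

In this model a "reboot" is just a fresh, zeroed vram_mgr paired with the
same bad_page_store: calling reserve_bad_pages_at_init() on it marks the
previously retired frames as reserved again, which is the behaviour
described above.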

Alex


>
> > We are providing information to the user by storing all the
> > information about the retired pages in EEPROM. This can be accessed
> > through sysfs.
>
> Ok, I'm a user and I can access that information through sysfs. What can
> I do with it?
>
> > Hope it clears what "bad page retirement" is achieving.
>
> It is getting there.
>
> Thx.
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@...ts.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
