linux-kernel - RE: [PATCHv2 2/2] drm/amdgpu: Register MCE notifier for Aldebaran RAS

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <DM4PR12MB5263059B0C67BCA373AB3AEDEEA29@DM4PR12MB5263.namprd12.prod.outlook.com>
Date:   Wed, 22 Sep 2021 19:43:58 +0000
From:   "Joshi, Mukul" <Mukul.Joshi@....com>
To:     Borislav Petkov <bp@...en8.de>
CC:     "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
        "x86@...nel.org" <x86@...nel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "mchehab@...nel.org" <mchehab@...nel.org>,
        "Ghannam, Yazen" <Yazen.Ghannam@....com>,
        "amd-gfx@...ts.freedesktop.org" <amd-gfx@...ts.freedesktop.org>
Subject: RE: [PATCHv2 2/2] drm/amdgpu: Register MCE notifier for Aldebaran RAS

[AMD Official Use Only]



> -----Original Message-----
> From: Borislav Petkov <bp@...en8.de>
> Sent: Wednesday, September 22, 2021 7:41 AM
> To: Joshi, Mukul <Mukul.Joshi@....com>
> Cc: linux-edac@...r.kernel.org; x86@...nel.org; linux-kernel@...r.kernel.org;
> mingo@...hat.com; mchehab@...nel.org; Ghannam, Yazen
> <Yazen.Ghannam@....com>; amd-gfx@...ts.freedesktop.org
> Subject: Re: [PATCHv2 2/2] drm/amdgpu: Register MCE notifier for Aldebaran
> RAS
> 
> [CAUTION: External Email]
> 
> On Sun, Sep 12, 2021 at 10:13:11PM -0400, Mukul Joshi wrote:
> > On Aldebaran, GPU driver will handle bad page retirement even though
> > UMC is host managed. As a result, register a bad page retirement
> > handler on the mce notifier chain to retire bad pages on Aldebaran.
> >
> > v1->v2:
> > - Use smca_get_bank_type() to determine MCA bank.
> > - Envelope the changes under #ifdef CONFIG_X86_MCE_AMD.
> > - Use MCE_PRIORITY_UC instead of MCE_PRIO_ACCEL as we are
> >   only handling uncorrectable errors.
> > - Use macros to determine UMC instance and channel instance
> >   where the uncorrectable error occured.
> > - Update the headline.
> 
> Same note as for the previous patch.
> 
Acked.

> > +static int amdgpu_bad_page_notifier(struct notifier_block *nb,
> > +                                 unsigned long val, void *data) {
> > +     struct mce *m = (struct mce *)data;
> > +     struct amdgpu_device *adev = NULL;
> > +     uint32_t gpu_id = 0;
> > +     uint32_t umc_inst = 0;
> > +     uint32_t ch_inst, channel_index = 0;
> > +     struct ras_err_data err_data = {0, 0, 0, NULL};
> > +     struct eeprom_table_record err_rec;
> > +     uint64_t retired_page;
> > +
> > +     /*
> > +      * If the error was generated in UMC_V2, which belongs to GPU UMCs,
> > +      * and error occurred in DramECC (Extended error code = 0) then only
> > +      * process the error, else bail out.
> > +      */
> > +     if (!m || !((smca_get_bank_type(m->bank) == SMCA_UMC_V2) &&
> > +                 (XEC(m->status, 0x1f) == 0x0)))
> > +             return NOTIFY_DONE;
> > +
> > +     /*
> > +      * GPU Id is offset by GPU_ID_OFFSET in MCA_IPID_UMC register.
> > +      */
> > +     gpu_id = GET_MCA_IPID_GPUID(m->ipid) - GPU_ID_OFFSET;
> > +
> > +     adev = find_adev(gpu_id);
> > +     if (!adev) {
> > +             dev_warn(adev->dev, "%s: Unable to find adev for gpu_id: %d\n",
> > +                                  __func__, gpu_id);
> > +             return NOTIFY_DONE;
> > +     }
> > +
> > +     /*
> > +      * If it is correctable error, return.
> > +      */
> > +     if (mce_is_correctable(m)) {
> > +             return NOTIFY_OK;
> > +     }
> 
> This can run before you find_adev().
> 
Acked.

> > +static void amdgpu_register_bad_pages_mca_notifier(void)
> > +{
> > +     /*
> > +      * Register the x86 notifier only once
> > +      * with MCE subsystem.
> > +      */
> > +     if (notifier_registered == false) {
> > +             mce_register_decode_chain(&amdgpu_bad_page_nb);
> > +             notifier_registered = true;
> > +     }
> 
> I have a patchset which will get rid of the need to do this silliness - only if I had
> some time to actually prepare it for submission... :-\
> 
:-\

Thank you.

Regards,
Mukul
> --
> Regards/Gruss,
>     Boris.
> 
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.
> kernel.org%2Ftglx%2Fnotes-about-
> netiquette&amp;data=04%7C01%7Cmukul.joshi%40amd.com%7C7ae9c87153f7
> 4572712908d97dbdcc7a%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0
> %7C637679076423976867%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAw
> MDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata
> =M5UDrSea0lnEi5%2BhU4ck0zk1dZD9kX4DUoXt95J6dJ4%3D&amp;reserved=0