Message-ID: <CY8PR11MB713495A12DE47EDC3B7C5E20892CA@CY8PR11MB7134.namprd11.prod.outlook.com>
Date:   Thu, 6 Jul 2023 14:11:54 +0000
From:   "Zhuo, Qiuxu" <qiuxu.zhuo@...el.com>
To:     "Luck, Tony" <tony.luck@...el.com>,
        Koba Ko <koba.ko@...onical.com>,
        Kai-Heng Feng <kai.heng.feng@...onical.com>
CC:     Markus Elfring <Markus.Elfring@....de>,
        "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
        "kernel-janitors@...r.kernel.org" <kernel-janitors@...r.kernel.org>,
        Borislav Petkov <bp@...en8.de>,
        "James Morse" <james.morse@....com>,
        Mauro Carvalho Chehab <mchehab@...nel.org>,
        Robert Richter <rric@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v2] EDAC/i10nm: shift exponent is negative

> From: Luck, Tony <tony.luck@...el.com>
> Sent: Wednesday, July 5, 2023 11:22 PM
> ...
> Subject: RE: [PATCH v2] EDAC/i10nm: shift exponent is negative
> 
> >> # head /proc/cpuinfo
> 
> This shows your system is the workstation version of Sapphire rapids. I don't
> think we did any validation of the EDAC driver against this model.

No, we didn't do any validation of EDAC on Sapphire Rapids workstations.
From the link below, we know this is a Sapphire Rapids workstation with only 2 memory controllers.
https://www.intel.com/content/www/us/en/products/sku/233480/intel-xeon-w32435-processor-22-5m-cache-3-10-ghz/specifications.html

Previously, we only validated the driver on Sapphire Rapids servers, which have 4 memory controllers per socket.

> > # dmidecode -t 17
> 
> You have just one 16GB DIMM, and EDAC found that. So despite the messy
> warnings, EDAC should be working for you.
> 
> > # lspci
> 
> I didn't dig into this. Qiuxu - can you compare this against a server Sapphire
> rapids?
> Maybe it has some clues so the EDAC driver will know not to look for non-
> existent memory controllers.

This Sapphire Rapids workstation has only 2 memory controllers, but the log
shows 4 memory controller PCIe devices:

    0000:fe:0c.0 1101: 8086:324a
    0000:fe:0d.0 1101: 8086:324a // absent mc fooling the driver, should not appear
    0000:fe:0e.0 1101: 8086:324a
    0000:fe:0f.0 1101: 8086:324a // absent mc fooling the driver, should not appear

I observed that the MMIO registers of these absent memory controllers
consistently read as ~0. So we can identify a memory controller as
absent by checking whether its MMIO register "mcmtr" == ~0 in all of
its channels.

I made the patch below to skip all of these absent memory controllers:
https://lore.kernel.org/linux-edac/20230706134216.37044-1-qiuxu.zhuo@intel.com/T/#u
@Koba Ko, could you please verify the patch from the link above on your workstation? Thanks! 

BTW,
Kai-Heng Feng also found the same issue before:
https://lore.kernel.org/linux-edac/CAAd53p41Ku1m1rapeqb1xtD+kKuk+BaUW=dumuoF0ZO3GhFjFA@mail.gmail.com/T/#m5de16dce60a8c836ec235868c7c16e3fefad0cc2

- Qiuxu
