Message-ID: <CY8PR11MB713495A12DE47EDC3B7C5E20892CA@CY8PR11MB7134.namprd11.prod.outlook.com>
Date: Thu, 6 Jul 2023 14:11:54 +0000
From: "Zhuo, Qiuxu" <qiuxu.zhuo@...el.com>
To: "Luck, Tony" <tony.luck@...el.com>,
Koba Ko <koba.ko@...onical.com>,
Kai-Heng Feng <kai.heng.feng@...onical.com>
CC: Markus Elfring <Markus.Elfring@....de>,
"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
"kernel-janitors@...r.kernel.org" <kernel-janitors@...r.kernel.org>,
Borislav Petkov <bp@...en8.de>,
"James Morse" <james.morse@....com>,
Mauro Carvalho Chehab <mchehab@...nel.org>,
Robert Richter <rric@...nel.org>,
LKML <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v2] EDAC/i10nm: shift exponent is negative
> From: Luck, Tony <tony.luck@...el.com>
> Sent: Wednesday, July 5, 2023 11:22 PM
> ...
> Subject: RE: [PATCH v2] EDAC/i10nm: shift exponent is negative
>
> >> # head /proc/cpuinfo
>
> This shows your system is the workstation version of Sapphire rapids. I don't
> think we did any validation of the EDAC driver against this model.
No, we didn't do any validation of the EDAC driver on Sapphire Rapids workstations.
From the link below, we know this is a Sapphire Rapids workstation with only 2 memory controllers.
https://www.intel.com/content/www/us/en/products/sku/233480/intel-xeon-w32435-processor-22-5m-cache-3-10-ghz/specifications.html
Previously, we only validated the driver on Sapphire Rapids servers, which have 4 memory controllers per socket.
> > # dmidecode -t 17
>
> You have just one 16GB DIMM, and EDAC found that. So despite the messy
> warnings, EDAC should be working for you.
>
> > # lspci
>
> I didn't dig into this. Qiuxu - can you compare this against a server Sapphire
> rapids?
> Maybe it has some clues so the EDAC driver will know not to look for non-
> existent memory controllers.
This Sapphire Rapids workstation has 2 memory controllers, but
4 memory controller PCIe devices appear in the log:
0000:fe:0c.0 1101: 8086:324a
0000:fe:0d.0 1101: 8086:324a // absent mc fooling the driver, should not appear
0000:fe:0e.0 1101: 8086:324a
0000:fe:0f.0 1101: 8086:324a // absent mc fooling the driver, should not appear
We observed that the MMIO registers of these absent
memory controllers consistently hold the value ~0.
So we may identify a memory controller as absent by checking
whether its MMIO register "mcmtr" == ~0 in all of its channels.
I made the patch below to skip all these absent memory controllers:
https://lore.kernel.org/linux-edac/20230706134216.37044-1-qiuxu.zhuo@intel.com/T/#u
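To illustrate the idea only (this is not the patch from the link above), here is a
minimal userspace sketch of the check in C. It assumes "mcmtr" is read as a 32-bit
value and that the per-channel readings are already collected into a flat array.
The names mc_looks_absent() and present_mc_mask() are hypothetical and do not
exist in the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: an absent controller reads back all ones (~0) as a 32-bit value. */
#define MCMTR_ALL_ONES UINT32_MAX

/*
 * Hypothetical helper: return true only if every channel's "mcmtr"
 * reading equals ~0. A single channel with any other value means the
 * controller is treated as present.
 */
static bool mc_looks_absent(const uint32_t *mcmtr, size_t num_chans)
{
	assert(mcmtr != NULL && num_chans > 0);

	for (size_t ch = 0; ch < num_chans; ch++) {
		if (mcmtr[ch] != MCMTR_ALL_ONES)
			return false;
	}
	return true;
}

/*
 * Hypothetical helper: build a bitmask of controllers that look present.
 * "mcmtr" holds num_mcs * chans_per_mc readings, grouped by controller.
 * Bit N is set when controller N has at least one channel != ~0.
 */
static unsigned int present_mc_mask(const uint32_t *mcmtr, size_t num_mcs,
				    size_t chans_per_mc)
{
	unsigned int mask = 0;

	assert(num_mcs <= 32);

	for (size_t mc = 0; mc < num_mcs; mc++) {
		if (!mc_looks_absent(&mcmtr[mc * chans_per_mc], chans_per_mc))
			mask |= 1u << mc;
	}
	return mask;
}
```

With made-up readings in which the 2nd and 4th controllers return ~0 on every
channel (matching the fe:0d.0 and fe:0f.0 devices above), present_mc_mask()
would keep only controllers 0 and 2, and a driver could skip the rest.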
@Koba Ko, could you please verify the patch from the link above on your workstation? Thanks!
BTW,
Kai-Heng Feng also found the same issue before:
https://lore.kernel.org/linux-edac/CAAd53p41Ku1m1rapeqb1xtD+kKuk+BaUW=dumuoF0ZO3GhFjFA@mail.gmail.com/T/#m5de16dce60a8c836ec235868c7c16e3fefad0cc2
- Qiuxu