Message-ID: <CAJB-X+X_KW=T4WOe2AS3SFFQKjt7VcQRFUCGYFcjipi5-aXdrw@mail.gmail.com>
Date: Fri, 7 Jul 2023 01:40:40 +0800
From: Koba Ko <koba.ko@...onical.com>
To: "Zhuo, Qiuxu" <qiuxu.zhuo@...el.com>
Cc: "Luck, Tony" <tony.luck@...el.com>,
Kai-Heng Feng <kai.heng.feng@...onical.com>,
Markus Elfring <Markus.Elfring@....de>,
"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
"kernel-janitors@...r.kernel.org" <kernel-janitors@...r.kernel.org>,
Borislav Petkov <bp@...en8.de>,
James Morse <james.morse@....com>,
Mauro Carvalho Chehab <mchehab@...nel.org>,
Robert Richter <rric@...nel.org>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2] EDAC/i10nm: shift exponent is negative
On Thu, Jul 6, 2023 at 10:12 PM Zhuo, Qiuxu <qiuxu.zhuo@...el.com> wrote:
>
> > From: Luck, Tony <tony.luck@...el.com>
> > Sent: Wednesday, July 5, 2023 11:22 PM
> > ...
> > Subject: RE: [PATCH v2] EDAC/i10nm: shift exponent is negative
> >
> > >> # head /proc/cpuinfo
> >
> > This shows your system is the workstation version of Sapphire Rapids. I don't
> > think we did any validation of the EDAC driver against this model.
>
> No, we didn't do any validation of the EDAC driver on Sapphire Rapids workstations.
> From the link below, we know this is a Sapphire Rapids workstation with only 2 memory controllers.
> https://www.intel.com/content/www/us/en/products/sku/233480/intel-xeon-w32435-processor-22-5m-cache-3-10-ghz/specifications.html
>
> Until now we have only validated Sapphire Rapids servers, which have 4 memory controllers per socket.
>
> > > # dmidecode -t 17
> >
> > You have just one 16GB DIMM, and EDAC found that. So despite the messy
> > warnings, EDAC should be working for you.
> >
> > > # lspci
> >
> > I didn't dig into this. Qiuxu - can you compare this against a server Sapphire
> > Rapids?
> > Maybe it has some clues so the EDAC driver will know not to look for
> > non-existent memory controllers.
>
> This Sapphire Rapids workstation has 2 memory controllers, but 4 memory
> controller PCIe devices appear in the log:
>
> 0000:fe:0c.0 1101: 8086:324a
> 0000:fe:0d.0 1101: 8086:324a // absent mc fooling the driver, should not appear
> 0000:fe:0e.0 1101: 8086:324a
> 0000:fe:0f.0 1101: 8086:324a // absent mc fooling the driver, should not appear
>
> The MMIO registers of these absent memory controllers
> consistently hold the value ~0, so we may identify a
> memory controller as absent by checking whether its
> MMIO register "mcmtr" == ~0 in all of its channels.
>
> I made the patch below to skip all of these absent memory controllers:
> https://lore.kernel.org/linux-edac/20230706134216.37044-1-qiuxu.zhuo@intel.com/T/#u
> @Koba Ko, could you please verify the patch from the link above on your workstation? Thanks!
Here's the dmesg output with the patch applied (Ref. 1). I no longer see the previous message:
`EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)`
Ref. 1: https://drive.google.com/drive/folders/1xym9JgZZgaJ3EqtP4ccRcVeQYoJKmVlp?usp=sharing
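
For readers following the thread, the absent-controller check quoted above (treat a memory controller as absent when "mcmtr" reads ~0 in all of its channels) can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the code from the linked patch: `struct imc_snapshot`, `imc_absent()`, `count_present()`, and the two-channel count are hypothetical names and values, and the registers are modelled as an array rather than read over MMIO.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CHANNELS 2 /* assumption: channels per memory controller */

/* Hypothetical snapshot of one controller's per-channel "mcmtr" values. */
struct imc_snapshot {
	uint32_t mcmtr[NUM_CHANNELS];
};

/* Absent only if every channel reads back all ones (~0). */
static bool imc_absent(const struct imc_snapshot *imc)
{
	assert(imc != NULL);

	for (size_t ch = 0; ch < NUM_CHANNELS; ch++) {
		if (imc->mcmtr[ch] != UINT32_MAX)
			return false;
	}
	return true;
}

/* Count the controllers a driver would keep after skipping absent ones. */
static size_t count_present(const struct imc_snapshot *imcs, size_t n)
{
	size_t present = 0;

	for (size_t i = 0; i < n; i++) {
		if (!imc_absent(&imcs[i]))
			present++;
	}
	return present;
}
```

Requiring ~0 in all channels, rather than in any one channel, keeps a controller that has a single unpopulated or unreadable channel from being skipped by mistake.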
>
> BTW,
> Kai-Heng Feng also found the same issue before:
> https://lore.kernel.org/linux-edac/CAAd53p41Ku1m1rapeqb1xtD+kKuk+BaUW=dumuoF0ZO3GhFjFA@mail.gmail.com/T/#m5de16dce60a8c836ec235868c7c16e3fefad0cc2
>
> - Qiuxu