Message-ID: <4F984611.8040802@redhat.com>
Date: Wed, 25 Apr 2012 15:44:33 -0300
From: Mauro Carvalho Chehab <mchehab@...hat.com>
To: "Luck, Tony" <tony.luck@...el.com>
CC: Borislav Petkov <bp@...64.org>,
Linux Edac Mailing List <linux-edac@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Doug Thompson <norsk5@...oo.com>
Subject: Re: [EDAC PATCH v13 6/7] edac.h: Prepare to handle with generic layers
On 25-04-2012 15:32, Luck, Tony wrote:
>> See the driver: the only useful information provided by the MCA log is
>> that an error happened, its physical address, and the type of the
>> error. Unlike the Nehalem MCA, the MCE_MISC registers won't point to the
>> DIMM in the error.
>
> There's a bit more information in the MCA log than just the physical address:
>
> The cpu number that finds the data in its bank will provide socket information.
> [/proc/cpuinfo maps logical cpu numbers to "physical id"]
Yes, but this seems to differ from the CPU that actually has the memory
controller. The MCA registers have a bit that marks whether the error is at the
same CPU or at another one. So, when there are just 2 CPUs (sockets), this
could be used, but with more than 2 CPUs the bit can't say which of the other
sockets is involved, so this field is useless.
So, I opted not to trust it.
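
For reference, the logical-CPU-to-socket mapping Tony mentions can be read
from /proc/cpuinfo roughly as below. This is only a sketch: the function name
cpu_to_socket is made up for illustration, and it assumes the usual x86
"processor" and "physical id" fields are present. It gives the socket of the
CPU that logged the error, which, as noted above, is not necessarily the
socket that owns the memory controller.

```python
def cpu_to_socket(cpuinfo_text):
    """Map logical CPU number -> "physical id" (socket) from /proc/cpuinfo text.

    Hypothetical helper for illustration; assumes the x86 field names
    "processor" and "physical id".
    """
    mapping = {}
    cpu = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between CPU entries
        key = key.strip()
        value = value.strip()
        if key == "processor":
            cpu = int(value)
        elif key == "physical id" and cpu is not None:
            mapping[cpu] = int(value)
    return mapping


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        print(cpu_to_socket(f.read()))
```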
> Low order bits of the MCi_STATUS register will give the channel. See the SDM.
In all the tests I did, the channel reported via MCi_STATUS didn't match the
channel reported via the decoding logic. This might be due to some bug in the
pre-release CPUs I have used so far.
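
For reference, a sketch of how the channel could be pulled out of the low
order bits of MCi_STATUS. It assumes the SDM's compound error code layout for
memory controller errors, 000F 0000 1MMM CCCC, where CCCC is the channel and
1111 means "channel not specified". The function name is made up for
illustration; check the layout against the SDM before relying on it.

```python
def decode_memctrl_error(status):
    """Return (request_type, channel) from an MCi_STATUS value, or None.

    Assumed layout (per the SDM compound error codes): the low 16 bits are
    the MCA error code, and memory controller errors match
    000F 0000 1MMM CCCC. F (bit 12) is a filtering flag and is ignored here.
    """
    mcacod = status & 0xFFFF
    # Bits 15-13 and 11-8 must be zero and bit 7 must be set.
    if (mcacod & 0xEF80) != 0x0080:
        return None  # not a memory controller error code
    request_type = (mcacod >> 4) & 0x7  # MMM field
    channel = mcacod & 0xF              # CCCC field
    if channel == 0xF:
        channel = None                  # channel not specified
    return request_type, channel
```

A value such as 0x0091 would decode as request type 1 on channel 1, while
0x009F leaves the channel unspecified.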
> So the only missing information from the MCA log is which DIMM within
> the channel. I.e. we can pin the fault to a group of either two or
> three DIMMs depending on how many DIMMs/channel the motherboard supports.
>
> If you only have one DIMM per channel populated then socket/channel is
> sufficient to identify the DIMM.
>
> [We also don't have any intra-DIMM information for those customers who
> would like to diagnose the device on the DIMM, or which bits within
> the cache line had the error]
>
> -Tony