linux-kernel - Re: [PATCH 0/6] Add a per-dimm structure


Message-ID: <4F61E35B.9000906@redhat.com>
Date:	Thu, 15 Mar 2012 09:40:59 -0300
From:	Mauro Carvalho Chehab <mchehab@...hat.com>
To:	Borislav Petkov <bp@...64.org>
CC:	Greg KH <gregkh@...uxfoundation.org>,
	Linux Edac Mailing List <linux-edac@...r.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 0/6] Add a per-dimm structure

On 15-03-2012 08:31, Borislav Petkov wrote:
> On Wed, Mar 14, 2012 at 10:44:13PM -0300, Mauro Carvalho Chehab wrote:
>> As I said, that is easy to implement. The hard part would be what to do with
>> the per-csrow/per-branch error counters that exist currently at EDAC.
>>
>> From my side, I'm OK to remove them, but, as I said before, existing user tools
>> use them,
> 
> What are you talking about? Those per-rank counters should be the same
> as the per-csrow ch0 and ch1 counters...

Yes, but with your proposal there would be no per-csrow counters
(the equivalent of):
	/sys/devices/system/edac/mc/mc0/csrow0/ue_count
	/sys/devices/system/edac/mc/mc0/csrow0/ce_count
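
For reference, a minimal sketch of how a user tool might read these two
counters. Only the two sysfs paths above come from this thread; the function
name `read_csrow_counts` and its `root` parameter are assumptions made for
illustration, and existing tools may well do this differently.

```python
from pathlib import Path

# Root directory taken from the sysfs paths quoted above. It is a parameter
# so the sketch can be pointed at a test directory instead of a real system.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_csrow_counts(mc=0, csrow=0, root=EDAC_MC_ROOT):
    """Return the UE and CE counters of one csrow as a dict of ints.

    Hypothetical helper: it only assumes that each counter file holds a
    single decimal number.
    """
    csrow_dir = Path(root) / f"mc{mc}" / f"csrow{csrow}"
    counts = {}
    for name in ("ue_count", "ce_count"):
        counts[name] = int((csrow_dir / name).read_text().strip())
    return counts


if __name__ == "__main__":
    try:
        print(read_csrow_counts())
    except OSError as exc:
        # No EDAC driver loaded, or no such csrow on this machine.
        print(f"EDAC csrow counters not available: {exc}")
```

A tool written this way stops working if the csrow-level files go away,
which is the compatibility concern here.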

>> especially because UE errors aren't per-rank/per-dimm on the
>> typical case (128 bits cacheline).
> 
> It depends - if the 128 bit word comes from a single DIMM (unganged
> mode) then you have a per-rank UE.

True, and there are other types of ECC logic that would allow identifying
which DIMM/rank produced the error.

Yet the typical case is to use two DIMMs on separate channels for a
128-bit cacheline, for performance reasons, with chipkill ECC using the
128+16 bits, as that improves the probability of error correction.

>> Of course, the EDAC logic could increment multiple UE error counters
>> in such case, (meaning that an error happened on either one of the
>> affected DIMMs/Ranks) but this is a different behavior than the
>> current API.
> 
> Well, the API should be changed to accommodate such configurations.

True, but changing the propagation logic to propagate the error down
to the several DIMMs where the error might have occurred is:

	- the opposite of the current propagation logic;

	- the opposite of how the ITU-T TMN architecture and all EMS/NMS
	  implementations I'm aware of work.

So, using such propagation logic doesn't sound right to me. What I'm
saying is that, if all the driver can be sure of is that the error
happened at the csrow level, it should not propagate the error down to
the channel level.
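
To make the point concrete, here is a rough sketch of the counting rule
described above: an error is always counted at the csrow level, and at the
channel level only when the driver actually identified the channel. The class
and method names (`ErrorCounters`, `report`) are hypothetical and do not
reflect the EDAC core's actual data structures.

```python
from collections import Counter


class ErrorCounters:
    """Hypothetical error counters, for illustration only."""

    def __init__(self):
        self.per_csrow = Counter()    # key: csrow index
        self.per_channel = Counter()  # key: (csrow index, channel index)

    def report(self, csrow, channel=None):
        # The csrow contains all of its channels, so the csrow-level
        # counter is incremented for every error.
        self.per_csrow[csrow] += 1
        # The channel-level counter is touched only when the driver knows
        # the channel. An error known only at csrow level is NOT pushed
        # down to every channel that might have been involved.
        if channel is not None:
            self.per_channel[(csrow, channel)] += 1


counters = ErrorCounters()
counters.report(csrow=0, channel=1)  # error located on channel 1
counters.report(csrow=0)             # error located only at csrow level
```

After these two reports the csrow 0 counter holds 2 while the channel 1
counter holds 1, so the csrow-level error is not lost and is not blamed on
any particular channel.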

So I think that a csrow-level counter is needed (and the equivalent
"group" counters for non-rank-based memory controllers).

Regards,
Mauro.