linux-kernel - Re: [PATCH 3/6] x86/mce: Add support for new MCA

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20160708101452.GD3808@pd.tnic>
Date:	Fri, 8 Jul 2016 12:14:53 +0200
From:	Borislav Petkov <bp@...en8.de>
To:	Ingo Molnar <mingo@...nel.org>
Cc:	LKML <linux-kernel@...r.kernel.org>,
	Yazen Ghannam <Yazen.Ghannam@....com>,
	Thomas Gleixner <tglx@...utronix.de>,
	"H. Peter Anvin" <hpa@...or.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>
Subject: Re: [PATCH 3/6] x86/mce: Add support for new MCA_SYND register

On Fri, Jul 08, 2016 at 11:46:53AM +0200, Ingo Molnar wrote:
> I'm not sure I can parse that: how can a reported error have bits corrupted?

No, it is about the actual bits in memory the ECC error is generated
for. So, for example, if an ECC error reports that memory location X had
some bit flips, the syndrome value which gets reported together with
same ECC error shows which actual bits have flipped.

Here's an example from the AMD BKDG, maybe that'll make it more clear:

http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf

Go to page 246, there it says this:

"For example, assume the ECC syndrome is 03EAh. First search row EAh
for the complete syndrome. Since it is not found, search row 03h for
the complete syndrome. It is found in column 9h, so symbol 9h has the
error. Since the error bitmask indicates value 3h (0011b), bits 0 and 1
within that symbol are corrupted. Symbol 9h maps to bits 72-79, so the
corrupted bits are 72 and 73 of the line."

So you basically search the table of x8 ECC correctable syndromes, first
in row EAh (second syndrome byte) and if you don't find the complete
syndrome there, you search row 03 for it.

It is in column 9 and that means symbol 9. The symbols are 16 - one
symbol for each byte in a 128bit DRAM word + 3 special symbols for the
ECC bits.

The row number 3h is also the error bitmask, so bits 0 and 1 are the
ones which are corrupted.

Which means, when you look at the value in DRAM at the address the error
was reported, you need to go to symbol 9, that's 9*8 = 72 which means,
bits 72-79 and the first 2 in that byte are bits 72 and 73.

So if you want to correct them, you simply flip them as the syndrome
tells you that those 2 are corrupted.

Ok?

See how easy it is :-)))

> I'm fine with an add-on patch that adds a good explanation for all
> this to the code.

How about we point to that section in the BKDG? I think it is written
pretty understandably for a technical document and the example makes it
even more explicit.

:-)

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.