lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Fri, 8 Jul 2016 12:14:53 +0200
From:	Borislav Petkov <bp@...en8.de>
To:	Ingo Molnar <mingo@...nel.org>
Cc:	LKML <linux-kernel@...r.kernel.org>,
	Yazen Ghannam <Yazen.Ghannam@....com>,
	Thomas Gleixner <tglx@...utronix.de>,
	"H. Peter Anvin" <hpa@...or.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>
Subject: Re: [PATCH 3/6] x86/mce: Add support for new MCA_SYND register

On Fri, Jul 08, 2016 at 11:46:53AM +0200, Ingo Molnar wrote:
> I'm not sure I can parse that: how can a reported error have bits corrupted?

No, it is about the actual bits in memory the ECC error is generated
for. So, for example, if an ECC error reports that memory location X had
some bit flips, the syndrome value which gets reported together with
same ECC error shows which actual bits have flipped.

Here's an example from the AMD BKDG, maybe that'll make it more clear:

http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf

Go to page 246, there it says this:

"For example, assume the ECC syndrome is 03EAh. First search row EAh
for the complete syndrome. Since it is not found, search row 03h for
the complete syndrome. It is found in column 9h, so symbol 9h has the
error. Since the error bitmask indicates value 3h (0011b), bits 0 and 1
within that symbol are corrupted. Symbol 9h maps to bits 72-79, so the
corrupted bits are 72 and 73 of the line."

So you basically search the table of x8 ECC correctable syndromes, first
in row EAh (second syndrome byte) and if you don't find the complete
syndrome there, you search row 03 for it.

It is in column 9 and that means symbol 9. The symbols are 16 - one
symbol for each byte in a 128bit DRAM word + 3 special symbols for the
ECC bits.

The row number 3h is also the error bitmask, so bits 0 and 1 are the
ones which are corrupted.

Which means, when you look at the value in DRAM at the address the error
was reported, you need to go to symbol 9, that's 9*8 = 72 which means,
bits 72-79 and the first 2 in that byte are bits 72 and 73.

So if you want to correct them, you simply flip them as the syndrome
tells you that those 2 are corrupted.

Ok?

See how easy it is :-)))

> I'm fine with an add-on patch that adds a good explanation for all
> this to the code.

How about we point to that section in the BKDG? I think it is written
pretty understandably for a technical document and the example makes it
even more explicit.

:-)

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ