linux-kernel - Re: [RFC PATCH 00/21 v2] amd64

Message-ID: <20090501123956.GA4225@elte.hu>
Date:	Fri, 1 May 2009 14:39:56 +0200
From:	Ingo Molnar <mingo@...e.hu>
To:	Andi Kleen <andi@...stfloor.org>
Cc:	Borislav Petkov <borislav.petkov@....com>,
	akpm@...ux-foundation.org, greg@...ah.com, tglx@...utronix.de,
	hpa@...or.com, dougthompson@...ssion.com,
	linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 00/21 v2] amd64_edac: EDAC module for AMD64

* Andi Kleen <andi@...stfloor.org> wrote:

> > Kconfig, mce code delivers needed error info to edac which, in 
> > turn, goes and decodes the error/does the mapping to DIMM 
> > blocks/supplies DRAM error injection facility for testing 
> > purposes and similar things. That way you have both and they 
> > don't overlap in functionality.
> 
> You can do that, but it's redundant because mcelog can do this 
> already. [...]

The thing is, when we took up x86 maintenance i had a good look at 
the MCE situation, i checked both the kernel and the user-space 
side.

The kernel side MCE code was in pretty bad shape to begin with, but 
mcelog (the user-space tool) is a big stinking pile of poo on every 
level.

It's one of the worst pieces of kernel-related code i ever saw. I 
think you wrote all of it, and you should be ashamed of that code, 
and you should be ashamed of the design and you should be ashamed of 
the concept.

It even came with its own 'database' code: mcelog*/db.[ch] is 600+ 
lines of needless code instead of obvious library use. It's NIH and 
self-serving complexity all over.

And the thing is, mcelog/mcedecode never really _did_ anything real 
and useful, other than to:

 1) Confuse kernel users who see a fatal MCE panic, with cryptic, 
    quirky codes, who write that down on paper, then run it through 
    the user-space tool - just to see a piece of information the 
    kernel could have provided already. (if they didn't make any 
    mistakes while writing down the codes)

 2) Decode a quirky, binary MCE record and combine it with DMI data.
    (which the kernel can and should do just fine.)
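
To make "decode a binary MCE record" concrete, here is a minimal sketch of 
splitting an x86 MCi_STATUS value into its architectural fields. It assumes 
only the documented flag bits 63..57 and the two 16-bit error-code fields; 
the function names are hypothetical, it is not taken from mcelog or from the 
EDAC code, and the lower model-specific bits are deliberately left 
uninterpreted because they differ between vendors.

```python
# Architectural flag bits of an x86 MCi_STATUS register (bits 63..57).
# Bits below 57 are partly vendor/model specific and are not decoded here.
MCI_STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error record
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # MCi_MISC holds additional information
    58: "ADDRV",  # MCi_ADDR holds the address of the error
    57: "PCC",    # processor context may be corrupt
}


def decode_mci_status(status):
    """Split a raw 64-bit MCi_STATUS value into named fields (hypothetical helper)."""
    flags = [name for bit, name in MCI_STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


def format_mci_status(bank, status):
    """Render one human-readable line for a bank's status (hypothetical helper)."""
    fields = decode_mci_status(status)
    if "VAL" not in fields["flags"]:
        return f"bank {bank}: no valid error logged"
    severity = "uncorrected" if "UC" in fields["flags"] else "corrected"
    return (
        f"bank {bank}: {severity} error, "
        f"mca code 0x{fields['mca_error_code']:04x}, "
        f"flags {'|'.join(fields['flags'])}"
    )
```

Calling format_mci_status() with a bank number and the raw status word gives 
one readable line instead of a hex dump to be copied down by hand; mapping 
the address to a DIMM would additionally need the memory-controller 
knowledge that EDAC carries.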

Yes, i know about tolerant=3 and certain people/companies opting to 
ignore MCE fatality levels and live dangerously (and i also know 
about non-fatal reporting and correction extensions in hw) - but for 
99.999% of the Linux users the whole thing is just needless 
complexity today, that does not offer anything valuable.

And that is really what happens when code is misdesigned and the 
wrong pieces of code are pushed to user-space: a crappy, limited ABI 
and an under-maintained, big pile of junk user-space kit.

The obvious truth is that hardware faults have to be caught, decoded 
and optionally handled by the kernel.

The EDAC code at least has a sane design: it realizes that hardware 
faults _must_ be fully known, decoded and potentially handled in the 
kernel.

Piggyback-ing to user-space is plain idiotic and not defensible. So 
if a piece of hardware capability is handled by the EDAC code, the 
x86 MCE code will step aside and will stay the heck out of that 
business. At least until the two concepts are merged into some sane 
kernel hardware fault logging and handling framework.

And Andi, until you grasp such _basic_ design concepts, you 
have no business writing such code really. You should stay the heck 
away from it and you should stop 'advising' people who made the 
right calls while you messed up. It is mcelog that is crap, not the 
EDAC code.

	Ingo