linux-kernel - Re: [NAK] Re: [PATCH -v2 9/9] ACPI, APEI, Generic Hardware Error Source POLL/IRQ/NMI notification type support


Message-ID: <20101025134740.GA8888@elte.hu>
Date:	Mon, 25 Oct 2010 15:47:40 +0200
From:	Ingo Molnar <mingo@...e.hu>
To:	Andi Kleen <andi@...stfloor.org>
Cc:	Huang Ying <ying.huang@...el.com>, Len Brown <lenb@...nel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
	Borislav Petkov <petkovbb@...glemail.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	"H. Peter Anvin" <hpa@...or.com>, Don Zickus <dzickus@...hat.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Mauro Carvalho Chehab <mchehab@...hat.com>,
	Arjan van de Ven <arjan@...radead.org>
Subject: Re: [NAK] Re: [PATCH -v2 9/9] ACPI, APEI, Generic Hardware Error
 Source POLL/IRQ/NMI notification type support

* Andi Kleen <andi@...stfloor.org> wrote:

> On Mon, Oct 25, 2010 at 02:55:31PM +0200, Ingo Molnar wrote:
> > 
> > * Andi Kleen <andi@...stfloor.org> wrote:
> > 
> > > On Mon, Oct 25, 2010 at 01:15:30PM +0200, Ingo Molnar wrote:
> > > 
> > > > > > > einj.c: it's about the 3rd separate 'error injection' concept that got 
> > > > > > > introduced ...
> > > > > > 
> > > > > > > EINJ is a true platform feature, not just a software feature. We need to support 
> > > > > > > it to debug various hardware error features.
> > > > > 
> > > > > Also having multiple error injecting interfaces is a good thing.
> > > > 
> > > > It's never a good thing to have separate, vendor dependent interfaces for what 
> > > > to the user is basically the same conceptual thing!
> > > 
> > > Perhaps a simple example (simplified, in practice there are more complications) 
> > > makes it more clear:
> > > 
> > > > The memory error handler takes different actions depending on the state of the 
> > > > page on which the error occurs.
> > 
> > What you appear to be arguing for is the ability to inject different types of 
> > events.
> 
> Different events in different contexts with different drivers with different 
> parameters [...]

Correct.

> [...] using different tools.

That's possible, but I'd expect tools/ras/ to be populated with uniformly working 
tools. There's little sense in fragmenting the hw-testing field...

> Commonality: about 0% except there's "error" somewhere in the description.

Wrong. Their main purpose is common: they are events attached to existing hardware 
topologies, which events can be configured, which events can be received and which 
can be injected with attributes for rare-event simulation purposes.

The tool people have told us loud and clear that they want to _receive_ events 
in a unified and structured way - not via lots of separate ABIs from facilities that 
have mismatching capabilities.

We want to be able to inject _other_ events as well, not just hw-error ones - 
especially rare ones.

I.e. there's clear, demonstrated, patches-pending demand for uniformity and there's 
similar demand for a broader concept.

You are now making the point that somehow the receipt and sending/injecting of 'hw 
errors on Intel hardware' should be a separate, fragmented, disoriented, messy piece 
of interface design, closely matching some ACPI spec detail, which should be 
disassociated from the preferred mechanism of error reporting?

Your argument makes absolutely no sense to me.

The kernel is an abstraction machine: common hw aspects should be generalized to the 
extent it makes sense, with reasonable extensions for anything we don't want to (or 
cannot) generalize.

There's _tons_ of interesting structure here to be taken advantage of: just look at 
what Boris is trying to achieve with his EDAC tooling patches. See what Lin Ming is 
trying to do by moving event descriptors to /sys, so that events can come with 
elements of our hw and sw topology in a natural way.

There is absolutely no justification whatsoever for the new /dev/erst-dbg ABI ...

Furthermore, you have ignored my other argument for the second time now: why does 
this code not do the event _reporting_ via the facilities we use and prefer? As far 
as users are concerned, the ability to receive hardware error events in a unified 
way is an even more important aspect than the matter of event injection.

Once you do that I think you will see how naturally error injection fits into the 
picture as well. It is an aspect of pretty much any event (not just hw-error events) 
that we want to be able to 'inject/simulate' them, to test tools.

Your refusal to even consider this possibility and to look at the EDAC/RAS patches 
that deal with this is puzzling to me.

Thanks,

	Ingo