linux-kernel - Re: x86/mce merge, integration hickup + crash, design thoughts


Message-ID: <b3ece790901141132u28ba2482h2e7af7bd51224f2a@mail.gmail.com>
Date:	Wed, 14 Jan 2009 11:32:51 -0800
From:	Tim Hockin <thockin@...il.com>
To:	Andi Kleen <ak@...ux.intel.com>
Cc:	Ingo Molnar <mingo@...e.hu>, Thomas Gleixner <tglx@...utronix.de>,
	linux-kernel@...r.kernel.org, "H. Peter Anvin" <hpa@...or.com>,
	ying.huang@...el.com, Aaron Durbin <adurbin@...il.com>,
	priyankag@...gle.com
Subject: Re: x86/mce merge, integration hickup + crash, design thoughts

On Wed, Jan 14, 2009 at 10:05 AM, Andi Kleen <ak@...ux.intel.com> wrote:
>
>>
>> From my point of view: a single, consistent, easy logging interface
>> for the kernel to send *structured data* about hardware/system events
>> and errors up to userspace.
>
> Which kinds of events were you thinking of?
>
> So far we managed by cramming some other CPU events like thermal
> trip into "pseudo banks" in struct mce. Admittedly it's not the
> most pretty solution in the world, but it worked.

Yeah, no offense, but that's horrible :)

Ideally, I'd rather see a more generic conduit for all sorts of
events.  Polled and exception MCEs.  Thermal interrupts.  MCE
threshold interrupts.  EDAC polled errors.  PCI-express errors.  SATA
disk timeouts.
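
As a rough sketch of what one record on such a generic conduit might
look like: every name below (hw_event, hw_event_source,
hw_event_format) is hypothetical and made up for illustration; none of
it is an existing kernel interface.  The point is only that every
source fills in the same structured record, so userspace needs a
single parser.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical event sources, mirroring the list above. */
enum hw_event_source {
    HW_EVENT_MCE_EXCEPTION,
    HW_EVENT_MCE_POLLED,
    HW_EVENT_THERMAL,
    HW_EVENT_MCE_THRESHOLD,
    HW_EVENT_EDAC,
    HW_EVENT_PCIE,
    HW_EVENT_SATA_TIMEOUT,
    HW_EVENT_SOURCE_COUNT
};

/* One structured record, identical in layout for every source (assumed). */
struct hw_event {
    uint64_t timestamp_ns;        /* when the event was observed */
    enum hw_event_source source;  /* which subsystem reported it */
    uint32_t unit;                /* CPU, memory controller, device, ... */
    uint64_t payload;             /* source-specific raw value */
};

/* Map a source to a short name; out-of-range values yield "unknown". */
static const char *hw_event_source_name(enum hw_event_source s)
{
    static const char *const names[HW_EVENT_SOURCE_COUNT] = {
        "mce-exception", "mce-polled", "thermal", "mce-threshold",
        "edac", "pcie", "sata-timeout"
    };

    if ((unsigned int)s >= HW_EVENT_SOURCE_COUNT)
        return "unknown";
    return names[s];
}

/* Render an event as one key=value line; returns snprintf()'s result. */
int hw_event_format(const struct hw_event *ev, char *buf, size_t len)
{
    assert(ev != NULL && buf != NULL);
    return snprintf(buf, len, "ts=%llu source=%s unit=%u payload=0x%llx",
                    (unsigned long long)ev->timestamp_ns,
                    hw_event_source_name(ev->source),
                    (unsigned int)ev->unit,
                    (unsigned long long)ev->payload);
}
```

A flat key=value line is just one possible encoding; the same record
could equally be carried as a binary message.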

Now I know there are different conduits for some events - netlink
tells me about netif link up/down events, I think.  I would settle for
a small number of interfaces.  What I don't want is what we have today
- EVERYTHING has a different interface.  Some are poll()-able.  Some
have to be actively polled.  Some need a daemon listening or else
messages are dropped.  Some require parsing logs.  Puke.

Put it this way:  Given a thousand machines, I want to gather,
collate, and correlate all these events.  I want to be able to produce
a "life story" of sorts for a machine and for a data center.  Once I
can do that, I can start to make predictive diagnoses more accurately,
and I can know how much these things actually COST us.

Tim