lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Mon, 15 Jun 2009 09:09:14 +0200
From:	Andi Kleen <andi@...stfloor.org>
To:	Nick Piggin <npiggin@...e.de>
Cc:	Wu Fengguang <fengguang.wu@...el.com>,
	Balbir Singh <balbir@...ux.vnet.ibm.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	LKML <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...e.hu>,
	Mel Gorman <mel@....ul.ie>,
	Thomas Gleixner <tglx@...utronix.de>,
	"H. Peter Anvin" <hpa@...or.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Hugh Dickins <hugh.dickins@...cali.co.uk>,
	Andi Kleen <andi@...stfloor.org>,
	"riel@...hat.com" <riel@...hat.com>,
	"chris.mason@...cle.com" <chris.mason@...cle.com>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

On Mon, Jun 15, 2009 at 08:44:47AM +0200, Nick Piggin wrote:
> > 
> > So IMHO it's OK for .31 as long as we agree on the user interfaces,
> > ie. /proc/sys/vm/memory_failure_early_kill and the hwpoison uevent.
> > 
> > It comes a long way through numerous reviews, and I believe all the
> > important issues and concerns have been addressed. Nick, Rik, Hugh,
> > Ingo, ... what are your opinions? Is the uevent good enough to meet
> > your request to "die hard" or "die gracefully" or whatever on memory
> > failure events?
> 
> Uevent? As in, send a message to userspace? I don't think this
> would be ideal for a fail-stop/failover situation.

Agreed.

For failover you typically want a application level heartbeat anyways
to guard against user space software problems and if there's a kill then it
would catch it. Also again in you want to check against all corruptions you
have to do it in the low level handler or better watch corrected
events too to predict failures (but the later is quite hard to do generally). 
To some extent the first is already implemented on x86, e.g. set
the tolerance level to 0 will give more aggressive panics.

> I can't see a good reason to rush to merge it.

The low level x86 code for MCA recovery is in, just this high level
part is missing to kill the correct process. I think it would be good to merge 
a core now.  The basic code seems to be also as well tested as we can do it 
right now and exposing it to more users would be good. It's undoubtedly not 
perfect yet, but that's not a requirement for merge.

There's a lot of fancy stuff that could be done in addition,
but that's not really needed right now and for a lot of the fancy
ideas (I have enough on my own :) it's dubious they are actually
worth it.

> IMO the userspace-visible changes have maybe not been considered
> too thoroughly, which is what I'd be most worried about. I probably
> missed seeing documentation of exact semantics and situations
> where admins should tune things one way or the other.

There's only a single tunable anyways, early kill vs late kill.

For KVM you need early kill, for the others it remains to be seen.

> I hope it is going to be merged with an easy-to-use fault injector,
> because that is the only way Joe kernel developer is ever going to
> test it.

See patches 13 and 14. In addition there's another low level x86
injector too.

There's also a test suite available (mce-test on kernel.org git)

-Andi
-- 
ak@...ux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ