linux-kernel - RE: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace


Message-ID: <3908561D78D1C84285E8C5FCA982C28F3292AE0D@ORSMSX114.amr.corp.intel.com>
Date:	Wed, 12 Nov 2014 17:17:55 +0000
From:	"Luck, Tony" <tony.luck@...el.com>
To:	Borislav Petkov <bp@...en8.de>,
	Andy Lutomirski <luto@...capital.net>
CC:	Andi Kleen <andi@...stfloor.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	X86 ML <x86@...nel.org>, Peter Zijlstra <peterz@...radead.org>,
	Oleg Nesterov <oleg@...hat.com>
Subject: RE: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from
 userspace

> Not that easy for testing the #MC path - there we have to inject real
> MCEs and then noodle through the memory_failure() code. I'd be very much
> interested to see what would happen if two MCEs happen back-to-back with
> your change, the second one being raised when we're on the kernel stack
> and in memory_failure()...

If the second one hits before we clear MCG_STATUS, then the processor resets.

If the second one is caused by the recovery thread somewhere in memory_failure(),
then Andy won't switch stacks - but we will declare this a fatal error and panic (we have
no recovery from machine checks in the kernel).

Otherwise the memory_failure() thread is the innocent bystander. If the affected thread
decides to do recovery, then the first thread will be allowed to return and continue.

I might worry a bit if the second error is another thread hitting the *same* page which
hasn't finished processing yet ... then the second will chase along behind the first trying
to fix the same problem.  I *think* the first will complete and the second will just end
up here:

	if (TestSetPageHWPoison(p)) {
		printk(KERN_ERR "MCE %#lx: already hardware poisoned\n", pfn);
		return 0;
	}

which is really early in memory_failure().
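
To illustrate why the second reporter should bail out there, here is a minimal
userspace sketch of the test-and-set pattern. It is not the kernel code:
struct fake_page, test_set_hwpoison(), fake_memory_failure() and the enum are
hypothetical names made up for this example, and a C11 atomic_flag stands in
for the real page flag. The distinct return values are also an assumption, used
only to make the two outcomes visible (the snippet above returns 0).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for struct page: only the poison flag is modelled. */
struct fake_page {
	atomic_flag hwpoison;
};

#define FAKE_PAGE_INIT { ATOMIC_FLAG_INIT }

/* Hypothetical result codes, used only to tell the two outcomes apart. */
enum fake_mf_result {
	FAKE_MF_HANDLED,          /* this caller was the first reporter */
	FAKE_MF_ALREADY_POISONED  /* someone else already claimed the page */
};

/* Atomically set the flag and return the value it had before. */
static bool test_set_hwpoison(struct fake_page *p)
{
	assert(p != NULL);
	return atomic_flag_test_and_set(&p->hwpoison);
}

/* Sketch of the early exit: only the first reporter gets past the check. */
static enum fake_mf_result fake_memory_failure(struct fake_page *p,
					       unsigned long pfn)
{
	if (test_set_hwpoison(p)) {
		fprintf(stderr, "MCE %#lx: already hardware poisoned\n", pfn);
		return FAKE_MF_ALREADY_POISONED;
	}

	/* First reporter: the real recovery work would happen here. */
	return FAKE_MF_HANDLED;
}
```

Because the flag is set and read in one atomic step, two threads racing on the
same page cannot both see it as clear, so at most one of them goes on to do the
recovery work.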

-Tony