Message-ID: <alpine.LNX.2.00.1404291150200.8903@pobox.suse.cz>
Date:	Tue, 29 Apr 2014 15:05:55 +0200 (CEST)
From:	Jiri Kosina <jkosina@...e.cz>
To:	Steven Rostedt <rostedt@...dmis.org>,
	"H. Peter Anvin" <hpa@...ux.intel.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>
cc:	linux-kernel@...r.kernel.org, x86@...nel.org,
	Salman Qazi <sqazi@...gle.com>, Ingo Molnar <mingo@...e.hu>,
	Michal Hocko <mhocko@...e.cz>, Borislav Petkov <bp@...en8.de>,
	Vojtech Pavlik <vojtech@...e.cz>,
	Petr Tesarik <ptesarik@...e.cz>, Petr Mladek <pmladek@...e.cz>
Subject: 64bit x86: NMI nesting still buggy?

Hi,

so while debugging some hard-to-explain hangs in the past, we have kept 
coming back to the NMI nesting disaster, and I tend to believe that 
Steven's fixup (for the most part introduced in 3f3c8b8c ("x86: Add 
workaround to NMI iret woes")) makes the race *much* smaller, but doesn't 
fix it completely (it basically reduces the race to the few instructions 
in first_nmi which do the stack preparatory work).

According to section 38.4 of [1], when SMM is entered while the CPU is 
handling an NMI, the end result might be that, upon exit from SMM, NMIs 
are re-enabled and a latched NMI is delivered as a nested one [2].

This is handled well by playing the frame-saving and flag-setting games in 
`first_nmi' / `nested_nmi' / `repeat_nmi' (and that also works flawlessly 
in cases where an exception or breakpoint triggers some time later during 
NMI handling, after all the 'nested' setup has been done).
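
Just to make sure we are talking about the same thing, here is a very 
rough C model of that scheme as I read it (names, layout and the repeat 
loop are mine and purely illustrative; the real thing lives in 
entry_64.S and keeps its state at fixed offsets on the NMI stack):

/*
 * Illustrative C model only -- not the actual entry_64.S code.  The
 * real variables live on the NMI stack; here they are plain globals.
 */
#include <stdbool.h>

struct iret_frame { unsigned long ip, cs, flags, sp, ss; };

static bool nmi_executing;		/* "we are inside an NMI" flag */
static bool nmi_must_repeat;		/* a nested NMI was latched */
static struct iret_frame saved_frame;	/* copy used by the final iret */

/* first_nmi: copy the hardware-pushed frame away and set the flag, so a
 * later nested NMI can be detected and the frame can't be clobbered. */
static void first_nmi(const struct iret_frame *hw_frame)
{
	saved_frame = *hw_frame;
	nmi_executing = true;
}

/* nested_nmi: don't run the handler, just ask the outer NMI to repeat. */
static void nested_nmi(void)
{
	nmi_must_repeat = true;
}

static void nmi_entry(const struct iret_frame *hw_frame, void (*handler)(void))
{
	if (nmi_executing) {		/* nested? */
		nested_nmi();
		return;
	}

	first_nmi(hw_frame);
	do {				/* repeat_nmi: rerun for latched NMIs */
		nmi_must_repeat = false;
		handler();
	} while (nmi_must_repeat);
	nmi_executing = false;
	/* the final iret uses saved_frame, not the original frame location */
}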

There is unfortunately a small race window which, I believe, is not 
covered by this:

	- 1st NMI triggers
	- SMM is entered very shortly afterwards, even before `first_nmi' 
	  was able to do its job
	- 2nd NMI is latched
	- SMM exits with NMIs re-enabled (see [2]) and 2nd NMI triggers
	- 2nd NMI gets handled properly, exits with iret
	- iret returns to the place where the 1st NMI's handling was 
	  interrupted, but the return address on the stack where the iret 
	  from the 1st NMI should eventually return to is gone, and the 
	  'saved/copy' locations on the stack don't contain the correct 
	  frame either
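
The underlying mechanics is that with IST the CPU reloads RSP from the 
TSS on every NMI delivery, so every NMI's iret frame is pushed at the 
very same location.  A tiny stand-alone illustration of just that part 
(the addresses are of course made up):

/*
 * Stand-alone illustration of the clobbering.  ist_top stands in for
 * the fixed top of the NMI IST stack; 0x1000 / 0x2000 are made-up
 * return addresses.
 */
#include <stdio.h>

struct iret_frame { unsigned long ip, cs, flags, sp, ss; };

static struct iret_frame ist_top;	/* fixed slot reused by every NMI */

static void hw_deliver_nmi(unsigned long interrupted_ip)
{
	/* IST: RSP is reloaded from the TSS, so the frame always lands here */
	ist_top = (struct iret_frame){ .ip = interrupted_ip };
}

int main(void)
{
	hw_deliver_nmi(0x1000);	/* 1st NMI, kernel interrupted at 0x1000 */
	/* SMM entered and left with NMIs re-enabled before first_nmi
	 * could copy the frame away ... */
	hw_deliver_nmi(0x2000);	/* 2nd NMI overwrites the 1st frame */

	/* the 1st NMI's return address (0x1000) is gone for good */
	printf("final iret would return to %#lx\n", ist_top.ip);
	return 0;
}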

The race window is very small and it's hard to trigger SMM in a 
deterministic way, so it's probably very difficult to hit. But I wouldn't 
be surprised if it triggered occasionally in the wild, and the resulting 
problems were never root-caused (as the problem is very rare, not 
reproducible, and probably doesn't happen on the same system more than 
once in a lifetime).

We were not able to come up with any other fix than avoiding IST 
completely on x86_64, and instead going back to switching stacks in 
software -- the same way 32bit x86 does.
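
Just to illustrate what I mean by switching stacks in software (very 
rough sketch, the names are made up): the entry code itself picks the 
stack, so a nested NMI simply keeps growing the stack it is already on 
instead of being forced back to a fixed top-of-stack:

/*
 * Rough sketch only; nmi_stack / NMI_STACK_SIZE are made-up names.
 */
#include <stdint.h>
#include <stdbool.h>

#define NMI_STACK_SIZE	4096
static uint8_t nmi_stack[NMI_STACK_SIZE];

static uint8_t *nmi_pick_stack(uint8_t *current_sp)
{
	bool on_nmi_stack = current_sp >= nmi_stack &&
			    current_sp <  nmi_stack + NMI_STACK_SIZE;

	if (on_nmi_stack)
		return current_sp;		/* nested: stay where we are */
	return nmi_stack + NMI_STACK_SIZE;	/* first NMI: switch to top */
}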

So basically, I have two questions:

(1) is the above analysis correct? (if not, why?)
(2) if it is correct, is there any other option for a fix than avoiding 
    the use of IST for exception stack switching, and having the kernel 
    do the legacy task switching (the same way x86_32 does)?

[1] http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf

[2] 	"A special case can occur if an SMI handler nests inside an NMI 
	 handler and then another NMI occurs. During NMI interrupt 
	 handling, NMI interrupts are disabled, so normally NMI interrupts 
 	 are serviced and completed with an IRET instruction one at a 
	 time. When the processor enters SMM while executing an NMI 
	 handler, the processor saves the SMRAM state save map but does 
	 not save the attribute to keep NMI interrupts disabled. 
	 Potentially, an NMI could be latched (while in SMM or upon exit) 
	 and serviced upon exit of SMM even though the previous NMI  
	 handler has still not completed."

-- 
Jiri Kosina
SUSE Labs