linux-kernel - Re: Sysenter crash with Nested Task Bit set

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Pine.LNX.4.64.0609180841520.4388@g5.osdl.org>
Date:	Mon, 18 Sep 2006 09:02:51 -0700 (PDT)
From:	Linus Torvalds <torvalds@...l.org>
To:	Andi Kleen <ak@...e.de>
cc:	Andrew Morton <akpm@...l.org>,
	Chuck Ebbert <76306.1226@...puserve.com>,
	In Cognito <defend.the.world@...il.com>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	Ingo Molnar <mingo@...e.hu>, bcrl@...ck.org
Subject: Re: Sysenter crash with Nested Task Bit set



On Mon, 18 Sep 2006, Andi Kleen wrote:
> 
> > If we fix it in the task-switch code, we shouldn't need any other changes 
> > (ie Chuck's change is unnecessary too), because then the process that sets 
> > NT will happily die (with NT set), but switch away to something else and 
> > nobody else will be affected.
> 
> Won't it die in the kernel with an oops on the next interrupt?

No. As mentioned, "sysenter" is really special. It should really be the 
_only_ entry into the kernel that doesn't change eflags. Everything else 
clears NT (and most of them do other things too).

So if a process already runs in kernel mode (or used mode, for that 
matter) with NT set, and an interrupt happens, the act of taking the 
interrupt will clear NT, and nothing bad happens at all. In fact, we very 
much depend on this, exactly because otherwise user mode could just set NT 
and just wait for an interrupt, and bad things would happen. They 
obviously don't.

So it's literally _only_ the path of

	set NT
	sysenter
	...
	iret

that causes problems, because all other paths will clear NT on entry into 
the kernel.

> > So if I'm right, then this patch _should_ fix it. UNTESTED (and the 
> > "ref_from_fork" special case doesn't clear NT, so it's strictly incompete, 
> > but maybe somebody can test this?)
> 
> Are you sure this handles interrupts or nested syscalls 
> before the context switch correctly?

Yeah, see above. And I have now even tested it slightly (ie I ran one of 
my x86 machines with that patch).

> I think it really needs to be handled in the sysenter path.

It really would be much more expensive there (well, the expense would be 
the same, but any load that have any amount of either would tend to have 
many more system calls than context switches).

The only way to have more context switches than system calls is to run 
entirely in user space all the time, and then we don't care - the context 
switches will also be so rare that the extra cycles simply don't matter.

> > Andi? I don't know if x86-64 honors NT in 64-bit mode, but if it does, it 
> > needs something similar (assuming this works).
> 
> It doesn't task switch, but you would get a #GP in IRET at least.
> Leaking that to another process is definitely not good.

Right. Then you need that exact same thing on x86-64 too.

One final note: as I already mentioned, this isn't actually entirely 
sufficient. There's the magic special case of "switch to a newly created 
thread", which jumps to "ret_from_fork" rather than staying within that 
small area. We'll need to add "clear NT" there.

So this (UNTESTED - I tested the previous version, and it works, but this 
extends on it) second patch should be more complete. It handles the case 
where the NT-dirty task context switches to a newly created task, by just 
forcing "eflags" to a known value in the newly created task, rather than 
whatever value it had at the time of the context switch.

The addition is fairly obvious, but maybe I screwed something up, so buyer 
beware...

		Linus

---
diff --git a/arch/i386/kernel/entry.S b/arch/i386/kernel/entry.S
index 37a7d2e..87f9f60 100644
--- a/arch/i386/kernel/entry.S
+++ b/arch/i386/kernel/entry.S
@@ -209,6 +209,10 @@ ENTRY(ret_from_fork)
 	GET_THREAD_INFO(%ebp)
 	popl %eax
 	CFI_ADJUST_CFA_OFFSET -4
+	pushl $0x0202			# Reset kernel eflags
+	CFI_ADJUST_CFA_OFFSET 4
+	popfl
+	CFI_ADJUST_CFA_OFFSET -4
 	jmp syscall_exit
 	CFI_ENDPROC
 
diff --git a/include/asm-i386/system.h b/include/asm-i386/system.h
index 49928eb..defbf12 100644
--- a/include/asm-i386/system.h
+++ b/include/asm-i386/system.h
@@ -13,7 +13,8 @@ extern struct task_struct * FASTCALL(__s
 
 #define switch_to(prev,next,last) do {					\
 	unsigned long esi,edi;						\
-	asm volatile("pushl %%ebp\n\t"					\
+	asm volatile("pushfl\n\t"		/* Save flags */	\
+		     "pushl %%ebp\n\t"					\
 		     "movl %%esp,%0\n\t"	/* save ESP */		\
 		     "movl %5,%%esp\n\t"	/* restore ESP */	\
 		     "movl $1f,%1\n\t"		/* save EIP */		\
@@ -21,6 +22,7 @@ #define switch_to(prev,next,last) do {		
 		     "jmp __switch_to\n"				\
 		     "1:\t"						\
 		     "popl %%ebp\n\t"					\
+		     "popfl"						\
 		     :"=m" (prev->thread.esp),"=m" (prev->thread.eip),	\
 		      "=a" (last),"=S" (esi),"=D" (edi)			\
 		     :"m" (next->thread.esp),"m" (next->thread.eip),	\
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/