[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20110529191055.GC9835@elte.hu>
Date: Sun, 29 May 2011 21:10:55 +0200
From: Ingo Molnar <mingo@...e.hu>
To: Andy Lutomirski <luto@....EDU>
Cc: Thomas Gleixner <tglx@...utronix.de>, x86@...nel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 4/5] x86-64: Replace vsyscall gettimeofday fallback with
int 0xcc
* Andy Lutomirski <luto@....EDU> wrote:
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -1121,6 +1121,8 @@ zeroentry spurious_interrupt_bug do_spurious_interrupt_bug
> zeroentry coprocessor_error do_coprocessor_error
> errorentry alignment_check do_alignment_check
> zeroentry simd_coprocessor_error do_simd_coprocessor_error
> +zeroentry intcc do_intcc
> +
>
> /* Reload gs selector with exception handling */
> /* edi: new selector */
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
I forgot to reply to your prior question about zeroentry vs.
paranoidzeroentry.
That distinction is an undocumented x86-64-ism.
Background:
The SWAPGS instruction is rather fragile: it must nest perfectly and
only in single depth, it should only be used if entering from user
mode to kernel mode and then when returning to user-space, and
precisely so. If we mess that up even slightly, we crash.
So when we have a secondary entry, already in kernel mode, we *must
not* use SWAPGS blindly - nor must we forget doing a SWAPGS when it's
not switched/swapped yet.
Now, there's a secondary complication: there's a cheap way to test
which mode the CPU is in and an expensive way.
The cheap way is to pick this info off the entry frame on the kernel
stack, from the CS of the ptregs area of the kernel stack:
xorl %ebx,%ebx
testl $3,CS+8(%rsp)
je error_kernelspace
SWAPGS
The expensive (paranoid) way is to read back the MSR_GS_BASE value
(which is what SWAPGS modifies):
movl $1,%ebx
movl $MSR_GS_BASE,%ecx
rdmsr
testl %edx,%edx
js 1f /* negative -> in kernel */
SWAPGS
xorl %ebx,%ebx
1: ret
and the whole paranoid non-paranoid macro complexity is about whether
to suffer that RDMSR cost.
If we are at an interrupt or user-trap/gate-alike boundary then we
can use the faster check: the stack will be a reliable indicator of
whether SWAPGS was already done: if we see that we are a secondary
entry interrupting kernel mode execution, then we know that the GS
base has already been switched. If it says that we interrupted
user-space execution then we must do the SWAPGS.
But if we are in an NMI/MCE/DEBUG/whatever super-atomic entry
context, which might have triggered right after a normal entry wrote
CS to the stack but before we executed SWAPGS, then the only safe way
to check for GS is the slower method: the RDMSR.
So we try only to mark those entry methods 'paranoid' that absolutely
need the more expensive check for the GS base - and we generate all
'normal' entry points with the regular (faster) entry macros.
I hope this explains!
All in one, your zeroentry choice should be fine: INT 0xCC will not
issue in NMI context.
Btw, as a sidenote, and since you are already touching this code,
would you be interested in putting this explanation into the source
code? It's certainly not obvious and whoever wrote those macros did
not think of documenting them for later generations ;-)
> +++ b/arch/x86/kernel/traps.c
> @@ -872,6 +872,10 @@ void __init trap_init(void)
> set_bit(SYSCALL_VECTOR, used_vectors);
> #endif
>
> + set_system_intr_gate(0xCC, &intcc);
> + set_bit(0xCC, used_vectors);
> + printk(KERN_ERR "intcc gate isntalled\n");
I think you mentioned it but i cannot remember your reasoning why you
marked it 0xcc (and not closer to the existing syscall vector) -
please add a comment about it into the source code as well.
Ok, i suspect you marked it 0xCC because that's the INT3 instruction
- not very useful for exploits?
> +void dotraplinkage do_intcc(struct pt_regs *regs, long error_code)
> +{
> + /* Kernel code must never get here. */
> + if (!user_mode(regs))
> + BUG();
Nit: you can use BUG_ON() for that.
> + local_irq_enable();
> +
> + if (!in_vsyscall_page(regs->ip)) {
> + struct task_struct *tsk = current;
> + if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
Nit: please put an empty new line between local variable definitions
and the first statement that follows - we do this for visual clarity.
A not-so-nit: i'd not limit this message to unhandled signals alone.
An attacker could install a SIGSEGV handler, send a SIGSEGV and
attempt the exploit right then - he'll get a free attempt with no
logging performed, right?.
> + printk_ratelimit()) {
> + printk(KERN_INFO
> + "%s[%d] illegal int $0xCC ip:%lx sp:%lx",
> + tsk->comm, task_pid_nr(tsk),
> + regs->ip, regs->sp);
I'd suggest putting the text 'exploit attempt?' into the printk
somewhere - a sysadmin might not necessarily know what an illegal int
$0xCC is..
> + print_vma_addr(" in ", regs->ip);
> + printk("\n");
> + }
> +
> + force_sig(SIGSEGV, current);
> + return;
> + }
> +
> + if (current->seccomp.mode) {
> + do_exit(SIGKILL);
> + return;
> + }
> +
> + regs->ax = sys_gettimeofday((struct timeval __user *)regs->di, NULL);
Does the vsyscall gettimeofday ignore the zone parameter too?
> +
> + local_irq_disable();
> + return;
> +}
Nit: no need for a 'return;' at the end of a void function.
Thanks,
Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists