Message-ID: <CALCETrU=fWvyOf-yWG=UQL4jfhbp1vwzPpBd+eeTLjk94xX+8A@mail.gmail.com>
Date: Tue, 24 Mar 2015 14:40:56 -0700
From: Andy Lutomirski <luto@...capital.net>
To: Denys Vlasenko <dvlasenk@...hat.com>
Cc: Brian Gerst <brgerst@...il.com>, Ingo Molnar <mingo@...nel.org>,
Denys Vlasenko <vda.linux@...glemail.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Steven Rostedt <rostedt@...dmis.org>,
Borislav Petkov <bp@...en8.de>,
"H. Peter Anvin" <hpa@...or.com>, Oleg Nesterov <oleg@...hat.com>,
Frederic Weisbecker <fweisbec@...il.com>,
Alexei Starovoitov <ast@...mgrid.com>,
Will Drewry <wad@...omium.org>,
Kees Cook <keescook@...omium.org>, X86 ML <x86@...nel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] x86: vdso32/syscall.S: do not load __USER32_DS to %ss
On Tue, Mar 24, 2015 at 1:17 PM, Denys Vlasenko <dvlasenk@...hat.com> wrote:
> On 03/24/2015 05:55 PM, Brian Gerst wrote:
>>>> Might be nice to place a more generic description there, which
>>>> registers are expected to be saved by user-space calling in here, etc.
>>>
>>> __kernel_vsyscall entry point has the same ABI in any 32-bit vDSO,
>>> the good old int 0x80 calling convention:
>>>
>>> syscall# in eax,
>>> params in ebx/ecx/edx/esi/edi/ebp,
>>> all registers are preserved by the syscall.
>>>
>>> (I think we don't guarantee that all flags are preserved:
>>> I have a testcase where DF gets cleared).
>>
>> DF should always be clear on any function call per the C ABI. But,
>> eflags should be preserved, at least the non-privileged bits. I'd
>> like to see that testcase.
>
> The testcase is a simplistic example of how to find and use
> 32-bit vDSO to perform system calls.
>
> It also sets flags.DF before syscall, and checks whether registers
> are preserved, including flags.DF.
>
> On 32-bit kernel (on Intel CPU, where vDSO uses SYSENTER), I see this:
>
> $ ./test32_syscall_vdso
> Result:1
>
> whereas on 64-bit it is
>
> $ ./test32_syscall_vdso
> Result:0
>
> "Result:1" means that DF was cleared.
>
> See attached source.
>
The syscall and sysenter stuff is IMO really nasty. Here's how I'd
like it to work:
When you do "call __kernel_vsyscall", I want the net effect to be that
your eax, ebx, ecx, edx, esi, edi, and ebp at the time of the call end
up *verbatim* in pt_regs. Your eip and esp should be such that, if we
iret normally using pt_regs, we end up returning correctly to
userspace. I want this to be true *regardless* of whether we're doing
a fast-path or slow-path system call.
This means that we have, literally (see below for why ret $4):
	int $0x80
	ret $4		<-- regs->eip points here
Then we add an opportunistic return trampoline: if a special ti flag
is set (which we set on entry here) and the return eip and regs are
appropriate, then we change the return at the last minute to vdso code
that looks like:
	popl %ecx
	popl %edx
	ret
Obviously, to do this, we need to copy regs->ecx and regs->edx to the
appropriate places. If we've been traced or other funny business is
going on (TIF_NOTIFY_RESUME or such is set), then we just skip the
optimization entirely. Everything still works.
This is probably slower than the current code, but I expect it would
be fast enough, and it should be considerably more obviously correct
than the current disaster.
Now we can do a fun hack on top. On Intel, we have sysenter/sysexit
and, on AMD, we have syscall/sysretl. But, if I read the docs right,
Intel has sysretl, too. So we can ditch sysexit entirely, since this
mechanism no longer has any need to keep the entry and exit
conventions matching.
The vdso code would be something like (so untested it's not even funny):
__kernel_vsyscall:
	ALTERNATIVE_2(something or other)

__kernel_vsyscall_for_intel:
	pushl %edx
	pushl %ecx
	sysenter
	hlt		<-- just for clarity

__kernel_vsyscall_for_amd:
	pushl %ecx
	syscall

__vsyscall_after_syscall_insn:
	ret $4		<-- for binary tracers only

__kernel_vsyscall_for_int80:
	int $0x80	<-- regs->eip points here during *all* vsyscalls

__kernel_vsyscall_slow_ret:
	ret $4

__kernel_vsyscall_sysretl_target:
	popl %ecx
	ret
There is no sysexit. Take that, Intel.
On sysenter, we copy regs->cx and regs->dx from user memory and then
we increment regs->sp by 4 and point regs->eip to
__kernel_vsyscall_for_int80. On syscall, we copy regs->cx from user
memory and point regs->eip to __kernel_vsyscall_for_int80.
On opportunistic sysretl, we do:
	*regs->sp = regs->cx;	/* put_user or whatever */
	regs->eip = __kernel_vsyscall_sysretl_target;
	...
	sysretl
We never do sysexit or sysretl in any other code path. That is, there
is no really fast path anymore.
On AMD, we could be polite and only do the opportunistic sysretl if
regs->eip started out pointing to __vsyscall_after_syscall_insn.
Thoughts?
I'm not planning on implementing this in the very near future, though.
--Andy