linux-kernel - Re: [PATCH v2] x86/vsyscall: allow seccomp filter in vsyscall=emulate

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:	Fri, 13 Jul 2012 19:48:09 -0500
From:	Will Drewry <wad@...omium.org>
To:	Andrew Lutomirski <luto@....edu>
Cc:	linux-kernel@...r.kernel.org, torvalds@...ux-foundation.org,
	qmewlo@...il.com, eparis@...hat.com, keescook@...omium.org,
	james.l.morris@...cle.com, hpa@...or.com, cevans@...omium.org
Subject: Re: [PATCH v2] x86/vsyscall: allow seccomp filter in vsyscall=emulate

On Fri, Jul 13, 2012 at 6:00 PM, Andrew Lutomirski <luto@....edu> wrote:
> On Fri, Jul 13, 2012 at 10:06 AM, Will Drewry <wad@...omium.org> wrote:
>> If a seccomp filter program is installed, older static binaries and
>> distributions with older libc implementations (glibc 2.13 and earlier)
>> that rely on vsyscall use will be terminated regardless of the filter
>> program policy when executing time, gettimeofday, or getcpu.  This is
>> only the case when vsyscall emulation is in use (vsyscall=emulate is the
>> default).
>>
>> This patch emulates system call entry inside a vsyscall=emulate by
>> populating regs->ax and regs->orig_ax with the system call number prior
>> to calling into seccomp such that all seccomp-dependencies function
>> normally.  Additionally, system call return behavior is emulated in line
>> with other vsyscall entrypoints for the trace/trap cases.
>>
>> Reported-by: Owen Kibel <qmewlo@...il.com>
>> Signed-off-by: Will Drewry <wad@...omium.org>
>>
>> v2: - fixed ip and sp on SECCOMP_RET_TRAP/TRACE (thanks to luto@....edu)
>
>> @@ -253,6 +273,12 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
>>
>>         current_thread_info()->sig_on_uaccess_error = prev_sig_on_uaccess_error;
>>
>> +       if (skip) {
>> +               if ((long)regs->ax <= 0L) /* seccomp errno emulation */
>> +                       goto do_ret;
>> +               goto done; /* seccomp trace/trap */
>> +       }
>> +
>>         if (ret == -EFAULT) {
>>                 /* Bad news -- userspace fed a bad pointer to a vsyscall. */
>>                 warn_bad_vsyscall(KERN_INFO, regs,
>> @@ -271,10 +297,11 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
>>
>>         regs->ax = ret;
>>
>> +do_ret:
>>         /* Emulate a ret instruction. */
>>         regs->ip = caller;
>>         regs->sp += 8;
>> -
>> +done:
>>         return true;
>>
>>  sigsegv:
>> --
>> 1.7.9.5
>>
>
> This has the same odd property as the sigsegv path that the faulting
> instruction will appear to be the mov, not the syscall.  That seems to
> be okay, though -- various pieces of code that try to restart the segv
> are okay with that.

Yeah - I would otherwise do
  regs->ip += 9;
but I wanted to match the code that was therefor SIGSEGV.  If regs->ip
+= 9 _just_ for the SIGSYS case is fine, then I'll make that change
shortly.  Since any code that sees the vsyscall address should be wise
enough to avoid it, perhaps that's why the SIGSEGV hasn't had a
problem so far.

> Is there any code that assumes that changing rax (i.e. the syscall
> number) and restarting a syscall after SIGSYS will invoke the new
> syscall?  (The RET_TRACE path might be similar -- does the
> ptrace_event(PTRACE_EVENT_SECCOMP, data) in seccomp.c give a debugger
> a chance to synchronously cancel or change the syscall?

Unfortunately, it does in normal interception. I don't see any way out
of that quirk with vsyscall=emulate.  As is without seccomp,
vsyscall=emulate doesn't allow ptrace interception (or syscall
auditing for that matter) while vsyscall=native does.   So the option
here is to document the quirky interaction in
Documentation/prctl/seccomp_filter.txt.  In particular, if the tracer
sees either (time|gettimeofday|getcpu) and rip in the vsyscall page,
it will know it can't rewrite or bypass the call.    Is there a better
option?

Given that, I will include a tweak to the documentation to indicate
that behavior so that userspace authors of BPF programs that use
SECCOMP_RET_TRACE will be aware of the behavior.

> If those issues aren't problems, then:
>
> Reviewed-by: Andy Lutomirski <luto@...capital.net>
>
> (If the syscall number needs to change after the fact in the
> SECCOMP_RET_TRAP case, it'll be a mess.)

Nah - traps are delivered like the forced sigsegv path.

I'll spin a v3 soon including the documentation tweak and the ip
offset to match vsyscall=native behavior (regs->ip += 9 _just_ for the
skip case).  Of course, any better ideas for the trace-case will be
more than welcome, but it seems to me to be an acceptable tradeoff - I
hope others agree.

I'll make the changes and then put it through its paces to see if any
other little idiosyncrasies emerge.

Thanks for the close review!
will
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/