linux-kernel - Re: [PATCH] [RFC] syscalls,x86: Add execveat() system call (v2)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20120802103001.GJ6481@ZenIV.linux.org.uk>
Date:	Thu, 2 Aug 2012 11:30:02 +0100
From:	Al Viro <viro@...IV.linux.org.uk>
To:	Meredydd Luff <meredydd@...atehouse.org>
Cc:	"H. Peter Anvin" <hpa@...or.com>, linux-kernel@...r.kernel.org,
	Kees Cook <keescook@...omium.org>,
	Ingo Molnar <mingo@...hat.com>, Jeff Dike <jdike@...toit.com>,
	Richard Weinberger <richard@....at>,
	Andrew Morton <akpm@...ux-foundation.org>,
	linux-arch@...r.kernel.org
Subject: Re: [PATCH] [RFC] syscalls,x86: Add execveat() system call (v2)

On Thu, Aug 02, 2012 at 10:14:53AM +0100, Meredydd Luff wrote:
> On Thu, Aug 2, 2012 at 7:55 AM, Al Viro <viro@...iv.linux.org.uk> wrote:
> >> This means you need an x32 version of the function -- execve
> >> unfortunately is one of the few system calls which require a special x32
> >> version (although it's a simple wrapper around sys32_execve).  See
> >> sys_x32_execve.
> >
> > I *really* strongly object to doing that thing before we sanitize the
> > situation with sys_execve().
> 
> "That thing" = "creating an x32 entry stub", or "merging execveat() at all"?
> 
> (snip)
> > The thing is, there's essentially no reason to have more than one
> > implementation.  What they are (badly) doing is "we need to find
> > pt_regs to pass to do_execve(), the thing we are after has to be near
> > our stack frame, so let's try to get to it that way".
> 
> Hang on...it's not just sys_execve that fits that description, is it?
> You seem to be describing every call that needs a pt_regs parameter,
> which at a glance is anything with a stub_ or PTREGSCALL in
> arch/x86/kernel/entry_{32,64}.S. That's: clone, fork, vfork,
> sigaltstack, iopl, execve, sigreturn, rt_sigreturn, vm86, vm86old.
> Most of those are handled by a common PTREGSCALL macro, but there are
> a few that get special treatment (different set on each arch - on
> x86-64 it's execve and rt_sigreturn ; on i386 it's just clone).
> 
> Is there's something special about execve in particular, or do you
> want to overhaul all the ptregscalls?

You are looking at that from the wrong side; it's not that this
set of syscalls on x86 has the same kind of wrapper - it's that
on different architectures the kludges are seriously different
and fairly brittle.  sigaltstack(), BTW, doesn't need to be
arch-specific at all - if you check what its pt_regs argument
is used for, we just need something like current_user_stack_pointer()
and that's it.  It's a different patch series, anyway.

There's a difference between "here's the syscall implementation,
here's its hookup for x86, let other architectures update their
syscall tables" and "here's the implementation in arch/x86 and its
hookup for x86; let other architectures port it in whatever way
they need".  The latter is a recipe for breakage - we'd been
through that kind of story quite a few times.

Out of that set, vm86/vm86old/iopl are genuine x86-only syscalls.
sigreturn and rt_sigreturn are weird, subtle and, sadly,
unmergable - syscall restart prevention logics is different
enough to make it not feasible at the moment.  It's a serious
source of PITA, BTW - quite a few of those had the same bugs,
years after they'd been discovered and fixed on a subset of
architectures.  Been there, fixed some of those, the latest
batch this April...  fork/vfork/clone is an interesting
question - IIRC, there are seriously subtle issues of some
sort on sparc, but I don't remember details right now.

Really, go and grep for do_execve; most of the callers will
be in execve(2) implementations.  Look at them carefully -
while some get pt_regs from some wrapper, there's a bunch
that does that manually in C.  "Ugly" doesn't even begin
to describe what's being done...

FWIW, I've just pushed (completely untested) arm and alpha
parts of what I described into signal.git#execve2; x86 is
next.  Note that after that sys_execve() is identical on
converted architectures and can be merged; ditto for
kernel_execve().  After I do x86 counterpart, I'll
take those guys to fs/exec.c under ifdef for new __ARCH_HAS_...
(and define it on already converted ones, obviously).
Then your patch goes there, except that implementation
gets put into fs/exec.c, under the same ifdef.  And with
current_pt_regs() used instead of the extra argument,
of course.  From that point on it can be used on any converted
architecture.  And conversion consists of
	* providing ret_from_kernel_execve() that would
put pt_regs where they belong and return to userland.
	* providing current_pt_regs(), with default being
just task_pt_regs(current).
	* defining that new __ARCH_HAS_...
	* removing sys_execve()/kernel_execve() implementations;
the ones in fs/exec.c will work just fine.
Can be done at leisure, architecture by architecture.  It's
obviously the next cycle fodder - we have only a couple of
days left of this merge window, anyway.

I really think that this pair of primitives is the right
way to deal with execve mess.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/