linux-kernel - Re: [regression] boot failure on alpha, bisected

Message-ID: <20121007193909.GK2616@ZenIV.linux.org.uk>
Date:	Sun, 7 Oct 2012 20:39:09 +0100
From:	Al Viro <viro@...IV.linux.org.uk>
To:	Oleg Nesterov <oleg@...hat.com>
Cc:	dl8bcu@...bcu.de, peterz@...radead.org, mingo@...nel.org,
	linux-kernel@...r.kernel.org, linux-alpha@...r.kernel.org,
	Richard Henderson <rth@...ddle.net>,
	Ivan Kokshaysky <ink@...assic.park.msu.ru>,
	Matt Turner <mattst88@...il.com>
Subject: Re: [regression] boot failure on alpha, bisected

On Sun, Oct 07, 2012 at 07:33:36PM +0200, Oleg Nesterov wrote:

> > Um...  There's a bunch of architectures that are in the same situation.
> > grep for do_notify_resume() and you'll see...
> 
> And every do_notify_resume() should be changed anyway, do_signal() and
> tracehook_notify_resume() should be re-ordered.

There's a bit more to it.  The thing is, we have quite a mess around
the signal-handling loops, mixed with that regarding the signal restarts.
On arm it's done about right by now:
	* looping until all signals had been handled is done in C;
none of that "loop in asm glue" nonsense anymore.
	* prevention of double restarts is *also* there, TYVM.
	* do_work_pending() is called with interrupts disabled.
It may return 0, in which case we are done, interrupts are disabled
and the caller should proceed to userland without reenabling them
until it leaves.  Otherwise we have a syscall restart to handle and
no userland signal handler had been invoked.  Interrupts are enabled
and we should simply reload arguments and syscall number from pt_regs
and proceed to syscall entry, without returning to userland.  The only
twist is that negative return value means ERESTART_RESTARTBLOCK kind
of restart, in which case we need to use __NR_restart_syscall for
syscall number.
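
To make the return-value contract above concrete, here is a minimal userland sketch of how a caller might interpret it.  This is not the ARM code; `struct fake_regs`, `next_step()` and the value of `FAKE_NR_restart_syscall` are invented for illustration only.

```c
#include <stdbool.h>

/* Hypothetical stand-in; the real pt_regs layout is arch-specific. */
struct fake_regs {
	long syscall_nr;	/* syscall number saved at entry */
	long args[6];		/* syscall arguments saved at entry */
};

/* Placeholder value, not a real ABI number. */
#define FAKE_NR_restart_syscall 0

enum exit_action { RETURN_TO_USER, REENTER_SYSCALL };

struct exit_decision {
	enum exit_action action;
	bool irqs_enabled;	/* interrupt state the caller finds itself in */
	long syscall_nr;	/* meaningful only for REENTER_SYSCALL */
};

/*
 * Interpret the value returned by a do_work_pending()-style helper:
 *   0  -> done, interrupts still disabled, go to userland
 *   >0 -> handlerless restart, reuse the syscall number from regs
 *   <0 -> ERESTART_RESTARTBLOCK-style restart, use restart_syscall
 */
static struct exit_decision next_step(int ret, const struct fake_regs *regs)
{
	struct exit_decision d;

	if (ret == 0) {
		d.action = RETURN_TO_USER;
		d.irqs_enabled = false;
		d.syscall_nr = -1;
		return d;
	}
	d.action = REENTER_SYSCALL;
	d.irqs_enabled = true;
	d.syscall_nr = ret < 0 ? FAKE_NR_restart_syscall : regs->syscall_nr;
	return d;
}
```

In both restart cases the arguments would be reloaded from the saved registers as well; the sketch tracks only the syscall number.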

Note that we do *not* go through return to userland and reentering the
kernel on handlerless syscall restarts.  S390 uses the same model, but
there it's done in assembler glue - for no good reason.  Should be in
straight C.

For alpha there's another twist, though - there we do _not_ save all
registers in pt_regs; there's a fairly large chunk of callee-saved
registers we don't need to protect from being messed by C parts of
the kernel.  We do need to save them in sigcontext, though.  So alpha
(and quite a few other architectures) has separate struct switch_stack
(named so since switch_to() needs to save/restore the same registers).
Rules:
	* on fork() et al. we save those callee-saved registers in
struct switch_stack, right next to pt_regs.  We do that before calling
the actual sys_fork() and have copy_thread() copy these guys into
child.  Remember that newborns are first woken up in ret_from_fork
and as with all context switches they go through switch_to().  So these
registers are restored by the time the sucker wakes up.
	* on signal delivery we save those registers in struct switch_stack
and use it, along with pt_regs it lives next to, to fill sigcontext.
	* ptrace counts on those suckers being next to pt_regs.  That allows
tracer to modify tracee's registers, including callee-saved ones.  So we
(1) restore them from switch_stack once we are done with do_signal() and
(2) save/restore them around another place where we can get stopped for
tracer to examine us - PTRACE_SYSCALL-induced paths in syscall handling.
	* on sigreturn/rt_sigreturn we need to restore all registers.
So we reserve switch_stack on stack, next to pt_regs and have the C part of
sigreturn fill those along with pt_regs.  Once we are done, read those
registers from switch_stack.
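
The "switch_stack lives next to pt_regs" rule can be sketched in plain C.  Everything here is hypothetical: the `fake_*` names, the register counts and the ordering inside `struct fake_frame` are assumptions made for the example, not the alpha definitions.  The point is only that, given a pointer to the entry-saved registers, the callee-saved ones can be found at a fixed offset, so sigcontext can be filled from both and written back to both on sigreturn.

```c
#include <stddef.h>
#include <string.h>

#define NR_ENTRY_SAVED	4	/* illustrative count */
#define NR_CALLEE_SAVED	7	/* illustrative count */

/* Registers saved on every kernel entry (stand-in for pt_regs). */
struct fake_pt_regs {
	unsigned long r[NR_ENTRY_SAVED];
	unsigned long pc;
};

/* Callee-saved registers, saved only when needed. */
struct fake_switch_stack {
	unsigned long s[NR_CALLEE_SAVED];
};

/* Assumed layout: the two structures are adjacent on the kernel stack. */
struct fake_frame {
	struct fake_switch_stack sw;
	struct fake_pt_regs regs;
};

/* What userland sees: the full register set. */
struct fake_sigcontext {
	unsigned long r[NR_ENTRY_SAVED];
	unsigned long s[NR_CALLEE_SAVED];
	unsigned long pc;
};

/* Find the switch_stack that sits next to a given pt_regs. */
static struct fake_switch_stack *sw_of(struct fake_pt_regs *regs)
{
	struct fake_frame *f = (struct fake_frame *)
		((char *)regs - offsetof(struct fake_frame, regs));
	return &f->sw;
}

/* Signal delivery: sigcontext is filled from both structures. */
static void fill_sigcontext(struct fake_sigcontext *sc,
			    struct fake_pt_regs *regs)
{
	const struct fake_switch_stack *sw = sw_of(regs);

	memcpy(sc->r, regs->r, sizeof(sc->r));
	memcpy(sc->s, sw->s, sizeof(sc->s));
	sc->pc = regs->pc;
}

/* sigreturn: both structures are refilled from sigcontext. */
static void restore_from_sigcontext(const struct fake_sigcontext *sc,
				    struct fake_pt_regs *regs)
{
	struct fake_switch_stack *sw = sw_of(regs);

	memcpy(regs->r, sc->r, sizeof(regs->r));
	memcpy(sw->s, sc->s, sizeof(sw->s));
	regs->pc = sc->pc;
}
```

After `restore_from_sigcontext()` the glue would still have to load the callee-saved registers from the switch_stack area; that step has no equivalent in portable C and is left out.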

That's more or less it; many other architectures are doing more or less
similar things, but not all of them put that stuff into separate structure.
E.g. another valid solution is to leave space in pt_regs, fill only a subset
on entry and have switch_to() save stuff in task_struct instead of putting
it on kernel stack.

What it means for us is that saving all that crap on stack should *not*
be done unless we have work to do.  OTOH, in situations when we have
more than one pending signal it's bloody dumb to save/restore around
each do_notify_resume() call separately.  OTTH, in situation when we'd
run out of timeslice and had nothing arrive until we'd regained CPU
save/restore around schedule() is pointless at the very least.  So for
things like alpha I'd do this:

	interrupts disabled
	check thread flags
	no work to do => bugger off to userland
	just NEED_RESCHED?
		schedule()
		reread thread flags
		no work to do => bugger off to userland
	save callee-saved registers
	call do_work_pending
	restore callee-saved registers
	if do_work_pending returned 0 => bugger off to userland
	deal with handlerless restart
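
The same flow as a small C model, purely to show where the save/restore of callee-saved registers would and would not happen.  The real thing has to live in asm glue; the `FAKE_*` flags, `struct fake_task` and the counters are invented for this sketch, and `fake_do_work_pending()` merely stands in for the C loop around do_signal() and friends.

```c
#define FAKE_NEED_RESCHED	0x1u
#define FAKE_SIGPENDING		0x2u
#define FAKE_NOTIFY_RESUME	0x4u
#define FAKE_WORK_MASK \
	(FAKE_NEED_RESCHED | FAKE_SIGPENDING | FAKE_NOTIFY_RESUME)

struct fake_task {
	unsigned int flags;			/* current thread flags */
	unsigned int flags_after_schedule;	/* flags once we regain the CPU */
	int work_result;			/* what the work loop will return */
	int schedules, saves, restores;		/* bookkeeping for the example */
};

static void fake_schedule(struct fake_task *t)
{
	t->schedules++;
	t->flags = t->flags_after_schedule;
}

/* Stand-in for the C loop; handles everything and clears the flags. */
static int fake_do_work_pending(struct fake_task *t)
{
	t->flags = 0;
	return t->work_result;
}

/*
 * Called with (notional) interrupts disabled.  Returns 0 when the
 * caller should go to userland, nonzero when it has a handlerless
 * restart to deal with.
 */
static int exit_to_user(struct fake_task *t)
{
	int ret;

	if (!(t->flags & FAKE_WORK_MASK))
		return 0;			/* no work to do */

	if ((t->flags & FAKE_WORK_MASK) == FAKE_NEED_RESCHED) {
		fake_schedule(t);		/* no save/restore around it */
		if (!(t->flags & FAKE_WORK_MASK))
			return 0;
	}

	t->saves++;				/* save callee-saved registers */
	ret = fake_do_work_pending(t);
	t->restores++;				/* restore callee-saved registers */
	return ret;
}
```

With only NEED_RESCHED set and nothing arriving in the meantime, the model reschedules once and never touches the callee-saved registers; with several signals pending it saves and restores exactly once.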

Note that the loop around do_signal() and friends is in C and is fairly
similar to what we've got on ARM.  x86 is in intermediate situation -
the main complication there is v86 crap.

I'd say that for now your variant should do, but we really need to get
that crap under control and out of asm glue.  Are you willing to participate?
Guys, we need a way to do cross-architecture work without going insane.
I've spent quite a bit of time this year crawling through that stuff.
And yes, it's getting better as the result, but it's not sustainable -
I have VFS work to do, after all.

Basically, we need more people willing to take part in that; ideally -
architecture maintainers, but some of them are semi-MIA.  The areas
involved:
	* kernel_thread()/kernel_execve()/sys_execve()/fork()/vfork()/clone() -
quite a bit of that is already done and I hope we'll regularize that crap
in the coming cycle.
	* signal handling in general - a lot got done this spring and summer,
quite a bit more is possible to unify.  I've got a long list of common
landmines not to step upon and unfortunately it's *very* common to have
architectures step on a bunch of those.
	* syscall restarts - see above; note that e.g. prevention of
double restarts and restarts on sigreturn is subtle, arch-dependent
and had been broken on *many* architectures.  And I'm not at all sure
we'd got all suckers fixed.
	* ptrace work, especially around PTRACE_SYSCALL handling.  I suspect
that the right way to handle it is a new regset aliasing the normal registers,
so that access to syscall arguments would be arch-independent.  We can
do that, and it would simplify the living hell out of e.g. audit hookup.
Another (and closely related) thing is conversion to tracehook_report_syscall_*;
the tricky bit is that we probably want uniform semantics for things like
modifying syscall arguments via ptrace; some architectures do it right and
reload arguments and syscall number from pt_regs after they'd done
tracehook_report_syscall_entry(), but not all of them do.  Moreover, we
probably want to short-circuit the syscall itself when PTRACE_CONT had
been done with "and deliver SIGKILL to the tracee" as e.g. x86, sparc and
ppc do.
	* interplay between single-stepping and syscall restarts.  Really,
really nasty.  And needs involvement of e.g. gdb people to sort out.
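
For the PTRACE_SYSCALL item above, a rough userland model of the behaviour being asked for: reload the syscall number and arguments from the saved registers after the tracer has had its say, and skip the syscall altogether if the entry hook says so.  The hook type, the two-entry syscall table and all `fake_*`/`hook_*` names are made up for the example; they are not the tracehook API.

```c
#include <errno.h>
#include <stddef.h>

struct fake_regs {
	long nr;	/* syscall number */
	long args[6];	/* syscall arguments */
	long ret;	/* return value slot */
};

/* Entry hook: may modify regs; nonzero return means "skip the syscall". */
typedef int (*entry_hook)(struct fake_regs *);

#define FAKE_NR_SYSCALLS 2

static long sys_first_arg(const long *a) { return a[0]; }
static long sys_sum2(const long *a) { return a[0] + a[1]; }

static long (*const fake_table[FAKE_NR_SYSCALLS])(const long *) = {
	sys_first_arg,
	sys_sum2,
};

/* Example tracer that redirects the call to syscall 1. */
static int hook_switch_to_sum(struct fake_regs *regs)
{
	regs->nr = 1;
	return 0;
}

/* Example tracer outcome that aborts the call (e.g. tracee being killed). */
static int hook_abort(struct fake_regs *regs)
{
	(void)regs;
	return 1;
}

static long traced_syscall(struct fake_regs *regs, entry_hook hook)
{
	long nr;

	if (hook && hook(regs))
		return regs->ret;	/* short-circuit: syscall not run */

	/* Reload *after* the hook, so tracer modifications take effect. */
	nr = regs->nr;
	if (nr < 0 || nr >= FAKE_NR_SYSCALLS)
		regs->ret = -ENOSYS;
	else
		regs->ret = fake_table[nr](regs->args);
	return regs->ret;
}
```

An implementation that picked up the number and arguments before calling the hook would ignore the tracer's changes, which is the inconsistency between architectures described above.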

	We really need that stuff sanely synchronized between architectures.
I'm willing to keep participating in that work, but I can't do that alone.
It's simply not survivable.