Message-ID: <87pmwddfar.fsf@disp2133>
Date: Tue, 22 Jun 2021 11:39:40 -0500
From: ebiederm@...ssion.com (Eric W. Biederman)
To: Al Viro <viro@...iv.linux.org.uk>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Michael Schmitz <schmitzmic@...il.com>,
linux-arch <linux-arch@...r.kernel.org>,
Jens Axboe <axboe@...nel.dk>, Oleg Nesterov <oleg@...hat.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Richard Henderson <rth@...ddle.net>,
Ivan Kokshaysky <ink@...assic.park.msu.ru>,
Matt Turner <mattst88@...il.com>,
alpha <linux-alpha@...r.kernel.org>,
Geert Uytterhoeven <geert@...ux-m68k.org>,
linux-m68k <linux-m68k@...ts.linux-m68k.org>,
Arnd Bergmann <arnd@...nel.org>,
Ley Foon Tan <ley.foon.tan@...el.com>,
Tejun Heo <tj@...nel.org>, Kees Cook <keescook@...omium.org>
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Al Viro <viro@...iv.linux.org.uk> writes:
> On Mon, Jun 21, 2021 at 11:50:56AM -0500, Eric W. Biederman wrote:
>> Al Viro <viro@...iv.linux.org.uk> writes:
>>
>> > On Mon, Jun 21, 2021 at 01:54:56PM +0000, Al Viro wrote:
>> >> On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote:
>> >>
>> >> > And I think our horrible "kernel threads return to user space when
>> >> > done" is absolutely horrifically nasty. Maybe of the clever sort, but
>> >> > mostly of the historical horror sort.
>> >>
>> >> How would you prefer to handle that, then? Separate magical path from
>> >> kernel_execve() to switch to userland? We used to have something of
>> >> that sort, and that had been a real horror...
>> >>
>> >> As it is, it's "kernel thread is spawned at the point similar to
>> >> ret_from_fork(), runs the payload (which almost never returns) and
>> >> then proceeds out to userland, same way fork(2) would've done."
>> >> That way kernel_execve() doesn't have to do anything magical.
>> >>
>> >> Al, digging through the old notes and current call graph...
>> >
>> > FWIW, the major assumption back then had been that get_signal(),
>> > signal_delivered() and all associated machinery (including coredumps)
>> > runs *only* from SIGPENDING/NOTIFY_SIGNAL handling.
>> >
>> > And "has complete registers on stack" is only a part of that;
>> > there was other fun stuff in the area ;-/ Do we want coredumps for
>> > those, and if we do, will the de_thread stuff work there?
>>
>> Do we want coredumps from processes that use io_uring? yes
>> Exactly what we want from io_uring threads is less clear. We can't
>> really give much that is meaningful beyond the thread ids of the
>> io_uring threads.
>>
>> What problems are you seeing beyond the missing registers on the
>> stack for kernel threads?
>>
>> I don't immediately see the connection between coredumps and de_thread.
>>
>> The function de_thread arranges for the fatal_signal_pending to be true,
>> and that should work just fine for io_uring threads. The io_uring
>> threads process the fatal_signal with get_signal and then proceed to
>> exit eventually calling do_exit.
>
> I would like to see the testing in cases when the io-uring thread is
> the one getting hit by initial signal and when it's the normal one
> with associated io-uring ones. The thread-collecting logics at least
> used to depend upon fairly subtle assumptions, and "kernel threads
> obviously can't show up as candidates" used to narrow the analysis
> down...
>
> In any case, WTF would we allow reads or writes to *any* registers of
> such threads? It's not as simple as "just return zeroes", BTW - the
> values allowed in special registers might have non-trivial constraints
> on them. The same goes for coredump - we don't _have_ registers to
> dump for those, period.
>
> Looks like the first things to do would be
> * prohibit ptrace accessing any regsets of worker threads
> * make coredump skip all register notes for those

Skipping register notes is fine. Prohibiting ptrace access to any
regsets of worker threads is interesting. I think that was tried and
shown to confuse gdb. So the conclusion was just to provide a fake set
of registers.

This appears to work, up to the point of dealing with architectures
that have their magic callee-saved register optimization (like alpha and
m68k) and no check that all of the registers were saved when they are
accessed. Adding a dummy switch stack frame for the kernel threads on
those architectures looks like a good/cheap solution at first glance.
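
To make the two options discussed above concrete, here is a rough
userspace model of "prohibit regset access" versus "provide a fake set
of registers". It is only a sketch, not kernel code: struct toy_task,
TOY_NREGS and both toy_regset_get_*() helpers are hypothetical names
invented for illustration, and the -EIO return value is an assumption.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; none of these are real kernel definitions. */
#define TOY_NREGS 8

struct toy_task {
	bool is_io_worker;		/* models "this is an io_uring worker thread" */
	unsigned long regs[TOY_NREGS];	/* models the saved user register frame */
};

/* Option 1: refuse regset access for worker threads outright. */
int toy_regset_get_strict(const struct toy_task *t,
			  unsigned long *out, size_t n)
{
	if (n > TOY_NREGS)
		return -EINVAL;
	if (t->is_io_worker)
		return -EIO;	/* the tracer gets an error instead of registers */
	memcpy(out, t->regs, n * sizeof(*out));
	return 0;
}

/*
 * Option 2: hand back a fake register set for worker threads.
 * Zero-filling is a simplification made for this sketch; real special
 * registers may have constraints on the values they can hold.
 */
int toy_regset_get_fake(const struct toy_task *t,
			unsigned long *out, size_t n)
{
	if (n > TOY_NREGS)
		return -EINVAL;
	if (t->is_io_worker)
		memset(out, 0, n * sizeof(*out));
	else
		memcpy(out, t->regs, n * sizeof(*out));
	return 0;
}
```

In this model a debugger calling the strict variant on a worker thread
has to cope with an error for one thread of an otherwise ordinary
process, while the fake variant always succeeds but returns values that
do not describe any real user-mode state.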
> Note, BTW, that kernel_thread() and kernel_execve() do *NOT* step into
> ptrace_notify() - explicit CLONE_UNTRACED for the former and zero
> current->ptrace in the caller of the latter. So fork and exec side
> has ptrace_event() crap limited to real syscalls.

That is where I thought we were. Thanks for confirming that.

> It's seccomp[1] and exit-related stuff that are messy...
>
> [1] "never trust somebody who introduces himself as Honest Joe and keeps
> carping on that all the time"; c.f. __secure_computing(), CONFIG_INTEGRITY,
> etc.