Message-ID: <202202101137.B48D02138@keescook>
Date: Thu, 10 Feb 2022 12:43:14 -0800
From: Kees Cook <keescook@...omium.org>
To: "Eric W. Biederman" <ebiederm@...ssion.com>
Cc: Robert Święcki <robert@...ecki.net>,
Andy Lutomirski <luto@...capital.net>,
Will Drewry <wad@...omium.org>, linux-kernel@...r.kernel.org,
linux-hardening@...r.kernel.org
Subject: Re: [PATCH 0/3] signal: HANDLER_EXIT should clear SIGNAL_UNKILLABLE
On Thu, Feb 10, 2022 at 12:58:07PM -0600, Eric W. Biederman wrote:
> Kees Cook <keescook@...omium.org> writes:
>
> > On Thu, Feb 10, 2022 at 12:17:50PM -0600, Eric W. Biederman wrote:
> >> Kees Cook <keescook@...omium.org> writes:
> >>
> >> > Hi,
> >> >
> >> > This fixes the signal refactoring to actually kill unkillable processes
> >> > when receiving a fatal SIGSYS from seccomp. Thanks to Robert for the
> >> > report and Eric for the fix! I've also tweaked seccomp internal a bit to
> >> > fail more safely. This was a partial seccomp bypass, in the sense that
> >> > SECCOMP_RET_KILL_* didn't kill the process, but it didn't bypass other
> >> > aspects of the filters. (i.e. the syscall was still blocked, etc.)
> >>
> >> Any luck on figuring out how to suppress the extra event?
> >
> > I haven't found a good single indicator of a process being in an "I am dying"
> > state, and even if I did, it seems every architecture's exit path would
> > need to add a new test.
>
> The "I am dying" state for a task is fatal_signal_pending, at least
> before get_signal is reached, for a process there is SIGNAL_GROUP_EXIT.
> Something I am busily cleaning up and making more reliable at the
> moment.
The state I need to catch is "I am dying and this syscall was
interrupted". fatal_signal_pending() is kind of only the first half
(though it doesn't cover fatal SIGSYS?)
For example, if a process hits a BUG() in the middle of running a
syscall, that syscall isn't expected to "exit" from the perspective of
userspace. This is similarly true for seccomp's fatal SIGSYS.
> What is the event that is happening? Is it
> tracehook_report_syscall_exit or something else?
Yes, but more completely, it's these three, which are called in
various fashions from architecture syscall exit code:

    audit_syscall_exit()             (audit)
    trace_sys_exit()                 (see "TRACE_EVENT_FN(sys_exit,")
    tracehook_report_syscall_exit()  (ptrace)

> From the bits I have seen it seems like something else.
But yes, the place Robert and I both noticed it was with ptrace from
tracehook_report_syscall_exit(), which is rather poorly named. :)
Looking at the results, audit_syscall_exit() and trace_sys_exit() need
to be skipped too, since they would each be reporting potential nonsense.
> > The best approach seems to be clearing the TIF_*WORK* bits, but that's
> > still a bit arch-specific. And I'm not sure which layer would do that.
> > At what point have we decided the process will not continue? More
> > than seccomp was calling do_exit() in the middle of a syscall, but those
> > appear to have all been either SIGKILL or SIGSEGV?
>
> This is where I get confused what TIF_WORK bits matter?
This is where I wish all the architectures were using the common syscall
code. The old do_exit() path would completely skip _everything_ in the
exit path, so it was like never calling anything after the syscall
dispatch table. The only userspace visible things in there are triggered
from having TIF_WORK... flags (but again, it's kind of a per-arch mess).
Skipping the entire exit path makes a fair bit of sense. For example,
rseq_syscall() is redundant (forcing SIGSEGV).
Regardless, at least the three places above need to be skipped.
But just testing fatal_signal_pending() seems wrong: a normal syscall
could be finishing just fine, it just happens to have a fatal signal
ready to be processed.
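
To make that distinction concrete, here is a toy userspace model in C. It
is not kernel code: the struct, the syscall_aborted flag, and both helper
names are hypothetical and exist only for illustration. It shows why
keying the skip on a pending fatal signal alone would also hide the exit
of a syscall that completed normally.

```c
/* Toy userspace model, NOT kernel code. Every name here is hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_task {
    bool fatal_signal_pending; /* a fatal signal is queued for the task */
    bool syscall_aborted;      /* assumed flag: the syscall was killed mid-flight */
};

/* Naive test: skip exit reporting whenever a fatal signal is pending. */
static bool skip_exit_report_naive(const struct toy_task *t)
{
    assert(t != NULL);
    return t->fatal_signal_pending;
}

/* Narrower test: skip only when the syscall itself never completed. */
static bool skip_exit_report_narrow(const struct toy_task *t)
{
    assert(t != NULL);
    return t->syscall_aborted;
}
```

With a task whose syscall finished but which happens to have a fatal
signal queued, the naive test skips the report and the narrow one does
not. Where such a flag would live, and who would set it, is exactly the
open question.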
Here's the ordering after a syscall on x86 from do_syscall_64():

    do_syscall_x64()
        sys_call_table[...](regs)
    syscall_exit_to_user_mode()
        __syscall_exit_to_user_mode_work()
            syscall_exit_to_user_mode_prepare()
                syscall_exit_work()
                    arch_syscall_exit_tracehook()
                        tracehook_report_syscall_exit()
            exit_to_user_mode_prepare()
                exit_to_user_mode_loop()
                    handle_signal_work()
                        arch_do_signal_or_restart()
                            get_signal()
                                do_group_exit()

Here's arm64 from el0_svc():

    do_el0_svc()
        el0_svc_common()
            invoke_syscall()
                syscall_table[...](regs)
            syscall_trace_exit()
                tracehook_report_syscall()
                    tracehook_report_syscall_exit()
    exit_to_user_mode()
        prepare_exit_to_user_mode()
            do_notify_resume()
                do_signal()
                    get_signal()
                        do_group_exit()

In the past, any do_exit() would short circuit everything after the
syscall table. Now, we do all the exit work before starting the return
to user mode which is what processes the signals. So I guess there's
more precisely a difference between "visible to userspace" and "return
to userspace".
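
The change in ordering can be mimicked with a small userspace simulation
in C. This is only a sketch under stated assumptions: the event names and
the toy_* functions are hypothetical stand-ins, not kernel interfaces. The
"old" function models do_exit() cutting off everything after dispatch;
the "new" one models exit work running before signal processing.

```c
/* Toy userspace model of the ordering, NOT kernel code; names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_event { EV_DISPATCH, EV_EXIT_REPORT, EV_SIGNAL_EXIT };

struct toy_log {
    enum toy_event ev[8];
    size_t n;
};

static void toy_emit(struct toy_log *log, enum toy_event e)
{
    assert(log != NULL && log->n < 8);
    log->ev[log->n++] = e;
}

/* Old behaviour: a fatal event inside the syscall ends the task at once. */
static void toy_syscall_old(struct toy_log *log, bool fatal)
{
    toy_emit(log, EV_DISPATCH);
    if (fatal)
        return;                        /* models do_exit(): nothing else runs */
    toy_emit(log, EV_EXIT_REPORT);
}

/* New behaviour: the fatal signal is acted on only during return to user mode. */
static void toy_syscall_new(struct toy_log *log, bool fatal)
{
    toy_emit(log, EV_DISPATCH);
    toy_emit(log, EV_EXIT_REPORT);     /* exit work runs first, so tracers see it */
    if (fatal)
        toy_emit(log, EV_SIGNAL_EXIT); /* models get_signal() -> do_group_exit() */
}
```

For a fatal syscall, the old path logs only the dispatch, while the new
path also logs an exit report before the task dies. That extra event is
the one a tracer now observes.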
(An aside: where do PF_IO_WORKER threads die?)
> I expect if anything else mattered we would need to change it to
> HANDLER_EXIT.
>
> I made a mistake conflating two cases and I want to make certain I
> successfully separate those two cases at the end of the day.
For skipping the exit work, I'm not sure it matters, since all the
signal stuff is "too late"...
--
Kees Cook