Date: Thu, 10 Feb 2022 12:43:14 -0800
From: Kees Cook <keescook@...omium.org>
To: "Eric W. Biederman" <ebiederm@...ssion.com>
Cc: Robert Święcki <robert@...ecki.net>,
	Andy Lutomirski <luto@...capital.net>,
	Will Drewry <wad@...omium.org>,
	linux-kernel@...r.kernel.org,
	linux-hardening@...r.kernel.org
Subject: Re: [PATCH 0/3] signal: HANDLER_EXIT should clear SIGNAL_UNKILLABLE

On Thu, Feb 10, 2022 at 12:58:07PM -0600, Eric W. Biederman wrote:
> Kees Cook <keescook@...omium.org> writes:
>
> > On Thu, Feb 10, 2022 at 12:17:50PM -0600, Eric W. Biederman wrote:
> >> Kees Cook <keescook@...omium.org> writes:
> >>
> >> > Hi,
> >> >
> >> > This fixes the signal refactoring to actually kill unkillable processes
> >> > when receiving a fatal SIGSYS from seccomp. Thanks to Robert for the
> >> > report and Eric for the fix! I've also tweaked seccomp internal a bit to
> >> > fail more safely. This was a partial seccomp bypass, in the sense that
> >> > SECCOMP_RET_KILL_* didn't kill the process, but it didn't bypass other
> >> > aspects of the filters. (i.e. the syscall was still blocked, etc.)
> >>
> >> Any luck on figuring out how to suppress the extra event?
> >
> > I haven't found a good single indicator of a process being in an "I am dying"
> > state, and even if I did, it seems every architecture's exit path would
> > need to add a new test.
>
> The "I am dying" state for a task is fatal_signal_pending, at least
> before get_signal is reached, for a process there is SIGNAL_GROUP_EXIT.
> Something I am busily cleaning up and making more reliable at the
> moment.

The state I need to catch is "I am dying and this syscall was
interrupted". fatal_signal_pending() is kind of only the first half
(though it doesn't cover fatal SIGSYS?)

For example, if a process hits a BUG() in the middle of running a
syscall, that syscall isn't expected to "exit" from the perspective of
userspace. This is similarly true for seccomp's fatal SIGSYS.

> What is the event that is happening?
> Is it tracehook_report_syscall_exit or something else?

Yes, but more completely, it's these three, which are called in
various fashions from architecture syscall exit code:

    audit_syscall_exit()             (audit)
    trace_sys_exit()                 (see "TRACE_EVENT_FN(sys_exit,")
    tracehook_report_syscall_exit()  (ptrace)

> From the bits I have seen it seems like something else.

But yes, the place Robert and I both noticed it was with ptrace from
tracehook_report_syscall_exit(), which is rather poorly named. :)

Looking at the results, audit_syscall_exit() and trace_sys_exit() need
to be skipped too, since they would each be reporting potential
nonsense.

> > The best approach seems to be clearing the TIF_*WORK* bits, but that's
> > still a bit arch-specific. And I'm not sure which layer would do that.
> > At what point have we decided the process will not continue? More
> > than seccomp was calling do_exit() in the middle of a syscall, but those
> > appear to have all been either SIGKILL or SIGSEGV?
>
> This is where I get confused what TIF_WORK bits matter?

This is where I wish all the architectures were using the common
syscall code. The old do_exit() path would completely skip _everything_
in the exit path, so it was like never calling anything after the
syscall dispatch table. The only userspace visible things in there are
triggered from having TIF_WORK... flags (but again, it's kind of a
per-arch mess).

Skipping the entire exit path makes a fair bit of sense. For example,
rseq_syscall() is redundant (forcing SIGSEGV). Regardless, at least the
three places above need to be skipped.

But just testing fatal_signal_pending() seems wrong: a normal syscall
could be finishing just fine, it just happens to have a fatal signal
ready to be processed.
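To make the distinction above concrete, here is a minimal user-space
sketch in C. It is only an illustrative model, not kernel code: struct
task_model, its two fields, and all three helper functions are
hypothetical names invented for this example, and the "three reports"
simply stand in for the audit, tracepoint, and ptrace hooks listed
earlier.

```c
#include <stdbool.h>

/*
 * Hypothetical user-space model; nothing here is a real kernel API.
 * "dying" stands in for "a fatal event has been raised for the task".
 * "syscall_interrupted" stands in for "that event aborted the syscall
 * that was running" (e.g. seccomp's fatal SIGSYS, or a BUG()).
 */
struct task_model {
	bool dying;
	bool syscall_interrupted;
};

/* Naive test: skip exit reporting whenever the task is dying. */
static bool naive_skip_exit_work(const struct task_model *t)
{
	return t->dying;
}

/* Stricter test: dying AND the syscall never actually completed. */
static bool skip_exit_work(const struct task_model *t)
{
	return t->dying && t->syscall_interrupted;
}

/*
 * Number of exit reports (audit, tracepoint, ptrace) that would still
 * fire for this task under the stricter test: all three or none.
 */
static int exit_reports_fired(const struct task_model *t)
{
	return skip_exit_work(t) ? 0 : 3;
}
```

For a task with dying set but syscall_interrupted clear, the naive test
would suppress the reports for a syscall that finished normally, while
the stricter test leaves all three in place.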
Here's the ordering after a syscall on x86 from do_syscall_64():

do_syscall_x64()
    sys_call_table[...](regs)
syscall_exit_to_user_mode()
    __syscall_exit_to_user_mode_work()
        syscall_exit_to_user_mode_prepare()
            syscall_exit_work()
                arch_syscall_exit_tracehook()
                    tracehook_report_syscall_exit()
        exit_to_user_mode_prepare()
            exit_to_user_mode_loop()
                handle_signal_work()
                    arch_do_signal_or_restart()
                        get_signal()
                            do_group_exit()

Here's arm64 from el0_svc():

do_el0_svc()
    el0_svc_common()
        invoke_syscall()
            syscall_table[...](regs)
        syscall_trace_exit()
            tracehook_report_syscall()
                tracehook_report_syscall_exit()
exit_to_user_mode()
    prepare_exit_to_user_mode()
        do_notify_resume()
            do_signal()
                get_signal()
                    do_group_exit()

In the past, any do_exit() would short circuit everything after the
syscall table. Now, we do all the exit work before starting the return
to user mode, which is what processes the signals.

So I guess there's more precisely a difference between "visible to
userspace" and "return to userspace".

(an aside: where do PF_IO_WORKER threads die?)

> I expect if anything else mattered we would need to change it to
> HANDLER_EXIT.
>
> I made a mistake conflating two cases and I want to make certain I
> successfully separate those two cases at the end of the day.

For skipping the exit work, I'm not sure it matters, since all the
signal stuff is "too late"...

-- 
Kees Cook