Message-ID: <202202101710.668EDCDC@keescook>
Date:   Thu, 10 Feb 2022 17:26:27 -0800
From:   Kees Cook <keescook@...omium.org>
To:     "Eric W. Biederman" <ebiederm@...ssion.com>
Cc:     Robert Święcki <robert@...ecki.net>,
        Andy Lutomirski <luto@...capital.net>,
        Will Drewry <wad@...omium.org>, linux-kernel@...r.kernel.org,
        linux-hardening@...r.kernel.org
Subject: Re: [PATCH 0/3] signal: HANDLER_EXIT should clear SIGNAL_UNKILLABLE

On Thu, Feb 10, 2022 at 04:48:30PM -0600, Eric W. Biederman wrote:
> Kees Cook <keescook@...omium.org> writes:
> 
> > On Thu, Feb 10, 2022 at 12:58:07PM -0600, Eric W. Biederman wrote:
> >> Kees Cook <keescook@...omium.org> writes:
> >> 
> >> > On Thu, Feb 10, 2022 at 12:17:50PM -0600, Eric W. Biederman wrote:
> >> >> Kees Cook <keescook@...omium.org> writes:
> >> >> 
> >> >> > Hi,
> >> >> >
> >> >> > This fixes the signal refactoring to actually kill unkillable processes
> >> >> > when receiving a fatal SIGSYS from seccomp. Thanks to Robert for the
> >> >> > report and Eric for the fix! I've also tweaked seccomp internals a bit to
> >> >> > fail more safely. This was a partial seccomp bypass, in the sense that
> >> >> > SECCOMP_RET_KILL_* didn't kill the process, but it didn't bypass other
> >> >> > aspects of the filters. (i.e. the syscall was still blocked, etc.)
> >> >> 
> >> >> Any luck on figuring out how to suppress the extra event?
> >> >
> >> > I haven't found a good single indicator of a process being in an "I am dying"
> >> > state, and even if I did, it seems every architecture's exit path would
> >> > need to add a new test.
> >> 
> >> The "I am dying" state for a task is fatal_signal_pending, at least
> >> before get_signal is reached, for a process there is SIGNAL_GROUP_EXIT.
> >> Something I am busily cleaning up and making more reliable at the
> >> moment.
> >
> > The state I need to catch is "I am dying and this syscall was
> > interrupted". fatal_signal_pending() is kind of only the first half
> > (though it doesn't cover fatal SIGSYS?)
> >
> > For example, if a process hits a BUG() in the middle of running a
> > syscall, that syscall isn't expected to "exit" from the perspective of
> > userspace. This is similarly true for seccomp's fatal SIGSYS.
> >
> >> What is the event that is happening?  Is it
> >> tracehook_report_syscall_exit or something else?
> >
> > Yes, but more completely, it's these three, which are called in
> > various fashions from architecture syscall exit code:
> >
> > 	audit_syscall_exit()		(audit)
> > 	trace_sys_exit()		(see "TRACE_EVENT_FN(sys_exit,")
> > 	tracehook_report_syscall_exit()	(ptrace)
> >
> >> From the bits I have seen it seems like something else.
> >
> > But yes, the place Robert and I both noticed it was with ptrace from
> > tracehook_report_syscall_exit(), which is rather poorly named. :)
> 
> Speaking of patches I am just about to send out.
> 
> > Looking at the results, audit_syscall_exit() and trace_sys_exit() need
> > to be skipped too, since they would each be reporting potential nonsense.
> >
> >> > The best approach seems to be clearing the TIF_*WORK* bits, but that's
> >> > still a bit arch-specific. And I'm not sure which layer would do that.
> >> > At what point have we decided the process will not continue? More
> >> > than seccomp was calling do_exit() in the middle of a syscall, but those
> >> > appear to have all been either SIGKILL or SIGSEGV?
> >> 
> >> This is where I get confused: what TIF_WORK bits matter?
> >
> > This is where I wish all the architectures were using the common syscall
> > code. The old do_exit() path would completely skip _everything_ in the
> > exit path, so it was like never calling anything after the syscall
> > dispatch table. The only userspace visible things in there are triggered
> > from having TIF_WORK... flags (but again, it's kind of a per-arch mess).
> >
> > Skipping the entire exit path makes a fair bit of sense. For example,
> > rseq_syscall() is redundant (forcing SIGSEGV).
> >
> > Regardless, at least the three places above need to be skipped.
> >
> > But just testing fatal_signal_pending() seems wrong: a normal syscall
> > could be finishing just fine, it just happens to have a fatal signal
> > ready to be processed.
> 
> Yes.  It is really just the HANDLER_EXIT case where this is interesting.

Right.

> 
> >
> > Here's the ordering after a syscall on x86 from do_syscall_64():
> >
> > do_syscall_x64()
> > 	sys_call_table[...](regs)
> > syscall_exit_to_user_mode()
> > 	__syscall_exit_to_user_mode_work()
> > 		syscall_exit_to_user_mode_prepare()
> > 			syscall_exit_work()
> > 				arch_syscall_exit_tracehook()
> > 					tracehook_report_syscall_exit()
> > 	exit_to_user_mode_prepare()
> > 		exit_to_user_mode_loop()
> > 			handle_signal_work()
> > 				arch_do_signal_or_restart()
> > 					get_signal()
> > 						do_group_exit()
> >
> > Here's arm64 from el0_svc():
> >
> > do_el0_svc()
> > 	el0_svc_common()
> > 		invoke_syscall()
> > 			syscall_table[...](regs)
> > 		syscall_trace_exit()
> > 			tracehook_report_syscall()
> > 				tracehook_report_syscall_exit()
> > exit_to_user_mode()
> > 	prepare_exit_to_user_mode()
> > 		do_notify_resume()
> > 			do_signal()
> > 				get_signal()
> > 					do_group_exit()
> >
> > In the past, any do_exit() would short circuit everything after the
> > syscall table. Now, we do all the exit work before starting the return
> > to user mode which is what processes the signals. So I guess there's
> > more precisely a difference between "visible to userspace" and "return
> > to userspace".
> 
> Yes.  I see that now.  I had not had an occasion to look at the order
> all of these were called in before and my mental model was wrong.

Yeah, I didn't even have a model of this all the way. I'd really only
understood the ptrace side of it.
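
To make the ordering difference concrete, here is a rough userspace model
in C. It is only a sketch: every name in it (kill_model, model_syscall(),
the event strings) is made up for illustration and none of it is kernel
code. It records which steps run when the kill happens via do_exit()
inside the syscall versus via a fatal signal that is only processed on
the way back to user mode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model; none of these names are kernel APIs. */
enum kill_model {
	KILL_VIA_DO_EXIT,	/* old: do_exit() called from inside the syscall */
	KILL_VIA_SIGNAL		/* new: fatal signal queued, handled on return */
};

struct trace {
	char log[128];		/* comma-separated list of events, in order */
};

/* Append one event name to the trace, truncating if the buffer is full. */
static void trace_add(struct trace *t, const char *event)
{
	assert(t != NULL && event != NULL);
	if (t->log[0] != '\0')
		strncat(t->log, ",", sizeof(t->log) - strlen(t->log) - 1);
	strncat(t->log, event, sizeof(t->log) - strlen(t->log) - 1);
}

/* Models the syscall table call; returns false if the task died in it. */
static bool model_dispatch(struct trace *t, enum kill_model m)
{
	trace_add(t, "dispatch");
	if (m == KILL_VIA_DO_EXIT) {
		trace_add(t, "do_exit");
		return false;	/* nothing after the dispatch runs */
	}
	return true;		/* fatal signal is pending, task still runs */
}

/* Models one syscall that gets killed, recording the steps in order. */
static void model_syscall(struct trace *t, enum kill_model m)
{
	t->log[0] = '\0';
	if (!model_dispatch(t, m))
		return;
	trace_add(t, "exit_report");	/* stands in for audit/trace/ptrace */
	trace_add(t, "get_signal");	/* signals processed after exit work */
	trace_add(t, "do_group_exit");
}
```

Calling model_syscall() with KILL_VIA_DO_EXIT leaves "dispatch,do_exit"
in the log, while KILL_VIA_SIGNAL leaves
"dispatch,exit_report,get_signal,do_group_exit": the exit report shows up
before the task dies, which is the extra event under discussion.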

> It makes a certain kind of sense that the per syscall work happens
> before we do additional things like process signals.  I simply
> had not realized that was happening in that order until now.
> 
> 
> > (an aside: where do PF_IO_WORKER threads die?)
> 
> They are calling do_exit explicitly.

Ah-ha, thanks.

> 
> >> I expect if anything else mattered we would need to change it to
> >> HANDLER_EXIT.
> >> 
> >> I made a mistake conflating two cases and I want to make certain I
> >> successfully separate those two cases at the end of the day.
> >
> > For skipping the exit work, I'm not sure it matters, since all the
> > signal stuff is "too late"...
> 
> The conflation led me to believe that we could simply and safely cause
> seccomp to use normal signal delivery to kill the process.  The first
> part of the conflation I sorted out by introducing HANDLER_EXIT.  The
> user visible part of the change I am not yet certain what to do with.
> 
> My gut reaction is: does it matter?  Can you escape the seccomp filter
> with a stop?  Does it break userspace?

After fixing UNKILLABLE vs IMMUTABLE, I'm not aware of anything else
misbehaving. The new nonsense exit event, though, is bound to be at
least confusing to humans. ("Why did this syscall not change any of its
registers?", etc.)

> I realize the outcome of that question is that it does matter so we
> probably need to find a way to suppress that situation for HANDLER_EXIT.
> Both force_exit_sig and force_sig_seccomp appear to be using dumpable
> signals which makes the problem doubly tricky.
> 
> The first tricky bit is fatal_signal_pending isn't set because a
> coredump is possible, so something else is needed to detect this
> condition.
> 
> The second part is what to do when we detect the condition.
> 
> The only solution I can think of quickly is to modify
> force_sig_info_to_task to clear TIF_SYSCALL_WORK on the architectures where
> that is used and to clear SYSCALL_WORK_EXIT on x86 and s390, and to do
> whatever the architecture appropriate thing is on the other
> architectures.

The common accessors for the bits are set_syscall_work()/clear_syscall_work()
but I don't see anything to operate on an entire mask. Maybe it needs to
grow something like reset_syscall_work()?
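
As a very rough idea of what a mask-wide helper could look like, here is
a userspace model in C11. To be clear about the assumptions: this is not
kernel code, reset_syscall_work() does not exist today, and the
model_* names, the bit numbering, and the contents of MODEL_WORK_EXIT
are all invented for the sketch (loosely following the SYSCALL_WORK_*
naming). The per-bit helpers mirror the existing set/clear accessors;
the reset helper clears a whole mask in one atomic operation.

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative bit numbers only; not the kernel's real values. */
enum model_work_bit {
	MODEL_WORK_BIT_SECCOMP,
	MODEL_WORK_BIT_TRACEPOINT,
	MODEL_WORK_BIT_TRACE,
	MODEL_WORK_BIT_AUDIT,
	MODEL_WORK_BIT_COUNT
};

#define MODEL_WORK(bit)	(1UL << (bit))

/* Assumed set of bits that trigger syscall exit reporting. */
#define MODEL_WORK_EXIT	(MODEL_WORK(MODEL_WORK_BIT_TRACEPOINT) | \
			 MODEL_WORK(MODEL_WORK_BIT_TRACE) | \
			 MODEL_WORK(MODEL_WORK_BIT_AUDIT))

/* Stand-in for the per-thread syscall work word. */
struct model_thread_info {
	atomic_ulong syscall_work;
};

/* Per-bit accessors, in the spirit of set/clear_syscall_work(). */
static void model_set_syscall_work(struct model_thread_info *ti,
				   enum model_work_bit bit)
{
	assert(bit < MODEL_WORK_BIT_COUNT);
	atomic_fetch_or(&ti->syscall_work, MODEL_WORK(bit));
}

static void model_clear_syscall_work(struct model_thread_info *ti,
				     enum model_work_bit bit)
{
	assert(bit < MODEL_WORK_BIT_COUNT);
	atomic_fetch_and(&ti->syscall_work, ~MODEL_WORK(bit));
}

/* Hypothetical mask-wide helper: clear every bit in mask, return old value. */
static unsigned long model_reset_syscall_work(struct model_thread_info *ti,
					      unsigned long mask)
{
	return atomic_fetch_and(&ti->syscall_work, ~mask);
}

/* Nonzero if any exit-reporting work is still pending. */
static int model_exit_work_pending(struct model_thread_info *ti)
{
	return (atomic_load(&ti->syscall_work) & MODEL_WORK_EXIT) != 0;
}
```

In this model, the HANDLER_EXIT path would call
model_reset_syscall_work(ti, MODEL_WORK_EXIT) once, after which
model_exit_work_pending() reports nothing left to do while unrelated
bits (seccomp here) stay set. Whether the real thing should clear only
the exit bits or the whole word is exactly the open question.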

-Kees

> 
> It might be easier once I have cleaned some more code up but as long as
> we can limit the logic to force_sig_info_to_task for the HANDLER_EXIT
> case I don't see any advantage in the cleanups I have planned for this
> case.  This may be incentive to make the architecture code more uniform
> in this area.
> 
> Anyway I am off to finish fixing some other bugs.
> 
> Eric
> 
> 

-- 
Kees Cook