linux-kernel - Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <04f2853d-3bdb-8893-91c8-074893310d1d@oracle.com>
Date:   Fri, 19 May 2023 18:24:34 -0500
From:   Mike Christie <michael.christie@...cle.com>
To:     "Eric W. Biederman" <ebiederm@...ssion.com>
Cc:     Oleg Nesterov <oleg@...hat.com>, linux@...mhuis.info,
        nicolas.dichtel@...nd.com, axboe@...nel.dk,
        torvalds@...ux-foundation.org, linux-kernel@...r.kernel.org,
        virtualization@...ts.linux-foundation.org, mst@...hat.com,
        sgarzare@...hat.com, jasowang@...hat.com, stefanha@...hat.com,
        brauner@...nel.org
Subject: Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if
 SIGNAL_GROUP_EXIT/group_exec_task is set

On 5/18/23 11:16 PM, Eric W. Biederman wrote:
> Mike Christie <michael.christie@...cle.com> writes:
> 
>> On 5/18/23 1:28 PM, Eric W. Biederman wrote:
>>> Still the big issue seems to be the way get_signal is connected into
>>> these threads so that it keeps getting called.  Calling get_signal after
>>> a fatal signal has been returned happens nowhere else and even if we fix
>>> it today it is likely to lead to bugs in the future because whoever is
>>> testing and updating the code is unlikely they have a vhost test case
>>> the care about.
>>>
>>> diff --git a/kernel/signal.c b/kernel/signal.c
>>> index 8f6330f0e9ca..4d54718cad36 100644
>>> --- a/kernel/signal.c
>>> +++ b/kernel/signal.c
>>> @@ -181,7 +181,9 @@ void recalc_sigpending_and_wake(struct task_struct *t)
>>>  
>>>  void recalc_sigpending(void)
>>>  {
>>> -       if (!recalc_sigpending_tsk(current) && !freezing(current))
>>> +       if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
>>> +           ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
>>> +                   !__fatal_signal_pending(current)))
>>>                 clear_thread_flag(TIF_SIGPENDING);
>>>  
>>>  }
>>> @@ -1043,6 +1045,13 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
>>>                  * This signal will be fatal to the whole group.
>>>                  */
>>>                 if (!sig_kernel_coredump(sig)) {
>>> +                       /*
>>> +                        * The signal is being short circuit delivered
>>> +                        * don't it pending.
>>> +                        */
>>> +                       if (type != PIDTYPE_PID) {
>>> +                               sigdelset(&t->signal->shared_pending,  sig);
>>> +
>>>                         /*
>>>                          * Start a group exit and wake everybody up.
>>>                          * This way we don't have other threads
>>>
>>
>> If I change up your patch so the last part is moved down a bit to when we set t
>> like this:
>>
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 0ac48c96ab04..c976a80650db 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -181,9 +181,10 @@ void recalc_sigpending_and_wake(struct task_struct *t)
>>  
>>  void recalc_sigpending(void)
>>  {
>> -	if (!recalc_sigpending_tsk(current) && !freezing(current))
>> +	if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
>> +	    ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
>> +	     !__fatal_signal_pending(current)))
>>  		clear_thread_flag(TIF_SIGPENDING);
>> -
> Can we get rid of this suggestion to recalc_sigpending.  The more I look
> at it the more I am convinced it is not safe.  In particular I believe
> it is incompatible with dump_interrupted() in fs/coredump.c


With your clear_thread_flag call in vhost_worker suggestion I don't need
the above chunk.


> 
> The code in fs/coredump.c is the closest code we have to what you are
> trying to do with vhost_worker after the session is killed.  It also
> struggles with TIF_SIGPENDING getting set. 
>>  }
>>  EXPORT_SYMBOL(recalc_sigpending);
>>  
>> @@ -1053,6 +1054,17 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
>>  			signal->group_exit_code = sig;
>>  			signal->group_stop_count = 0;
>>  			t = p;
>> +			/*
>> +			 * The signal is being short circuit delivered
>> +			 * don't it pending.
>> +			 */
>> +			if (type != PIDTYPE_PID) {
>> +				struct sigpending *pending;
>> +
>> +				pending = &t->signal->shared_pending;
>> +				sigdelset(&pending->signal, sig);
>> +			}
>> +
>>  			do {
>>  				task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
>>  				sigaddset(&t->pending.signal, SIGKILL);
>>
>>
>> Then get_signal() works like how Oleg mentioned it should earlier.
> 
> I am puzzled it makes a difference as t->signal and p->signal should
> point to the same thing, and in fact the code would more clearly read
> sigdelset(&signal->shared_pending, sig);


Yeah either should work. The original patch had used t before it was
set so my patch just moved it down to after we set it. I just used signal
like you wrote and it works fine.


> 
> But all of that seems minor.
> 
>> For vhost I just need the code below which is just Linus's patch plus a call
>> to get_signal() in vhost_worker() and the PF_IO_WORKER->PF_USER_WORKER change.
>>
>> Note that when we get SIGKILL, the vhost file_operations->release function is called via
>>
>>             do_exit -> exit_files -> put_files_struct -> close_files
>>
>> and so the vhost release function starts to flush IO and stop the worker/vhost
>> task. In vhost_worker() then we just handle those last completions for already
>> running IO. When  the vhost release function detects they are done it does
>> vhost_task   _stop() and vhost_worker() returns and then vhost_task_fn() does do_exit().
>> So we don't return immediately when get_signal() returns non-zero.
>>
>> So it works, but it sounds like you don't like vhost relying on the behavior,
>> and it's non standard to use get_signal() like we are. So I'm not sure how we
>> want to proceed.
> 
> Let me clarify my concern.
> 
> Your code modifies get_signal as:
>  		/*
> -		 * PF_IO_WORKER threads will catch and exit on fatal signals
> +		 * PF_USER_WORKER threads will catch and exit on fatal signals
>  		 * themselves. They have cleanup that must be performed, so
>  		 * we cannot call do_exit() on their behalf.
>  		 */
> -		if (current->flags & PF_IO_WORKER)
> +		if (current->flags & PF_USER_WORKER)
>  			goto out;
>  		/*
>  		 * Death signals, no core dump.
>  		 */
>  		do_group_exit(ksig->info.si_signo);
>  		/* NOTREACHED */
> 
> Which means by modifying get_signal you are logically deleting the
> do_group_exit from get_signal.  As far as that goes that is a perfectly
> reasonable change.  The problem is you wind up calling get_signal again
> after that.  That does not make sense.
> 
> I would suggest doing something like:

I see. I've run some tests today and what you suggested for vhost_worker
and your signal change and it works for SIGKILL/STOP/CONT and freeze.

> 
> What is the diff below?  It does not appear to a revert diff.

It was just the most simple patch that was needed with your signal changes
(and the PF_IO_WORKER -> PF_USER_WORKER signal change) to fix the 2
regressions reported. I wanted to give the vhost devs an idea of what was
needed with your signal changes.

Let me do some more testing over the weekend and I'll post a RFC with your
signal change and the minimal changes needed to vhost to handle the 2
regressions that were reported. The vhost developers can get a better idea
of what needs to be done and they can better decide what they want to do to
proceed.