Date:   Mon, 26 Sep 2022 15:32:50 +0800
From:   Zhang Qiao <zhangqiao22@...wei.com>
To:     "Eric W. Biederman" <ebiederm@...ssion.com>,
        Thomas Gleixner <tglx@...utronix.de>
CC:     lkml <linux-kernel@...r.kernel.org>, <keescook@...omium.org>,
        "Peter Zijlstra" <peterz@...radead.org>, <elver@...gle.com>,
        <legion@...nel.org>, <oleg@...hat.com>, <brauner@...nel.org>
Subject: Re: Question about kill a process group



On 2022/5/13 2:23, Eric W. Biederman wrote:
> Thomas Gleixner <tglx@...utronix.de> writes:
> 
>> On Wed, May 11 2022 at 13:33, Eric W. Biederman wrote:
>>> Thomas Gleixner <tglx@...utronix.de> writes:
>>>> So unless the number of PIDs for a user is limited this _is_ an
>>>> unprivileged DoS vector.
>>>
>>> After having slept on this a bit it finally occurred to me the
>>> semi-obvious solution to this issue is to convert tasklist_lock
>>> from a rw-spinlock to rw-semaphore.  The challenge is finding
>>> the users (tty layer?) that generate signals from interrupt
>>> context and redirect that signal generation.
>>
>> From my outdated notes where I looked at this before:
>>
>> [soft]interrupt context which acquires tasklist_lock:
>> sysrq-e		send_sig_all()
>> sysrq-i         send_sig_all()
>> sysrq-n		normalize_rt_tasks()
>>
>> tasklist_lock nesting into other locks:
>>    fs/fcntl.c: send_sigio(), send_sigurg()
>>
>>    send_sigurg() is called from the network stack ...
>>
>> Some very obscure stuff in arch/ia64/kernel/mca.c which is called
>> from a DIE notifier.
> 
> I think we are very close to the point that if ia64 is the only
> problem case we can just do "git rm arch/ia64".  I am not certain
> there is even anyone left who cares enough to report breakage
> on ia64.
> 
>> Plus quite a bunch of read_lock() instances which nest inside
>> rcu_read_lock() held sections.
>>
>> This is probably incomplete, but the scope of the problem has been
>> greatly reduced vs. the point where I looked at it last time a couple of
>> years ago. But that's still a herculean task.
> 
> I won't argue.
> 
>>> Once signals holding tasklist_lock are no longer generated from
>>> interrupt context irqs no longer need to be disabled and
>>> after verifying tasklist_lock isn't held under any other spinlocks
>>> it can be converted to a semaphore.
>>
>> Going to take a while. :)
> 
> It is a very tractable problem that people can work on incrementally.
> 
>>> It won't help the signal delivery times, but it should reduce
>>> the effect on the rest of the system, and prevent watchdogs from
>>> firing.
>>
>> The signal delivery time itself is the least of the worries, but this
>> still prevents any other operations which require tasklist_lock from
>> making progress for quite some time, i.e. fork/exec for unrelated
>> processes/users will have to wait too. So you converted the 'visible'
>> DoS to an 'invisible' one.
>>
>> The real problem is that the scope of tasklist_lock is too broad for
>> most use cases. That does not change when you actually can convert it to
>> a rwsem. The underlying problem still persists.
>>
>> Let's take a step back and look what most sane use cases (sysrq-* is not
>> in that category) require:
>>
>>    Preventing that tasks are added or removed
>>
>> Do they require that any task is added or removed? No.
>>
>>    They require to prevent add/remove for the intended scope.
>>
>> That's the thing we need to focus on: reducing the protection scope.
>>
>> If we can segment the protection for the required scope of e.g. kill(2)
>> then we still can let unrelated processes/tasks make progress and just
>> inflict the damage on the affected portion of processes/tasks.
>>
>> For example:
>>
>> 	read_lock(&tasklist_lock);
>> 	for_each_process(p) {
>> 		if (task_pid_vnr(p) > 1 &&
>> 		    !same_thread_group(p, current)) {
>>
>> 			group_send_sig_info(...., p);
>> 		}
>> 	}
>> 	read_unlock(&tasklist_lock);
>>
>> same_thread_group() does:
>>
>>    return p->signal == current->signal;
> 
> Yes.  So the sender can not send a signal to itself.
> Basically it is a test to see if a thread is a member of a process.
> 
>> Ideally we can do:
>>
>> 	read_lock(&tasklist_lock);
>>         prevent_add_remove(current->signal);
>> 	read_unlock(&tasklist_lock);
>>
>>         rcu_read_lock();
>> 	for_each_process(p) {
>> 		if (task_pid_vnr(p) > 1 &&
>> 		    !same_thread_group(p, current)) {
>>
>> 			group_send_sig_info(...., p);
>> 		}
>> 	}
>>         rcu_read_unlock();
>>
>>         allow_add_remove(current->signal);
>>
>> Where prevent_add_remove() sets a state which has to be waited for to be
>> cleared by anything which wants to add/remove a task in that scope or
>> change $relatedtask->signal until allow_add_remove() removes that
>> blocker. I'm sure it's way more complicated, but you get the idea.
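The prevent_add_remove()/allow_add_remove() pair sketched above could be modeled in userspace roughly as below (plain C with pthreads; `signal_scope` and all function names are hypothetical stand-ins for the per-`signal_struct` state, not existing kernel API):

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical per-scope state standing in for current->signal. */
struct signal_scope {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool blocked;	/* set while a group signal is being delivered */
};

/* prevent_add_remove(): mark the scope so add/remove must wait. */
static void prevent_add_remove(struct signal_scope *s)
{
	pthread_mutex_lock(&s->lock);
	while (s->blocked)		/* serialize concurrent senders */
		pthread_cond_wait(&s->cond, &s->lock);
	s->blocked = true;
	pthread_mutex_unlock(&s->lock);
}

/* allow_add_remove(): lift the blocker and wake waiting fork/exit paths. */
static void allow_add_remove(struct signal_scope *s)
{
	pthread_mutex_lock(&s->lock);
	s->blocked = false;
	pthread_cond_broadcast(&s->cond);
	pthread_mutex_unlock(&s->lock);
}

/* What a fork()/exit() path inside this scope would do before proceeding. */
static void wait_for_add_remove(struct signal_scope *s)
{
	pthread_mutex_lock(&s->lock);
	while (s->blocked)
		pthread_cond_wait(&s->cond, &s->lock);
	pthread_mutex_unlock(&s->lock);
}
```

The point of the model: only tasks that share the scope ever touch this state, so unrelated fork/exec never contends on it, unlike a global tasklist_lock.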
> 
> Hmm.
> 
> For sending signals what is needed is the guarantee that the signal is
> sent to an atomic snapshot of the appropriate group of processes so that
> SIGKILL sent to the group will reliably kill all of the processes.  It
> should be ok for a process to exit on its own from the group.  As long
> as it logically looks like the process exited before the signal was
> sent.
> 
> There is also ptrace_attach/__ptrace_unlink, reparenting,
> kill_orphaned_pgrp, zap_pid_ns_processes, and pid hash table
> maintenance in release_task.
> 
> I have a patch I am playing with that protects task->parent and
> task->real_parent with siglock and with a little luck that can
> be generalized so that sending signals to parents and ptrace don't
> need tasklist_lock at all.
> 
> For reparenting of children the new parents list of children
> needs protection but that should not need tasklist lock.
> 
> For kill_orphaned_pgrp, with some additional per-process-group
> maintenance state so that will_become_orphaned_pgrp and has_stopped_jobs
> don't need to traverse the process group, it should be possible to
> just have it send the process group signal.
> 
> zap_pid_ns_processes is already called without the tasklist_lock.
> 
> Maintenance of the pid hash table certainly needs a write lock in
> __exit_signal but it doesn't need to be tasklist_lock.
> 
> Which is a long way of saying semantically all we need is to
> prevent_addition to the group of processes a signal will be sent to.  We
> have one version of that prevention today in fork where it tests
> fatal_signal_pending after taking tasklist_lock and siglock.  For the
> case you are describing the code would just need to check each group of
> processes the new process is put into.
> 
> 
> Hmm.
> 
> When I boil it all down in my head I wind up with something like:
> 
> 	rwlock_t *lock = signal_lock(enum pid_type type);
>         read_lock(lock);
>         	/* Do the work of sending the signal */
>         read_unlock(lock);
> 
> With fork needing to grab all of those possible locks for write
> as it adds the process to the group.
> 
> Maybe it could be something like:
> 
> 	struct group_signal {
>         	struct hlist_node node;
>                 struct kernel_siginfo *info;
>         };
> 
> 	void start_group_signal(struct group_signal *group, struct
> 				kernel_siginfo *info, enum pid_type type);
> 	void end_group_signal(struct group_signal *group);
> 
> 	struct group_signal group_sig;
> 	start_group_signal(&group_sig, info, PIDTYPE_PGID);
> 
> 	/* Walk through the list and deliver the signal */
> 
> 	end_group_signal(&group_sig);
> 
> That would allow fork to see all signals that are being delivered to a
> group even if the signal has not been delivered to the parent process
> yet.  At which point the signal could be delivered to the parent before
> the fork.  I just need something to ensure that the signal delivery loop
> between start_group_signal and end_group_signal skips processes that
> hurried up and delivered the signal to themselves, and does not
> deliver to newly added processes.  A generation counter perhaps.
> 
> There is a lot to flesh out, and I am buried alive in other cleanups
> but I think that could work, and remove the need to hold tasklist_lock
> during signal delivery.
> 
> 
>> If we find a solution to this scope reduction problem, then it will not
>> only squash the issue which started this discussion. This will have a
>> benefit in general.
> 
> We need to go farther than simple scope reduction to benefit the
> original problem, as all of the processes in that problem were
> sending a signal to the same process group.  So they all needed
> to wait for each other.
> 
> If we need to block adds, then the adds effectively need to take a
> write_lock against the read_lock taken during signal delivery.  Because
> all of the blocking is the same, we won't see an improvement in
> the original problem.
> 
> If in addition to scope reduction, a barrier is implemented so that
> it is guaranteed that past a certain point processes will see the signal
> before they fork (or do anything else that userspace could tell the
> signal was not delivered atomically) then I think we can eliminate
> blocking in the same places and an improvement in the issue that
> started this discussion can be seen.
> 
> 
> I will queue it up on my list of things I would like to do.  I am


Hi Eric,

Do you have any plans to fix this?  I look forward to your patches.

Thanks!
-Zhang Qiao



> buried in other signal-related cleanups at the moment so I don't know
> when I will be able to get to anything like that.  But I really
> appreciate the idea.
> 
> Eric
> 
> 
