Message-ID: <87mtfmhap2.fsf@email.froward.int.ebiederm.org>
Date:   Thu, 12 May 2022 13:23:21 -0500
From:   "Eric W. Biederman" <ebiederm@...ssion.com>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     Zhang Qiao <zhangqiao22@...wei.com>,
        lkml <linux-kernel@...r.kernel.org>, keescook@...omium.org,
        Peter Zijlstra <peterz@...radead.org>, elver@...gle.com,
        legion@...nel.org, oleg@...hat.com, brauner@...nel.org
Subject: Re: Question about kill a process group

Thomas Gleixner <tglx@...utronix.de> writes:

> On Wed, May 11 2022 at 13:33, Eric W. Biederman wrote:
>> Thomas Gleixner <tglx@...utronix.de> writes:
>>> So unless the number of PIDs for a user is limited this _is_ an
>>> unprivileged DoS vector.
>>
>> After having slept on this a bit it finally occurred to me that the
>> semi-obvious solution to this issue is to convert tasklist_lock
>> from a rw-spinlock to rw-semaphore.  The challenge is finding
>> the users (tty layer?) that generate signals from interrupt
>> context and redirect that signal generation.
>
> From my outdated notes where I looked at this before:
>
> [soft]interrupt context which acquires tasklist_lock:
> sysrq-e		send_sig_all()
> sysrq-i         send_sig_all()
> sysrq-n		normalize_rt_tasks()
>
> tasklist_lock nesting into other locks:
>    fs/fcntl.c: send_sigio(), send_sigurg()
>
>    send_sigurg() is called from the network stack ...
>
> Some very obscure stuff in arch/ia64/kernel/mca.c which is called
> from a DIE notifier.

I think we are very close to the point where, if ia64 is the only
problematic user, we can just do "git rm arch/ia64".  I am not certain
there is even anyone left who cares enough to report breakage
on ia64.

> Plus quite a bunch of read_lock() instances which nest inside
> rcu_read_lock() held sections.
>
> This is probably incomplete, but the scope of the problem has been
> greatly reduced vs. the point where I looked at it last time a couple of
> years ago. But that's still a herculean task.

I won't argue.

>> Once signals holding tasklist_lock are no longer generated from
>> interrupt context, irqs no longer need to be disabled, and
>> after verifying tasklist_lock isn't held under any other spinlocks
>> it can be converted to a semaphore.
>
> Going to take a while. :)

It is a very tractable problem that people can work on incrementally.

>> It won't help the signal delivery times, but it should reduce
>> the effect on the rest of the system, and prevent watchdogs from
>> firing.
>
> The signal delivery time itself is the least of the worries, but this
> still prevents any other operations which require tasklist_lock from
> making progress for quite some time, i.e. fork/exec for unrelated
> processes/users will have to wait too. So you converted the 'visible'
> DoS to an 'invisible' one.
>
> The real problem is that the scope of tasklist_lock is too broad for
> most use cases. That does not change when you actually can convert it to
> a rwsem. The underlying problem still persists.
>
> Let's take a step back and look what most sane use cases (sysrq-* is not
> in that category) require:
>
>    Preventing that tasks are added or removed
>
> Do they require that any task is added or removed? No.
>
>    They require to prevent add/remove for the intended scope.
>
> That's the thing we need to focus on: reducing the protection scope.
>
> If we can segment the protection for the required scope of e.g. kill(2)
> then we still can let unrelated processes/tasks make progress and just
> inflict the damage on the affected portion of processes/tasks.
>
> For example:
>
> 	read_lock(&tasklist_lock);
> 	for_each_process(p) {
> 		if (task_pid_vnr(p) > 1 &&
> 		    !same_thread_group(p, current)) {
>
> 			group_send_sig_info(...., p);
> 		}
> 	}
> 	read_unlock(&tasklist_lock);
>
> same_thread_group() does:
>
>    return p->signal == current->signal;

Yes.  So the sender cannot send a signal to itself.
Basically it is a test of whether a thread is a member of a process.

> Ideally we can do:
>
> 	read_lock(&tasklist_lock);
>         prevent_add_remove(current->signal);
> 	read_unlock(&tasklist_lock);
>
>         rcu_read_lock();
> 	for_each_process(p) {
> 		if (task_pid_vnr(p) > 1 &&
> 		    !same_thread_group(p, current)) {
>
> 			group_send_sig_info(...., p);
> 		}
> 	}
>         rcu_read_unlock();
>
>         allow_add_remove(current->signal);
>
> Where prevent_add_remove() sets a state which has to be waited for to be
> cleared by anything which wants to add/remove a task in that scope or
> change $relatedtask->signal until allow_add_remove() removes that
> blocker. I'm sure it's way more complicated, but you get the idea.

Hmm.
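
For concreteness, here is a minimal sketch of what prevent_add_remove() /
allow_add_remove() could look like.  All names and fields are hypothetical
(a blocker count plus a waitqueue hung off signal_struct); this is a sketch
of the idea, not a proposal:

	/* Hypothetical fields added to struct signal_struct:
	 *	atomic_t		add_remove_blocked;
	 *	wait_queue_head_t	add_remove_wq;
	 */
	static void prevent_add_remove(struct signal_struct *sig)
	{
		atomic_inc(&sig->add_remove_blocked);
	}

	static void allow_add_remove(struct signal_struct *sig)
	{
		if (atomic_dec_and_test(&sig->add_remove_blocked))
			wake_up_all(&sig->add_remove_wq);
	}

	/* Anything that wants to add/remove a task in that scope,
	 * or change a related task->signal, waits first: */
	static void wait_for_add_remove(struct signal_struct *sig)
	{
		wait_event(sig->add_remove_wq,
			   atomic_read(&sig->add_remove_blocked) == 0);
	}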

For sending signals what is needed is the guarantee that the signal is
sent to an atomic snapshot of the appropriate group of processes so that
SIGKILL sent to the group will reliably kill all of the processes.  It
should be ok for a process to exit on its own from the group.  As long
as it logically looks like the process exited before the signal was
sent.

There is also ptrace_attach/__ptrace_unlink, reparenting,
kill_orphaned_pgrp, zap_pid_ns_processes, and pid hash table
maintenance in release_task.

I have a patch I am playing with that protects task->parent and
task->real_parent with siglock and with a little luck that can
be generalized so that sending signals to parents and ptrace don't
need tasklist_lock at all.
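
As a rough illustration of that direction (not the actual patch;
set_task_parent() is a made-up helper):

	/* Hypothetical: update the parent pointers under the child's
	 * siglock instead of under tasklist_lock. */
	static void set_task_parent(struct task_struct *child,
				    struct task_struct *new_parent)
	{
		unsigned long flags;

		if (lock_task_sighand(child, &flags)) {
			rcu_assign_pointer(child->real_parent, new_parent);
			rcu_assign_pointer(child->parent, new_parent);
			unlock_task_sighand(child, &flags);
		}
	}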

For reparenting of children, the new parent's list of children
needs protection, but that should not need tasklist_lock.

For kill_orphaned_pgrp, with some additional per-process-group
maintenance state (so that will_become_orphaned_pgrp and has_stopped_jobs
don't need to traverse the process group), it should be possible to
just have it act as a sender of a process group signal.

zap_pid_ns_processes is already called without the tasklist_lock.

Maintenance of the pid hash table certainly needs a write lock in
__exit_signal, but it doesn't need to be tasklist_lock.

Which is a long way of saying that, semantically, all we need is to
prevent additions to the group of processes a signal will be sent to.  We
have one version of that prevention today in fork, where it tests
fatal_signal_pending after taking tasklist_lock and siglock.  For the
case you are describing, the code would just need to check each group of
processes the new process is put into.
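
That existing check in copy_process() looks roughly like this
(condensed from kernel/fork.c):

	write_lock_irq(&tasklist_lock);
	/* ... */
	spin_lock(&current->sighand->siglock);

	/* Let kill terminate clone/fork in the middle */
	if (fatal_signal_pending(current)) {
		retval = -EINTR;
		goto bad_fork_cancel_cgroup;
	}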


Hmm.

When I boil it all down in my head I wind up with something like:

	rwlock_t *lock = signal_lock(enum pid_type type);
	read_lock(lock);
		/* Do the work of sending the signal */
	read_unlock(lock);

With fork needing to grab all of those possible locks for write
as it adds the process to the group.
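
A sketch of how that could fit together (hypothetical: a task argument
is added to signal_lock() for concreteness, each struct pid grows a
made-up group_lock rwlock, and lock ordering details are ignored):

	static rwlock_t *signal_lock(struct task_struct *p, enum pid_type type)
	{
		return &task_pid_type(p, type)->group_lock;
	}

	/* fork: take every group the new task joins for write
	 * while linking it in. */
	static void lock_task_groups(struct task_struct *parent)
	{
		enum pid_type type;

		for (type = 0; type < PIDTYPE_MAX; type++)
			write_lock(signal_lock(parent, type));
	}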

Maybe it could be something like:

	struct group_signal {
		struct hlist_node node;
		struct kernel_siginfo *info;
	};

	void start_group_signal(struct group_signal *group,
				struct kernel_siginfo *info,
				enum pid_type type);
	void end_group_signal(struct group_signal *group);

	struct group_signal group_sig;
	start_group_signal(&group_sig, info, PIDTYPE_PGID);

	/* Walk through the list and deliver the signal */

	end_group_signal(&group_sig);

That would allow fork to see all signals that are being delivered to a
group even if the signal has not been delivered to the parent process
yet.  At which point the signal could be delivered to the parent before
the fork.  I just need something to ensure that the signal delivery loop
between start_group_signal and end_group_signal skips processes that
hurried up and delivered the signal to themselves, and does not
deliver to newly added processes.  A generation counter perhaps.
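
One hypothetical shape for that generation check (purely illustrative;
join_gen, seen_gen, and gs->generation are all made-up fields):

	static bool group_signal_skip(struct task_struct *p,
				      struct group_signal *gs)
	{
		/* Added to the group after the snapshot was taken: */
		if (p->signal->join_gen > gs->generation)
			return true;
		/* Already hurried up and delivered the signal to itself: */
		if (p->signal->seen_gen >= gs->generation)
			return true;
		return false;
	}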

There is a lot to flesh out, and I am buried alive in other cleanups,
but I think that could work, and remove the need to hold tasklist_lock
during signal delivery.


> If we find a solution to this scope reduction problem, then it will not
> only squash the issue which started this discussion. This will have a
> benefit in general.

We need to go farther than simple scope reduction to benefit the
original problem, as all of the processes in that problem were
sending a signal to the same process group.  So they all needed
to wait for each other.

If we need to block adds, then the adds effectively need to take a
write_lock against the read_lock taken during signal delivery.  Because
all of the blocking is the same, we won't see an improvement in
the original problem.

If, in addition to scope reduction, a barrier is implemented so that
it is guaranteed that past a certain point processes will see the signal
before they fork (or do anything else by which userspace could tell the
signal was not delivered atomically), then I think we can eliminate
blocking in those places and see an improvement in the issue that
started this discussion.


I will queue it up on my list of things I would like to do.  I am
buried in other signal-related cleanups at the moment, so I don't know
when I will be able to get to anything like that.  But I really
appreciate the idea.

Eric
