linux-kernel - Re: [PATCH v2 3/3] prlimit: do not grab the tasklist

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <874k6h7s8o.fsf@email.froward.int.ebiederm.org>
Date:   Wed, 05 Jan 2022 16:05:59 -0600
From:   ebiederm@...ssion.com (Eric W. Biederman)
To:     Barret Rhoden <brho@...gle.com>
Cc:     Christian Brauner <christian.brauner@...ntu.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Alexey Gladkov <legion@...nel.org>,
        William Cohen <wcohen@...hat.com>,
        Viresh Kumar <viresh.kumar@...aro.org>,
        Alexey Dobriyan <adobriyan@...il.com>,
        Chris Hyser <chris.hyser@...cle.com>,
        Peter Collingbourne <pcc@...gle.com>,
        Xiaofeng Cao <caoxiaofeng@...ong.com>,
        David Hildenbrand <david@...hat.com>,
        Cyrill Gorcunov <gorcunov@...il.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 3/3] prlimit: do not grab the tasklist_lock

Barret Rhoden <brho@...gle.com> writes:

> Unnecessarily grabbing the tasklist_lock can be a scalability bottleneck
> for workloads that also must grab the tasklist_lock for waiting,
> killing, and cloning.
>
> The tasklist_lock was grabbed to protect tsk->sighand from disappearing
> (becoming NULL).  tsk->signal was already protected by holding a
> reference to tsk.
>
> update_rlimit_cpu() assumed tsk->sighand != NULL.  With this commit, it
> attempts to lock_task_sighand().  However, this means that
> update_rlimit_cpu() can fail.  This only happens when a task is exiting.
> Note that during exec, sighand may *change*, but it will not be NULL.
>
> Prior to this commit, the do_prlimit() ensured that update_rlimit_cpu()
> would not fail by read locking the tasklist_lock and checking tsk->sighand
> != NULL.
>
> If update_rlimit_cpu() fails, there may be other tasks that are not
> exiting that share tsk->signal.  We need to run update_rlimit_cpu() on
> one of them.   We can't "back out" the new rlim - once we unlocked
> task_lock(group_leader), the rlim is essentially changed.
>
> The only other caller of update_rlimit_cpu() is
> selinux_bprm_committing_creds().  It has tsk == current, so
> update_rlimit_cpu() cannot fail (current->sighand cannot disappear
> until current exits).
>
> This change resulted in a 14% speedup on a microbenchmark where parents
> kill and wait on their children, and children getpriority, setpriority,
> and getrlimit.
>
> Signed-off-by: Barret Rhoden <brho@...gle.com>
> ---
>  include/linux/posix-timers.h   |  2 +-
>  kernel/sys.c                   | 32 +++++++++++++++++++++-----------
>  kernel/time/posix-cpu-timers.c | 12 +++++++++---
>  3 files changed, 31 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
> index 5bbcd280bfd2..9cf126c3b27f 100644
> --- a/include/linux/posix-timers.h
> +++ b/include/linux/posix-timers.h
> @@ -253,7 +253,7 @@ void posix_cpu_timers_exit_group(struct task_struct *task);
>  void set_process_cpu_timer(struct task_struct *task, unsigned int clock_idx,
>  			   u64 *newval, u64 *oldval);
>  
> -void update_rlimit_cpu(struct task_struct *task, unsigned long rlim_new);
> +int update_rlimit_cpu(struct task_struct *task, unsigned long rlim_new);
>  
>  void posixtimer_rearm(struct kernel_siginfo *info);
>  #endif
> diff --git a/kernel/sys.c b/kernel/sys.c
> index fb2a5e7c0589..073ae9db192f 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1432,13 +1432,7 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
>  			return -EPERM;
>  	}
>  
> -	/* protect tsk->signal and tsk->sighand from disappearing */
> -	read_lock(&tasklist_lock);
> -	if (!tsk->sighand) {
> -		retval = -ESRCH;
> -		goto out;
> -	}
> -
> +	/* Holding a refcount on tsk protects tsk->signal from disappearing. */
>  	rlim = tsk->signal->rlim + resource;
>  	task_lock(tsk->group_leader);
>  	if (new_rlim) {
> @@ -1467,10 +1461,26 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
>  	 */
>  	if (!retval && new_rlim && resource == RLIMIT_CPU &&
>  	    new_rlim->rlim_cur != RLIM_INFINITY &&
> -	    IS_ENABLED(CONFIG_POSIX_TIMERS))
> -		update_rlimit_cpu(tsk, new_rlim->rlim_cur);
> -out:
> -	read_unlock(&tasklist_lock);
> +	    IS_ENABLED(CONFIG_POSIX_TIMERS)) {
> +		if (update_rlimit_cpu(tsk, new_rlim->rlim_cur)) {
> +			/*
> +			 * update_rlimit_cpu can fail if the task is exiting.
> +			 * We already set the task group's rlim, so we need to
> +			 * update_rlimit_cpu for some other task in the process.
> +			 * If all of the tasks are exiting, then we don't need
> +			 * to update_rlimit_cpu.
> +			 */
> +			struct task_struct *t_i;
> +
> +			rcu_read_lock();
> +			for_each_thread(tsk, t_i) {
> +				if (!update_rlimit_cpu(t_i, new_rlim->rlim_cur))
> +					break;
> +			}
> +			rcu_read_unlock();
> +		}

I look at this and I ask can't we do this better?

Because you are right that if the thread you landed on is exiting this
is a problem.  It is only a problem for prlimit64, as all of the rest
of the calls to do_prlimit happen from current so you know they are not
exiting.

I think the simple solution is just:
	update_rlimit_cpu(tsk->group_leader)

As the group leader is guaranteed to be the last thread of the thread
group to be processed in release_task, and thus the last thread with a
sighand.  Nothing needs to be done if it does not have a sighand.

How does that sound?

> +	}
> +
>  	return retval;
>  }

Eric