linux-kernel - Re: [PATCH v2 2/4] task: Ensure tasks are available for a grace period after leaving the runqueue


Message-ID: <20190915140911.GA17248@paulmck-ThinkPad-P72>
Date:   Sun, 15 Sep 2019 07:09:11 -0700
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     "Eric W. Biederman" <ebiederm@...ssion.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Oleg Nesterov <oleg@...hat.com>,
        Russell King - ARM Linux admin <linux@...linux.org.uk>,
        Chris Metcalf <cmetcalf@...hip.com>,
        Christoph Lameter <cl@...ux.com>,
        Kirill Tkhai <tkhai@...dex.ru>, Mike Galbraith <efault@....de>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...nel.org>,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        Davidlohr Bueso <dave@...olabs.net>,
        Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH v2 2/4] task: Ensure tasks are available for a grace
 period after leaving the runqueue

On Sun, Sep 15, 2019 at 07:07:52AM -0700, Paul E. McKenney wrote:
> On Sat, Sep 14, 2019 at 07:33:58AM -0500, Eric W. Biederman wrote:
> > 
> > In the ordinary case today the rcu grace period for a task_struct is
> > triggered when another process wait's for it's zombie and causes the

Oh, and "waits for its", just to hit the grammar en passant...  ;-)

							Thanx, Paul

> > kernel to call release_task().  As the waiting task has to receive a
> > signal and then act upon it before this happens, typically this will
> > occur after the original task has been removed from the runqueue.
> > 
> > Unfortunately in some cases, such as self reaping tasks, it can be shown
> > that release_task() will be called starting the grace period for
> > task_struct long before the task leaves the runqueue.
> > 
> > Therefore use put_task_struct_rcu_user in finish_task_switch to
> > guarantee that there is an rcu lifetime after the task
> > leaves the runqueue.
> > 
> > Besides the change in the start of the rcu grace period for the
> > task_struct this change may delay perf_event_delayed_put and
> > trace_sched_process_free.  The function perf_event_delayed_put boils
> > down to just a WARN_ON for cases that I assume should never happen.  So
> > I don't see any problem with delaying it.
> > 
> > The function trace_sched_process_free is a trace point and thus
> > visible to user space.  Occasionally userspace has the strangest
> > dependencies so this has a minuscule chance of causing a regression.
> > This change only changes the timing of when the tracepoint is called.
> > The change in timing arguably gives userspace a more accurate picture
> > of what is going on.  So I don't expect there to be a regression.
> > 
> > In the case where a task self reaps we are pretty much guaranteed that
> > the rcu grace period is delayed.  So we should get quite a bit of
> > coverage of this worst case for the change in a normal threaded
> > workload.  So I expect any issues to turn up quickly or not at all.
> > 
> > I have lightly tested this change and everything appears to work
> > fine.
> > 
> > Inspired-by: Linus Torvalds <torvalds@...ux-foundation.org>
> > Inspired-by: Oleg Nesterov <oleg@...hat.com>
> > Signed-off-by: "Eric W. Biederman" <ebiederm@...ssion.com>
> > ---
> >  kernel/fork.c       | 11 +++++++----
> >  kernel/sched/core.c |  2 +-
> >  2 files changed, 8 insertions(+), 5 deletions(-)
> > 
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 9f04741d5c70..7a74ade4e7d6 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -900,10 +900,13 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
> >  	if (orig->cpus_ptr == &orig->cpus_mask)
> >  		tsk->cpus_ptr = &tsk->cpus_mask;
> >  
> > -	/* One for the user space visible state that goes away when reaped. */
> > -	refcount_set(&tsk->rcu_users, 1);
> > -	/* One for the rcu users, and one for the scheduler */
> > -	refcount_set(&tsk->usage, 2);
> > +	/*
> > +	 * One for the user space visible state that goes away when reaped.
> > +	 * One for the scheduler.
> > +	 */
> > +	refcount_set(&tsk->rcu_users, 2);
> 
> OK, this would allow us to add a later decrement-and-test of
> ->rcu_users ...
> 
> > +	/* One for the rcu users */
> > +	refcount_set(&tsk->usage, 1);
> >  #ifdef CONFIG_BLK_DEV_IO_TRACE
> >  	tsk->btrace_seq = 0;
> >  #endif
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 2b037f195473..69015b7c28da 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3135,7 +3135,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> >  		/* Task is done with its stack. */
> >  		put_task_stack(prev);
> >  
> > -		put_task_struct(prev);
> > +		put_task_struct_rcu_user(prev);
> 
> ... which is here.  And this looks to be invoked from the __schedule()
> called from do_task_dead() at the very end of do_exit().
> 
> This looks plausible, but still requires that it no longer be possible to
> enter an RCU read-side critical section that might increment ->rcu_users
> after this point in time.  This might be enforced by a grace period
> between the time that the task was removed from its lists and the current
> time (seems unlikely, though, in that case why bother with call_rcu()?) or
> by some other synchronization.
> 
> On to the next patch!
> 
> 							Thanx, Paul
> 
> >  	}
> >  
> >  	tick_nohz_task_switch();
> > -- 
> > 2.21.0.dirty
> >