Message-ID: <20201014234053.GA86158@lothringen>
Date: Thu, 15 Oct 2020 01:40:53 +0200
From: Frederic Weisbecker <frederic@...nel.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Marcelo Tosatti <mtosatti@...hat.com>,
linux-kernel@...r.kernel.org,
Nitesh Narayan Lal <nitesh@...hat.com>,
Peter Xu <peterx@...hat.com>
Subject: Re: [patch 1/2] nohz: only wakeup a single target cpu when kicking a
task
On Wed, Oct 14, 2020 at 10:33:21AM +0200, Peter Zijlstra wrote:
> On Tue, Oct 13, 2020 at 02:13:28PM -0300, Marcelo Tosatti wrote:
>
> > > Yes but if the task isn't running, run_posix_cpu_timers() doesn't have
> > > anything to elapse. So indeed we can spare the IPI if the task is not
> > > running. Provided ordering makes sure that the task sees the new dependency
> > > when it schedules in of course.
> >
> > True.
> >
> > * p->on_cpu <- { 0, 1 }:
> > *
> > * is set by prepare_task() and cleared by finish_task() such that it will be
> > * set before p is scheduled-in and cleared after p is scheduled-out, both
> > * under rq->lock. Non-zero indicates the task is running on its CPU.
> >
> >
> > CPU-0 (tick_set_dep) CPU-1 (task switch)
> >
> > STORE p->tick_dep_mask
> > smp_mb() (atomic_fetch_or())
> > LOAD p->on_cpu
> >
> >
> > context_switch(prev, next)
> > STORE next->on_cpu = 1
> > ... [*]
> >
> > LOAD current->tick_dep_mask
> >
>
> That load is in tick_nohz_task_switch() right? (which BTW is placed
> completely wrong) You could easily do something like the below I
> suppose.
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 81632cd5e3b7..2a5fafe66bb0 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -410,6 +410,14 @@ void __tick_nohz_task_switch(void)
> ts = this_cpu_ptr(&tick_cpu_sched);
>
> if (ts->tick_stopped) {
> + /*
> + * tick_set_dep() (this)
> + *
> + * STORE p->tick_dep_mask STORE p->on_cpu
> + * smp_mb() smp_mb()
> + * LOAD p->on_cpu LOAD p->tick_dep_mask
> + */
> + smp_mb();
> if (atomic_read(&current->tick_dep_mask) ||
> atomic_read(&current->signal->tick_dep_mask))
> tick_nohz_full_kick();
It would then need to be unconditional (whatever the value of ts->tick_stopped).
Assuming the tick isn't stopped, we may well have an interrupt firing right
after schedule() which doesn't see the new value of tick_dep_mask.
Alternatively, we could rely on p->on_rq, which is set to TASK_ON_RQ_QUEUED
at wakeup time, prior to the schedule() full barrier. Of course that doesn't
mean the task is actually the one running on the CPU, but it's a good sign,
considering that we are running in nohz_full mode, which is usually optimized
for the single-task case.
Also, setting a remote task's tick dependency is only used by the posix cpu
timer code, in case the user has the bad taste to enqueue a timer on a task
running in nohz_full mode. That shouldn't deserve an unconditional full
barrier in the schedule path.

If the target is current, as in RCU's usage, I guess we can keep the special
treatment.
> re tick_nohz_task_switch() being placed wrong, it should probably be
> placed before finish_lock_switch(). Something like so.
>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index cf044580683c..5c92c959824f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4084,6 +4084,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> vtime_task_switch(prev);
> perf_event_task_sched_in(prev, current);
> finish_task(prev);
> + tick_nohz_task_switch();
> finish_lock_switch(rq);
> finish_arch_post_lock_switch();
> kcov_finish_switch(current);
> @@ -4121,7 +4122,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> put_task_struct_rcu_user(prev);
> }
>
> - tick_nohz_task_switch();
IIRC, we wanted to keep it outside the rq lock, though I can't recall exactly why...