Date:	Thu, 23 Jun 2016 10:56:14 +0100
From:	Morten Rasmussen <morten.rasmussen@....com>
To:	Rik van Riel <riel@...hat.com>
Cc:	peterz@...radead.org, mingo@...hat.com, dietmar.eggemann@....com,
	yuyang.du@...el.com, vincent.guittot@...aro.org,
	mgalbraith@...e.de, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 02/13] sched/fair: Consistent use of prev_cpu in
 wakeup path

On Wed, Jun 22, 2016 at 02:04:11PM -0400, Rik van Riel wrote:
> On Wed, 2016-06-22 at 18:03 +0100, Morten Rasmussen wrote:
> > In commit ac66f5477239 ("sched/numa: Introduce migrate_swap()"),
> > select_task_rq() got a 'cpu' argument to enable overriding of
> > prev_cpu in special cases (NUMA task swapping). However, the
> > select_task_rq_fair() helper functions wake_affine() and
> > select_idle_sibling() still use task_cpu(p) directly to work out
> > prev_cpu, which leads to inconsistencies.
> > 
> > This patch passes prev_cpu (potentially overridden by NUMA code) into
> > the helper functions to ensure prev_cpu is indeed the same cpu
> > everywhere in the wakeup path.
> > 
> > cc: Ingo Molnar <mingo@...hat.com>
> > cc: Peter Zijlstra <peterz@...radead.org>
> > cc: Rik van Riel <riel@...hat.com>
> > 
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@....com>
> > ---
> >  kernel/sched/fair.c | 24 +++++++++++++-----------
> >  1 file changed, 13 insertions(+), 11 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c6dd8bab010c..eec8e29104f9 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -656,7 +656,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  }
> >  
> >  #ifdef CONFIG_SMP
> > -static int select_idle_sibling(struct task_struct *p, int cpu);
> > +static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
> >  static unsigned long task_h_load(struct task_struct *p);
> >  
> >  /*
> > @@ -1483,7 +1483,8 @@ static void task_numa_compare(struct task_numa_env *env,
> >  	 * Call select_idle_sibling to maybe find a better one.
> >  	 */
> >  	if (!cur)
> > -		env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu);
> > +		env->dst_cpu = select_idle_sibling(env->p, env->src_cpu,
> > +						   env->dst_cpu);
> 
> It is worth remembering that "prev" will only
> ever be returned by select_idle_sibling() if
> it is part of the same NUMA node as the target.
> 
> That means this patch does not change behaviour
> of the NUMA balancing code, since that always
> migrates between nodes.
> 
> Now let's look at try_to_wake_up(). It will pass
> p->wake_cpu as the argument for "prev_cpu", which
> again appears to be the same CPU number as that used
> by the current code.

IIUC, p->wake_cpu != task_cpu(p) if task_numa_migrate() decided to call
migrate_swap() on the task while it was sleeping, intending it to swap
places with a task on a different NUMA node when it wakes up. Using
p->wake_cpu as "prev_cpu" in select_idle_sibling() when called through
try_to_wake_up()->select_task_rq() should only make a difference if
p->wake_cpu happens to share cache with the target cpu and is itself
idle:

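	/* the cache-affine shortcut that can now pick p->wake_cpu: */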
	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
		return prev;
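
For reference, the sleeping-task side of the swap boils down to this in
__migrate_swap_task() (kernel/sched/core.c) -- quoting roughly from
memory, so the exact code may differ:

	static void __migrate_swap_task(struct task_struct *p, int cpu)
	{
		if (task_on_rq_queued(p)) {
			/* queued: deactivate, set_task_cpu(p, cpu), activate */
		} else {
			/*
			 * Task isn't running anymore; make it appear like we
			 * migrated it before it went to sleep. On wakeup the
			 * previous cpu becomes our target instead of where
			 * the task really is.
			 */
			p->wake_cpu = cpu;
		}
	}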

The selection of the target cpu for select_idle_sibling() is also
slightly affected, as wake_affine() currently compares task_cpu(p) and
smp_processor_id(), and then picks p->wake_cpu or smp_processor_id()
depending on the outcome. With this patch, wake_affine() uses
p->wake_cpu instead of task_cpu(p), so we actually compare the two
candidates we are choosing between.
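
To make that concrete, the affine path in select_task_rq_fair() ends up
looking roughly like this (simplified, so take the exact context with a
grain of salt):

	/* before: wake_affine() derived prev_cpu = task_cpu(p) internally */
	if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
		new_cpu = cpu;

	/* after: the caller's prev_cpu (i.e. p->wake_cpu) is passed down */
	if (cpu != prev_cpu && wake_affine(affine_sd, p, prev_cpu, sync))
		new_cpu = cpu;

so the cpu != prev_cpu check and the load comparison inside
wake_affine() finally agree on what "prev" means.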

I think that would lead to some minor changes in behaviour in a few
corner cases, but I mainly wrote the patch as I thought it was very
confusing that we could have different "prev_cpu"s in different parts of
the select_task_rq_fair() code path.

> 
> I have no objection to your patch, but I must be
> overlooking something, since I cannot find a change
> in behaviour that your patch would create.

Thanks for confirming that it shouldn't change anything for NUMA load
balancing. That is what I hope for :-)
