linux-kernel - Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain


Message-ID: <20160511123345.GD3192@twins.programming.kicks-ass.net>
Date:	Wed, 11 May 2016 14:33:45 +0200
From:	Peter Zijlstra <peterz@...radead.org>
To:	Matt Fleming <matt@...eblueprint.co.uk>
Cc:	mingo@...nel.org, linux-kernel@...r.kernel.org, clm@...com,
	mgalbraith@...e.de, tglx@...utronix.de, fweisbec@...il.com,
	srikar@...ux.vnet.ibm.com, mikey@...ling.org, anton@...ba.org
Subject: Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with
 sched_domain_shared

On Wed, May 11, 2016 at 12:55:56PM +0100, Matt Fleming wrote:
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7842,13 +7842,13 @@ static inline void set_cpu_sd_state_busy
> >  	int cpu = smp_processor_id();
> >  
> >  	rcu_read_lock();
> > -	sd = rcu_dereference(per_cpu(sd_busy, cpu));
> > +	sd = rcu_dereference(per_cpu(sd_llc, cpu));
> >  
> >  	if (!sd || !sd->nohz_idle)
> >  		goto unlock;
> >  	sd->nohz_idle = 0;
> >  
> > -	atomic_inc(&sd->groups->sgc->nr_busy_cpus);
> > +	atomic_inc(&sd->shared->nr_busy_cpus);
> >  unlock:
> >  	rcu_read_unlock();
> >  }
> 
> This breaks my POWER7 box which presumably doesn't have SD_SHARE_PKG_RESOURCES,
> 

Hmm, PPC folks; what does your topology look like?

Currently your sched_domain_topology, as per arch/powerpc/kernel/smp.c,
seems to suggest your cores do not share cache at all.
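
To make the consequence concrete, here is a rough userspace sketch of how
an LLC domain gets picked: walk the topology levels from the innermost
outwards and keep the last one in an unbroken run that carries the
"shares package resources" flag. It is modelled on the idea behind the
kernel's highest_flag_domain() lookup, but every name prefixed sketch_,
the flag value and both example tables are assumptions made up for
illustration; they are not copied from any arch code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag value for this sketch; the kernel defines its own. */
#define SKETCH_SHARE_PKG_RESOURCES 0x1u

/* One topology level, ordered from the innermost level outwards. */
struct sketch_level {
	const char *name;
	unsigned int flags;
};

/* Illustrative topology where the two inner levels share a cache. */
static const struct sketch_level sketch_with_llc[] = {
	{ "SMT", SKETCH_SHARE_PKG_RESOURCES },
	{ "MC",  SKETCH_SHARE_PKG_RESOURCES },
	{ "DIE", 0 },
};

/* Illustrative topology where no level advertises a shared cache. */
static const struct sketch_level sketch_without_llc[] = {
	{ "SMT", 0 },
	{ "DIE", 0 },
};

/*
 * Return the outermost level in an unbroken run, starting at the
 * innermost level, that carries 'flag'. Returns NULL when even the
 * innermost level lacks it, which is the "no LLC" case.
 */
static const struct sketch_level *
sketch_highest_flag_level(const struct sketch_level *lvl, size_t nr,
			  unsigned int flag)
{
	const struct sketch_level *hit = NULL;
	size_t i;

	assert(lvl || nr == 0);

	for (i = 0; i < nr; i++) {
		if (!(lvl[i].flags & flag))
			break;
		hit = &lvl[i];
	}
	return hit;
}
```

With the first table the lookup lands on the "MC" entry; with the second
it returns NULL, so any caller that dereferences the result without
checking it is in trouble.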

https://en.wikipedia.org/wiki/POWER7 seems to agree and states

  "4 MB L3 cache per C1 core"

And http://www-03.ibm.com/systems/resources/systems_power_software_i_perfmgmt_underthehood.pdf
also explicitly draws pictures with the L3 per core.

_however_, that same document describes L3 inter-core fill and lateral
cast-out, which sounds like the L3s work together to form a node wide
caching system.

Do we want to model these co-operative L3 slices as a sort of node-wide
LLC for the purposes of the scheduler?

While we should definitely fix the assumption that an LLC exists (and I
need to look at why it isn't set to the core domain instead as well),
the scheduler does try and scale things by 'assuming' LLC := node.

It does this for NOHZ, and these here patches under discussion would be
doing the same for idle-core state.
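
As a rough illustration of what not assuming an LLC could look like for
the busy-CPU accounting in the quoted hunk, the userspace sketch below
bails out when either the LLC domain or its shared state is missing.
This is not the actual fix: the struct layouts and the sketch_ names are
hypothetical, C11 atomics stand in for the kernel's atomic_t, and RCU is
left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-LLC shared state. */
struct sketch_shared {
	atomic_int nr_busy_cpus;
};

/* Hypothetical stand-in for the domain a CPU would find via sd_llc. */
struct sketch_domain {
	int nohz_idle;
	struct sketch_shared *shared;
};

/*
 * Mark the current CPU busy in its LLC domain.
 * Returns true if the counter was bumped, false if there was nothing
 * to do (no LLC domain, no shared state, or CPU already counted busy).
 */
static bool sketch_set_cpu_busy(struct sketch_domain *llc)
{
	int prev;

	/* Do not assume an LLC domain, or its shared state, exists. */
	if (!llc || !llc->shared || !llc->nohz_idle)
		return false;

	llc->nohz_idle = 0;
	prev = atomic_fetch_add(&llc->shared->nr_busy_cpus, 1);
	assert(prev >= 0);	/* sanity: the count never goes negative */
	return true;
}
```

On a topology without any cache-sharing level the call simply returns
false, so nothing is counted there; whether that is acceptable or a
node-wide LLC should be modelled instead is exactly the open question.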

Would this make sense for POWER, or should we think of something else?