linux-kernel - Re: sched: tweak select_idle_sibling to look for idle threads


Message-ID: <20160503151153.wp6jcnjadmw2ypmx@floor.masoncoding.com>
Date:	Tue, 3 May 2016 11:11:53 -0400
From:	Chris Mason <clm@...com>
To:	Peter Zijlstra <peterz@...radead.org>
CC:	Mike Galbraith <mgalbraith@...e.de>,
	Ingo Molnar <mingo@...nel.org>,
	Matt Fleming <matt@...eblueprint.co.uk>,
	<linux-kernel@...r.kernel.org>
Subject: Re: sched: tweak select_idle_sibling to look for idle threads

On Tue, May 03, 2016 at 04:32:25PM +0200, Peter Zijlstra wrote:
> On Mon, May 02, 2016 at 11:47:25AM -0400, Chris Mason wrote:
> > On Mon, May 02, 2016 at 04:58:17PM +0200, Peter Zijlstra wrote:
> > > On Mon, May 02, 2016 at 04:50:04PM +0200, Mike Galbraith wrote:
> > > > Oh btw, did you know single socket boxen have no sd_busy?  That doesn't
> > > > look right.
> > > 
> > > I suspected; didn't bother looking at it yet. The 'problem' is that the LLC
> > > domain is the top-most, so it doesn't have a parent domain. I'm sure we
> > > can come up with something if we can get this all working right.
> > > 
> > > And yes, I can get gains on various workloads with various options, I
> > > can even break all workloads, but I've so far completely failed on
> > > getting a win for everyone :/
> > 
> > Adding in the task_hot() check to decide if scanning idle was a good
> > idea ended up being really important
> 
> So I'm conflicted on this patch:
> 
> +static int bounce_to_target(struct task_struct *p, int cpu)
> +{
> +       s64 delta;
> +
> +       /*
> +        * as the run queue gets bigger, it's more and more likely that
> +        * balance will have distributed things for us, and less likely
> +        * that scanning all our CPUs for an idle one will find one.
> +        * So, if nr_running > 1, just call this CPU good enough
> +        */
> +       if (cpu_rq(cpu)->cfs.nr_running > 1)
> +               return 1;

The nr_running check is interesting.  It is supposed to give the same
benefit as your "do we have anything idle?" variable, but without having
to constantly update a variable somewhere.  I'll have to do a few runs
to verify (maybe with an idle_scan_failed counter).

> +
> +       /* taken from task_hot() */
> +       delta = rq_clock_task(task_rq(p)) - p->se.exec_start;
> +       return delta < (s64)sysctl_sched_migration_cost;
> +}
> 
> This will work for your schbench workload because it sleeps for 30ms while
> the migration_cost thingy is 500us, therefore you'll trigger the full
> LLC scan.

The task_hot checks don't do much for the sleeping schbench runs, but
they help a lot for this:

# pick a single core, in my case cpus 0,20 are the same core
# cpu_hog is any program that spins
#
taskset -c 20 cpu_hog &

# schbench -p 4 means message passing mode with 4 byte messages (like
# pipe test), no sleeps, just bouncing as fast as it can.
#
# make the scheduler choose between the sibling of the hog and cpu 1
#
taskset -c 0,1 schbench -p 4 -m 1 -t 1

Current mainline will stuff both schbench threads onto CPU 1, leaving
CPU 0 100% idle.  My first patch with the minimal task_hot() checks
would sometimes pick CPU 0.  My second patch that just directly calls
task_hot() sticks to CPU 1, which is ~3x faster than spreading the
threads across both CPUs.

The full task_hot() checks also really help tbench.
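
To make the two cases concrete, here is a rough userspace model of the
check.  It is only a sketch, not the kernel code: skip_idle_scan() and
MIGRATION_COST_NS are made-up names, and the 500us value is simply the
figure quoted above, hard-coded as an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the 500us migration cost quoted in this thread, in ns. */
#define MIGRATION_COST_NS 500000LL

/*
 * Hypothetical userspace model of bounce_to_target().
 * Returns true when the wakeup should stay on the target CPU
 * and skip scanning for an idle one.
 */
static bool skip_idle_scan(unsigned int nr_running, int64_t now_ns,
			   int64_t exec_start_ns)
{
	/* A busy run queue: assume balancing has already spread things. */
	if (nr_running > 1)
		return true;

	/* Same comparison as the task_hot() excerpt: ran recently == hot. */
	return (now_ns - exec_start_ns) < MIGRATION_COST_NS;
}
```

With these numbers a task that last started running 30ms ago (the
sleeping schbench case) gets false, so the scan runs, while a task that
ran 10us ago (the pipe-style bouncing case) gets true and stays put.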

> 
> _However_, the migration_cost is supposed to model the cost of leaving
> the LLC, so testing against that here seems wrong.
> 
> Let me go play with something that measures the cost of doing that LLC
> scan and compares that against the sleepy time -- of course, now I need to
> go figure out how to do this clock thing without rq-lock pain.
> 
> 
> 
> +       if (package_sd && !bounce_to_target(p, target)) {
> +               for_each_cpu_and(i, sched_domain_span(package_sd), tsk_cpus_allowed(p)) {
> +                       if (idle_cpu(i)) {
> +                               target = i;
> +                               break;
> +                       }
> +
> +               }
> +       }
> 
> Also note your s/sd/package_sd/ rename is, strictly speaking, wrong.
> Sure, on your current Intel system the LLC is the entire package, but
> this is not true in general.
> 
> Take for instance the Intel Core2Quad and AMD Bulldozer thingies, they
> had two dies in one package, and correspondingly two LLC domains in one
> package.
> 
> (also, the Intel cluster-on-die thing can split the thing in two)
> 
> There were also the old P6 era SMP boards which had external LLC, where
> you could have an LLC shared across multiple packages -- although I'm
> thinking we'll never see that again, due to off package being far
> toooooo slooooooow these days.

Gotcha, makes sense.  I'll switch to llc_sd ;)
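
For reference, the scan itself boils down to "first idle CPU that is
both in the LLC span and allowed for the task, else keep the target".
A userspace sketch with plain 64-bit masks (first_idle_allowed() and
the mask layout are illustrative assumptions, not kernel interfaces):

```c
#include <stdint.h>

/*
 * Hypothetical model of the idle scan: bit N set in a mask means CPU N.
 * Returns the lowest-numbered CPU present in all three masks, or
 * 'target' unchanged when no such CPU exists.
 */
static int first_idle_allowed(uint64_t llc_span, uint64_t allowed,
			      uint64_t idle, int target)
{
	uint64_t candidates = llc_span & allowed & idle;

	for (int cpu = 0; cpu < 64; cpu++) {
		if (candidates & (UINT64_C(1) << cpu))
			return cpu;	/* mirrors the break on first hit */
	}

	return target;
}
```

With the task restricted to CPUs 0 and 1 and only CPU 0 idle, this
returns 0; with nothing idle it returns the original target.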

-chris