Date:	Wed, 13 Jul 2016 16:54:35 +0100
From:	Morten Rasmussen <morten.rasmussen@....com>
To:	Vincent Guittot <vincent.guittot@...aro.org>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	"mingo@...hat.com" <mingo@...hat.com>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Yuyang Du <yuyang.du@...el.com>, mgalbraith@...e.de,
	linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity
 support

On Wed, Jul 13, 2016 at 02:06:17PM +0200, Vincent Guittot wrote:
> Hi Morten,
> 
> On 22 June 2016 at 19:03, Morten Rasmussen <morten.rasmussen@....com> wrote:
> > Hi,
> >
> > The scheduler is currently not doing much to help performance on systems with
> > asymmetric compute capacities (read ARM big.LITTLE). This series improves the
> > situation with a few tweaks mainly to the task wake-up path that considers
> > compute capacity at wake-up and not just whether a cpu is idle for these
> > systems. This gives us consistent, and potentially higher, throughput in
> > partially utilized scenarios. SMP behaviour and performance should be
> > unaffected.
> >
> > Test 0:
> >         for i in `seq 1 10`; \
> >                do sysbench --test=cpu --max-time=3 --num-threads=1 run; \
> >                done \
> >         | awk '{if ($4=="events:") {print $5; sum +=$5; runs +=1}} \
> >                END {print "Average events: " sum/runs}'
> >
> > Target: ARM TC2 (2xA15+3xA7)
> >
> >         (Higher is better)
> > tip:    Average events: 146.9
> > patch:  Average events: 217.9
> >
> > Test 1:
> >         perf stat --null --repeat 10 -- \
> >         perf bench sched messaging -g 50 -l 5000
> >
> > Target: Intel IVB-EP (2*10*2)
> >
> > tip:    4.861970420 seconds time elapsed ( +-  1.39% )
> > patch:  4.886204224 seconds time elapsed ( +-  0.75% )
> >
> > Target: ARM TC2 A7-only (3xA7) (-l 1000)
> >
> > tip:    61.485682596 seconds time elapsed ( +-  0.07% )
> > patch:  62.667950130 seconds time elapsed ( +-  0.36% )
> >
> > More analysis:
> >
> > Statistics from mixed periodic task workload (rt-app) containing both
> > big and little task, single run on ARM TC2:
> >
> > tu   = Task utilization big/little
> > pcpu = Previous cpu big/little
> > tcpu = This (waker) cpu big/little
> > dl   = New cpu is little
> > db   = New cpu is big
> > sis  = New cpu chosen by select_idle_sibling()
> > figc = New cpu chosen by find_idlest_*()
> > ww   = wake_wide(task) count for figc wakeups
> > bw   = sd_flag & SD_BALANCE_WAKE (non-fork/exec wake)
> >        for figc wakeups
> >
> > case tu   pcpu tcpu   dl   db  sis figc   ww   bw
> > 1    l    l    l     122   68   28  162  161  161
> > 2    l    l    b      11    4    0   15   15   15
> > 3    l    b    l       0  252    8  244  244  244
> > 4    l    b    b      36 1928  711 1253 1016 1016
> > 5    b    l    l       5   19    0   24   22   24
> > 6    b    l    b       5    1    0    6    0    6
> > 7    b    b    l       0   31    0   31   31   31
> > 8    b    b    b       1  194  109   86   59   59
> > --------------------------------------------------
> >                      180 2497  856 1821
> 
> I'm not sure how to interpret all these statistics

Thanks for looking into the details. Let me provide a bit more context.

After our discussion around v1 I wanted to understand how the patches
work with different combinations of task utilization, prev_cpu, and
waking cpu. IIRC, the outcome of our discussion was that tasks with
utilization too high to fit on little cpus should go on big cpus, while
tasks small enough to fit anywhere can go anywhere. For the latter we
don't want to spend too much time on placement, as they essentially
don't care, so they can be placed using select_idle_sibling().
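
For reference, the capacity fit-check in the series, wake_cap(), is
roughly the following (a simplified sketch from memory rather than the
verbatim patch; capacity_orig_of(), task_util(), and capacity_margin
are the names used in the series):

	static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
	{
		long min_cap, max_cap;

		min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
		max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;

		/* Capacities are nearly symmetric: don't disable wake_affine */
		if (max_cap - min_cap < max_cap >> 3)
			return 0;

		/* Task doesn't fit on the smaller cpu: take the slow path */
		return min_cap * 1024 < task_util(p) * capacity_margin;
	}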

So, I created an rt-app workload with a number of periodic tasks with
different periods and busy times. I traced all wake-ups and put them
into eight categories depending on the wake-up scenario, i.e. task
utilization, prev_cpu, and waking cpu (tu, pcpu, and tcpu).
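
(The case numbering in the table is just that big/little triplet read
as a binary number; in my post-processing it amounts to something like
this hypothetical helper -- illustration only, not kernel code:)

	/* Map the big/little triplet onto cases 1-8 of the table above,
	 * with big = 1 and little = 0, so e.g. (l, b, b) maps to case 4. */
	static int wakeup_case(int tu_big, int pcpu_big, int tcpu_big)
	{
		return 1 + 4 * tu_big + 2 * pcpu_big + tcpu_big;
	}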

The next two columns (dl, db) show the number of wake-ups that ended up
on a little or big cpu. If we take case 1 as an example, we had 190
wake-ups in total where a little task last ran on a little cpu and was
woken up by a little cpu. In 122 of those wake-ups a little cpu was
selected again, while in 68 cases the task went to a big cpu. Either is
fine according to the scheduling policy above, since little tasks can
go anywhere.

The sis and figc columns show the split between wake-ups handled by
select_idle_sibling() and those handled by find_idlest_*(). Coming back
to case 1, 28 wake-ups were handled by the former and 162 by the
latter. We can't say exactly how many of the select_idle_sibling()
wake-ups ended up on big or little cpus, but since at most 28 of the 68
big placements can have come from select_idle_sibling(), it is clear
that find_idlest_*() chose a big cpu for a little task in at least 40
cases.

The last two columns, ww and bw, try to explain why we have so many
wake-ups handled by find_idlest_*() in cases 1-4 and 8, where we could
have used select_idle_sibling(). The bw number is the number of
find_idlest_*() wake-ups that were passed the SD_BALANCE_WAKE flag (i.e.
non-FORK and non-EXEC wake-ups). FORK and EXEC wake-ups always take the
find_idlest_*() route, so we should ignore those. For case 1 it turned
out that only one of the 162 figc wake-ups was a FORK/EXEC wake-up, so
something else must have caused those wake-ups to not go via
select_idle_sibling(). The ww column explains why, as it shows how many
of the figc wake-ups had wake_wide() return true and therefore disabled
want_affine. Because we have enabled SD_BALANCE_WAKE on the
sched_domains, !want_affine wake-ups no longer end up in
select_idle_sibling() anyway, but in find_idlest_*().
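
To see why that matters, here is a sketch of the relevant part of
select_task_rq_fair() with this series applied (abbreviated and from
memory, so illustrative rather than verbatim):

	if (sd_flag & SD_BALANCE_WAKE)
		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) &&
			      cpumask_test_cpu(cpu, tsk_cpus_allowed(p));

	for_each_domain(cpu, tmp) {
		/*
		 * want_affine: stop at the lowest domain spanning both
		 * cpu and prev_cpu and use select_idle_sibling().
		 */
		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
			affine_sd = tmp;
			break;
		}

		/*
		 * !want_affine: with SD_BALANCE_WAKE set on the domains,
		 * sd stays non-NULL here, so we end up in the slow
		 * find_idlest_*() path instead.
		 */
		if (tmp->flags & sd_flag)
			sd = tmp;
	}

So once wake_wide() clears want_affine, having SD_BALANCE_WAKE set on
the domains means sd survives the walk and the slow path is taken.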

Thinking more about it, should we force those tasks to use
select_idle_sibling() anyway? Something like the below could do it, I
think:

@@ -5444,7 +5444,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
                        new_cpu = cpu;
        }
 
-       if (!sd) {
+       if (!sd || (!wake_cap(p, cpu, prev_cpu) && (sd_flag & SD_BALANCE_WAKE))) {
                if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
                        new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);

Ideally, cases 5-7 should be handled by find_idlest_*(), which seems to
hold true in the table above, and cases 1-4 and 8 should be handled by
select_idle_sibling(), which isn't always the case due to wake_wide().

Thoughts?
