Date:	Fri, 8 Nov 2013 22:12:27 +0100
From:	"Rowand, Frank" <Frank.Rowand@...ymobile.com>
To:	Vincent Guittot <vincent.guittot@...aro.org>
CC:	"catalin.marinas@....com" <catalin.marinas@....com>,
	"Morten.Rasmussen@....com" <Morten.Rasmussen@....com>,
	"alex.shi@...aro.org" <alex.shi@...aro.org>,
	"peterz@...radead.org" <peterz@...radead.org>,
	"pjt@...gle.com" <pjt@...gle.com>,
	"mingo@...nel.org" <mingo@...nel.org>,
	"rjw@...ysocki.net" <rjw@...ysocki.net>,
	"srivatsa.bhat@...ux.vnet.ibm.com" <srivatsa.bhat@...ux.vnet.ibm.com>,
	"paul@...an.com" <paul@...an.com>,
	"mgorman@...e.de" <mgorman@...e.de>,
	"juri.lelli@...il.com" <juri.lelli@...il.com>,
	"fengguang.wu@...el.com" <fengguang.wu@...el.com>,
	"markgross@...gnar.org" <markgross@...gnar.org>,
	"khilman@...aro.org" <khilman@...aro.org>,
	"paulmck@...ux.vnet.ibm.com" <paulmck@...ux.vnet.ibm.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: Bench for testing scheduler


On Friday, November 08, 2013 1:28 AM, Vincent Guittot [vincent.guittot@...aro.org] wrote:
> 
> On 8 November 2013 01:04, Rowand, Frank <Frank.Rowand@...ymobile.com> wrote:
> > Hi Vincent,
> >
> > Thanks for creating some benchmark numbers!
> 
> You're welcome.
> 
> >
> >
> > On Thursday, November 07, 2013 5:33 AM, Vincent Guittot [vincent.guittot@...aro.org] wrote:
> >>
> >> On 7 November 2013 12:32, Catalin Marinas <catalin.marinas@....com> wrote:
> >> > Hi Vincent,
> >> >
> >> > (for whatever reason, the text is wrapped and the results are hard to read)
> >>
> >> Yes, I have just seen that. It looks like Gmail has wrapped the lines.
> >> I have added the results, which should not be wrapped, at the end of this email.
> >>
> >> >
> >> >
> >> > On Thu, Nov 07, 2013 at 10:54:30AM +0000, Vincent Guittot wrote:
> >> >> During the Energy-aware scheduling mini-summit, we spoke about benches
> >> >> that should be used to evaluate the modifications of the scheduler.
> >> >> I’d like to propose a bench that uses cyclictest to measure the wake-up
> >> >> latency and the power consumption. The goal of this bench is to
> >> >> exercise the scheduler with various sleeping periods and get the
> >> >> average wakeup latency. The range of the sleeping periods must cover
> >> >> all residency times of the idle state table of the platform. I have
> >> >> run such tests on a TC2 platform with the packing tasks patchset.
> >> >> I have used the following command:
> >> >> #cyclictest -t <number of cores> -q -e 10000000 -i <500-12000> -d 150 -l 2000
> >
> > The number of loops ("-l 2000") should be much larger to create useful
> > results.  I don't have a specific number that is large enough; I just
> > know from experience that 2000 is way too small.  For example, running
> > cyclictest several times with the same values on my laptop gives values
> > that are not consistent:
> 
> The Avg figures look almost stable IMO. Are you speaking about the Max
> value for the inconsistency?

The values on my laptop for "-l 2000" are not stable.

If I collapse all of the threads in each of the following tests to a
single value, I get the following table.  Note that each thread completes
a different number of cycles, so I calculate the average as:

  total count = T0_count + T1_count + T2_count + T3_count

  avg = ( (T0_count * T0_avg) + (T1_count * T1_avg) + ... + (T3_count * T3_avg) ) / total count

  min is the smallest min for any of the threads

  max is the largest max for any of the threads

            total
test   T    count  min     avg   max
---- --- -------- ---- ------- -----
   1   4     5886    2    76.0  1017
   2   4     5881    2    71.5   810
   3   4     5885    2    74.2  1143
   4   4     5884    2    68.9  1279

test 1's average is 10% larger than test 4's.

test 4's maximum is roughly 58% larger than test 2's.
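
For reference, a rough sketch of that collapsing step as a small Python
script (hypothetical, not the exact script I used), reading the per-thread
summary lines that cyclictest -q prints, as in the runs quoted further down:

  #!/usr/bin/env python3
  # Hypothetical sketch: collapse per-thread cyclictest -q summary lines
  # ("T: 0 (...) P: 0 I:500 C: 2000 Min: 2 Act: 90 Avg: 77 Max: 243")
  # into a single count-weighted min/avg/max, as in the table above.
  import re
  import sys

  pattern = re.compile(
      r"C:\s*(\d+)\s+Min:\s*(\d+)\s+Act:\s*-?\d+\s+Avg:\s*(\d+)\s+Max:\s*(\d+)")

  total_count = 0
  weighted_avg = 0.0
  mins, maxs = [], []

  for line in sys.stdin:
      m = pattern.search(line)
      if not m:
          continue
      count, tmin, tavg, tmax = map(int, m.groups())
      total_count += count
      weighted_avg += count * tavg      # weight each thread by its cycle count
      mins.append(tmin)
      maxs.append(tmax)

  if total_count:
      print("total count %d  min %d  avg %.1f  max %d"
            % (total_count, min(mins), weighted_avg / total_count, max(maxs)))

Feeding it the saved output of one run should reproduce the corresponding
row of the table above.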

But all of this is just a minor detail of how to run cyclictest.  The more
important question is whether to use cyclictest results as a valid workload
or metric, so for the moment I won't comment further on the cyclictest
parameters you used to collect the example data you provided.


> 
> >
> >    $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
> >    # /dev/cpu_dma_latency set to 10000000us
> >    T: 0 ( 9703) P: 0 I:500 C:   2000 Min:      2 Act:   90 Avg:   77 Max:     243
> >    T: 1 ( 9704) P: 0 I:650 C:   1557 Min:      2 Act:   58 Avg:   68 Max:     226
> >    T: 2 ( 9705) P: 0 I:800 C:   1264 Min:      2 Act:   54 Avg:   81 Max:    1017
> >    T: 3 ( 9706) P: 0 I:950 C:   1065 Min:      2 Act:   11 Avg:   80 Max:     260
> >
> >    $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
> >    # /dev/cpu_dma_latency set to 10000000us
> >    T: 0 ( 9709) P: 0 I:500 C:   2000 Min:      2 Act:   45 Avg:   74 Max:     390
> >    T: 1 ( 9710) P: 0 I:650 C:   1554 Min:      2 Act:   82 Avg:   61 Max:     810
> >    T: 2 ( 9711) P: 0 I:800 C:   1263 Min:      2 Act:   83 Avg:   74 Max:     287
> >    T: 3 ( 9712) P: 0 I:950 C:   1064 Min:      2 Act:  103 Avg:   79 Max:     551
> >
> >    $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
> >    # /dev/cpu_dma_latency set to 10000000us
> >    T: 0 ( 9716) P: 0 I:500 C:   2000 Min:      2 Act:   82 Avg:   72 Max:     252
> >    T: 1 ( 9717) P: 0 I:650 C:   1556 Min:      2 Act:  115 Avg:   77 Max:     354
> >    T: 2 ( 9718) P: 0 I:800 C:   1264 Min:      2 Act:   59 Avg:   78 Max:    1143
> >    T: 3 ( 9719) P: 0 I:950 C:   1065 Min:      2 Act:  104 Avg:   70 Max:     238
> >
> >    $ sudo ./cyclictest -t -q -e 10000000 -i 500 -d 150 -l 2000
> >    # /dev/cpu_dma_latency set to 10000000us
> >    T: 0 ( 9722) P: 0 I:500 C:   2000 Min:      2 Act:   82 Avg:   68 Max:     213
> >    T: 1 ( 9723) P: 0 I:650 C:   1555 Min:      2 Act:   65 Avg:   65 Max:    1279
> >    T: 2 ( 9724) P: 0 I:800 C:   1264 Min:      2 Act:   91 Avg:   69 Max:     244
> >    T: 3 ( 9725) P: 0 I:950 C:   1065 Min:      2 Act:   58 Avg:   76 Max:     242
> >
> >
> >> >
> >> > cyclictest could be a good starting point but we need to improve it to
> >> > allow threads of different loads, possibly starting multiple processes
> >> > (which can be done with a script), and randomly varying load threads.
> >> > These parameters should be loaded from a file so that we can have
> >> > multiple configurations (per SoC and per use-case). But the big risk is
> >> > that we try to optimise the scheduler for something which is not
> >> > realistic.
> >>
> >> The goal of this simple bench is to measure the wake-up latency that the scheduler can reach on a platform, not to emulate a "real" use case. In the same way that sched-pipe tests a specific behavior of the scheduler, this bench tests the wake-up latency of a system.
> >>
> >> Starting multiple processes and adding some load can also be useful, but the target will be a bit different from wake-up latency. I have one concern with randomness because it prevents us from having repeatable and comparable tests and results.
> >>
> >> I agree that we have to test "real" use cases, but that doesn't prevent us from testing the limit of a characteristic of a system.
> >>
> >> >
> >> >
> >> > We are working on describing some basic scenarios (plain English for
> >> > now) and one of them could be video playing with threads for audio and
> >> > video decoding with random change in the workload.
> >> >
> >> > So I think the first step should be a set of tools/scripts to analyse
> >> > the scheduler behaviour, both in terms of latency and power, and these
> >> > can use perf sched. We can then run some real-life scenarios (e.g.
> >> > Android video playback) and build a benchmark that matches such
> >> > behaviour as closely as possible. We can probably use (or improve) perf
> >> > sched replay to also simulate such a workload (we may need additional
> >> > features like thread dependencies).
> >> >
> >> >> The figures below give the average wakeup latency and power
> >> >> consumption for the default scheduler behavior, packing tasks at
> >> >> cluster level, and packing tasks at core level. We can see both wakeup
> >> >> latency and power consumption variation. The detailed result is not a
> >> >> simple single value, which makes comparison not so easy, but the
> >> >> average of all measurements should give us a usable “score”.
> >> >
> >> > How did you assess the power/energy?
> >>
> >> I have used the embedded joule meter of the TC2.
> >>
> >> >
> >> > Thanks.
> >> >
> >> > --
> >> > Catalin
> >>
> >>             |  Default average results                  |  Cluster Packing average results          |  Core Packing average results
> >>             |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy
> >>             |     (us)                   (J)        (J) |     (us)                   (J)        (J) |     (us)                   (J)        (J)
> >>             |      879                794890    2364175 |      416                879688      12750 |      189                897452      30052
> >>
> >>  Cyclictest |  Default                                  |  Packing at Cluster level                 |  Packing at Core level
> >>    Interval |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy |  Latency     stddev  A7 energy A15 energy
> >>        (us) |     (us)                   (J)        (J) |     (us)                   (J)        (J) |     (us)                   (J)        (J)
> >>         500         24          1    1147477    2479576         21          1    1136768      11693         22          1    1126062      30138
> >>         700         22          1    1136084    3058419         21          0    1125280      11761         21          1    1109950      23503
> >
> > < snip >
> >

Thanks for clarifying how the data was calculated (below).  Again, I don't think
this level of detail is the most important issue at this point, but I'm going
to comment on it while it is still fresh in my mind.

> > Some questions about what these metrics are:
> >
> > The cyclictest data is reported per thread.  How did you combine the per thread data
> > to get a single latency and stddev value?
> >
> > Is "Latency" the average latency?
> 
> Yes. I have described below the procedure I have followed to get my results:
> 
> I run the same test (same parameters) several times (I have tried
> between 5 and 10 runs and the results were similar).
> For each run, I compute the average of the per-thread average figures and I
> compute the stddev between the per-thread results.

So the test run stddev is the standard deviation of the average latency
values of the 8 (???) cyclictest threads in a test run?

If so, I don't think that the calculated stddev has much actual meaning for
comparing the algorithms (though I do find it useful for getting a loose sense
of how consistent multiple test runs with the same parameters are).

> The results that I sent are an average of all runs with the same parameters.

Then the stddev in the table is the average of the stddevs from several test runs?

The stddev later on in the table is often in the range of 10%, 20%, 50%, and 100%
of the average latency.  That is rather large.
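
If I have read the procedure right, the aggregation would look roughly like
the following sketch (made-up per-thread averages, just to pin down the
arithmetic):

  # Rough sketch of my reading of the described aggregation (numbers made up):
  # per run, take the mean of the per-thread average latencies and the stddev
  # across those per-thread averages, then average both across the runs.
  from statistics import mean, pstdev

  runs = [                  # per-thread average latency (us) for each run
      [77, 68, 81, 80],
      [74, 61, 74, 79],
  ]

  run_avgs    = [mean(r)   for r in runs]    # one average per run
  run_stddevs = [pstdev(r) for r in runs]    # spread across threads in a run

  print("reported latency:", mean(run_avgs))     # average of the run averages
  print("reported stddev: ", mean(run_stddevs))  # average of the run stddevs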

> 
> >
> > stddev is not reported by cyclictest.  How did you create this value?  Did you
> > use the "-v" cyclictest option to report detailed data, then calculate stddev from
> > the detailed data?
> 
> No, I haven't used -v because it generates too many spurious wake-ups,
> which makes the results irrelevant.

Yes, I agree about not using -v.  It was just a wild guess on my part since
I did not know how stddev was calculated.  And I was incorrectly guessing
that stddev was describing the frequency distribution of the latencies
from a single test run.

As a general comment on cyclictest, I don't find average latency
(in isolation) sufficient to compare different runs of cyclictest.
And the stddev of the frequency distribution of the latencies (which
can be calculated from the -h data, with fairly low cyclictest
overhead) is usually interesting but should be viewed with healthy
skepticism since that frequency distribution is often not a normal
distribution.  In addition to average latency, I normally look at
maximum latency and the frequency distribution of latencies (in table
or graph form).

(One side effect of specifying -h is that the -d option is then
ignored.)
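
As an illustration, a hypothetical sketch of pulling the distribution mean
and stddev out of the -h histogram lines (plain "<latency_us> <count per
thread> ..." rows, read here on stdin):

  # Hypothetical sketch: mean and stddev of the latency frequency distribution
  # from "cyclictest -q -h <max_us>" output.  Histogram lines look like
  # "<latency_us> <count_thread0> <count_thread1> ..."; everything else
  # (lines starting with '#', the per-thread summary lines) is skipped.
  # As noted above, treat the stddev with skepticism: the distribution is
  # usually not normal.
  import math
  import sys

  samples = 0
  lat_sum = 0.0
  lat_sq_sum = 0.0

  for line in sys.stdin:
      fields = line.split()
      if not fields or fields[0].startswith("#"):
          continue
      try:
          latency = int(fields[0])                   # bucket value, in us
          count = sum(int(f) for f in fields[1:])    # samples across threads
      except ValueError:
          continue                                   # not a histogram line
      samples += count
      lat_sum += count * latency
      lat_sq_sum += count * latency * latency

  if samples:
      m = lat_sum / samples
      sd = math.sqrt(max(lat_sq_sum / samples - m * m, 0.0))
      print("samples %d  mean %.1f us  stddev %.1f us" % (samples, m, sd))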

Thanks,

-Frank
