linux-kernel - Re: [RFC/RFT][PATCH v2] cpuidle: New timer events oriented governor for tickless systems

Message-ID: <4168371.zz0pVZtGOY@aspire.rjw.lan>
Date:   Sun, 04 Nov 2018 11:06:06 +0100
From:   "Rafael J. Wysocki" <rjw@...ysocki.net>
To:     Giovanni Gherdovich <ggherdovich@...e.cz>
Cc:     Linux PM <linux-pm@...r.kernel.org>,
        Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
        Peter Zijlstra <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Frederic Weisbecker <frederic@...nel.org>,
        Mel Gorman <mgorman@...e.de>,
        Doug Smythies <dsmythies@...us.net>,
        Daniel Lezcano <daniel.lezcano@...aro.org>
Subject: Re: [RFC/RFT][PATCH v2] cpuidle: New timer events oriented governor for tickless systems

On Wednesday, October 31, 2018 7:36:21 PM CET Giovanni Gherdovich wrote:
> On Fri, 2018-10-26 at 11:12 +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@...el.com>

[cut]

> 
> Hello Rafael,

Hi Giovanni,

First off, many thanks for doing this work, it is very very much appreciated!

> your new governor has a neutral impact on performance, as you expected. This is
> a positive result, since the purpose of "teo" is to give improved
> predictions on idle times without regressing on the performance side.

Right.

> There are swings here and there but nothing looks extremely bad. v2 is largely
> equivalent to v1 in my tests, except for sockperf and netperf on the
> Haswell machine (v2 slightly worse) and tbench on the Skylake machine
> (again v2 slightly worse).

Thanks for the data.

I have some ideas on what may be the difference between the v1 and the v2 on
these machines, more about that below.

> I've tested your patches applying them on v4.18 (plus the backport
> necessary for v2 as Doug helpfully noted), just because it was the latest
> release when I started preparing this.
> 
> I've tested it on three machines, with different generations of Intel CPUs:
> 
> * single socket E3-1240 v5 (Skylake 8 cores, which I'll call 8x-SKYLAKE-UMA)
> * two sockets E5-2698 v4 (Broadwell 80 cores, 80x-BROADWELL-NUMA from here onwards)
> * two sockets E5-2670 v3 (Haswell 48 cores, 48x-HASWELL-NUMA from here onwards)
> 
> 
> BENCHMARKS WITH NEUTRAL RESULTS
> ===============================
> 
> These are the workloads where no noticeable difference is measured (on both
> v1 and v2, all machines), together with the corresponding MMTests[1]
> configuration file name:
> 
> * pgbench read-only on xfs, pgbench read/write on xfs
> 	* global-dhp__db-pgbench-timed-ro-small-xfs
> 	* global-dhp__db-pgbench-timed-rw-small-xfs
> * siege
> 	* global-dhp__http-siege
> * hackbench, pipetest
> 	* global-dhp__scheduler-unbound
> * Linux kernel compilation
> 	* global-dhp__workload_kerndevel-xfs
> * NASA Parallel Benchmarks, C-Class (linear algebra; run both with OpenMP
>   and OpenMPI, over xfs)
> 	* global-dhp__nas-c-class-mpi-full-xfs
> 	* global-dhp__nas-c-class-omp-full
> * FIO (Flexible IO) in several configurations
> 	* global-dhp__io-fio-randread-async-randwrite-xfs
> 	* global-dhp__io-fio-randread-async-seqwrite-xfs
> 	* global-dhp__io-fio-seqread-doublemem-32k-4t-xfs
> 	* global-dhp__io-fio-seqread-doublemem-4k-4t-xfs
> * netperf on loopback over TCP
> 	* global-dhp__network-netperf-unbound

The above is great to know.

> BENCHMARKS WITH NON-NEUTRAL RESULTS: OVERVIEW
> =============================================
> 
> These are benchmarks which exhibit a variation in their performance;
> you'll see the magnitude of the changes is moderate and it's highly variable
> from machine to machine. All percentages refer to the v4.18 baseline. In
> more than one case the Haswell machine seems to prefer v1 to v2.
> 
> * xfsrepair
> 	* global-dhp__io-xfsrepair-xfs
> 
> 					teo-v1		teo-v2
> 		-------------------------------------------------
> 		8x-SKYLAKE-UMA		2% worse	2% worse
> 		80x-BROADWELL-NUMA	1% worse	1% worse
> 		48x-HASWELL-NUMA	1% worse	1% worse
> 
> * sqlite (insert operations on xfs)
> 	* global-dhp__db-sqlite-insert-medium-xfs
> 
> 					teo-v1		teo-v2
> 		-------------------------------------------------
> 		8x-SKYLAKE-UMA		no change	no change
> 		80x-BROADWELL-NUMA	2% worse	3% worse
> 		48x-HASWELL-NUMA	no change	no change
> 
> * netperf on loopback over UDP
> 	* global-dhp__network-netperf-unbound
> 
> 					teo-v1		teo-v2
> 		-------------------------------------------------
> 		8x-SKYLAKE-UMA		no change	6% worse
> 		80x-BROADWELL-NUMA	1% worse	4% worse
> 		48x-HASWELL-NUMA	3% better	5% worse
> 
> * sockperf on loopback over TCP, mode "under load"
> 	* global-dhp__network-sockperf-unbound
> 
> 					teo-v1		teo-v2
> 		-------------------------------------------------
> 		8x-SKYLAKE-UMA		6% worse	no change
> 		80x-BROADWELL-NUMA	7% better	no change
> 		48x-HASWELL-NUMA	3% better	2% worse
> 
> * sockperf on loopback over UDP, mode "throughput"
> 	* global-dhp__network-sockperf-unbound

Generally speaking, I'm not worried about single-digit percent differences,
because overall they tend to fall into the noise range in the grand picture.

> 					teo-v1		teo-v2
> 		-------------------------------------------------
> 		8x-SKYLAKE-UMA		1% worse	1% worse
> 		80x-BROADWELL-NUMA	3% better	2% better
> 		48x-HASWELL-NUMA	4% better	12% worse

But the 12% difference here is slightly worrisome.

> * sockperf on loopback over UDP, mode "under load"
> 	* global-dhp__network-sockperf-unbound
> 
> 					teo-v1		teo-v2
> 		-------------------------------------------------
> 		8x-SKYLAKE-UMA		3% worse	1% worse
> 		80x-BROADWELL-NUMA	10% better	8% better
> 		48x-HASWELL-NUMA	1% better	no change
> 
> * dbench on xfs
>         * global-dhp__io-dbench4-async-xfs
> 
> 					teo-v1		teo-v2
> 		-------------------------------------------------
> 		8x-SKYLAKE-UMA		3% better	4% better
> 		80x-BROADWELL-NUMA	no change	no change
> 		48x-HASWELL-NUMA	6% worse	16% worse

And same here.

> * tbench on loopback
> 	* global-dhp__network-tbench
> 
> 					teo-v1		teo-v2
> 		-------------------------------------------------
> 		8x-SKYLAKE-UMA		1% worse	10% worse
> 		80x-BROADWELL-NUMA	1% worse	1% worse
> 		48x-HASWELL-NUMA	1% worse	2% worse
> 
> * schbench
> 	* global-dhp__workload_schbench
> 
> 					teo-v1		teo-v2
> 		-------------------------------------------------
> 		8x-SKYLAKE-UMA		1% better	no change
> 		80x-BROADWELL-NUMA	2% worse	1% worse
> 		48x-HASWELL-NUMA	2% worse	3% worse
> 
> * gitsource on xfs (git unit tests, shell intensive)
> 	* global-dhp__workload_shellscripts-xfs
> 
> 					teo-v1		teo-v2
> 		-------------------------------------------------
> 		8x-SKYLAKE-UMA		no change	no change
> 		80x-BROADWELL-NUMA	no change	1% better
> 		48x-HASWELL-NUMA	no change	1% better
> 
> 
> BENCHMARKS WITH NON-NEUTRAL RESULTS: DETAIL
> ===========================================
> 
> Now some more detail. Each benchmark is run in a variety of configurations
> (eg. number of threads, number of concurrent connections and so forth) each
> of them giving a result. What you see above is the geometric mean of
> "sub-results"; below is the detailed view where there was a regression
> larger than 5% (either in v1 or v2, on any of the machines). That means
> I'll exclude xfsrepair, sqlite, schbench and the git unit tests "gitsource"
> that have negligible swings from the baseline.
> 
> In all tables asterisks indicate a statement about statistical
> significance: the difference with baseline has a p-value smaller than 0.1
> (small p-values indicate that the difference is unlikely to be just random
> noise).
> 
> NETPERF-UDP
> ===========
> NOTES: Test run in mode "stream" over UDP. The varying parameter is the
>     message size in bytes. Each measurement is taken 5 times and the
>     harmonic mean is reported.
> MEASURES: Throughput in MBits/second, both on the sender and on the receiver end.
> HIGHER is better
> 
> machine: 8x-SKYLAKE-UMA
>                                      4.18.0                 4.18.0                 4.18.0
>                                     vanilla                 teo-v1        teo-v2+backport
> -----------------------------------------------------------------------------------------
> Hmean     send-64         362.27 (   0.00%)      362.87 (   0.16%)      318.85 * -11.99%*
> Hmean     send-128        723.17 (   0.00%)      723.66 (   0.07%)      660.96 *  -8.60%*
> Hmean     send-256       1435.24 (   0.00%)     1427.08 (  -0.57%)     1346.22 *  -6.20%*
> Hmean     send-1024      5563.78 (   0.00%)     5529.90 *  -0.61%*     5228.28 *  -6.03%*
> Hmean     send-2048     10935.42 (   0.00%)    10809.66 *  -1.15%*    10521.14 *  -3.79%*
> Hmean     send-3312     16898.66 (   0.00%)    16539.89 *  -2.12%*    16240.87 *  -3.89%*
> Hmean     send-4096     19354.33 (   0.00%)    19185.43 (  -0.87%)    18600.52 *  -3.89%*
> Hmean     send-8192     32238.80 (   0.00%)    32275.57 (   0.11%)    29850.62 *  -7.41%*
> Hmean     send-16384    48146.75 (   0.00%)    49297.23 *   2.39%*    48295.51 (   0.31%)
> Hmean     recv-64         362.16 (   0.00%)      362.87 (   0.19%)      318.82 * -11.97%*
> Hmean     recv-128        723.01 (   0.00%)      723.66 (   0.09%)      660.89 *  -8.59%*
> Hmean     recv-256       1435.06 (   0.00%)     1426.94 (  -0.57%)     1346.07 *  -6.20%*
> Hmean     recv-1024      5562.68 (   0.00%)     5529.90 *  -0.59%*     5228.28 *  -6.01%*
> Hmean     recv-2048     10934.36 (   0.00%)    10809.66 *  -1.14%*    10519.89 *  -3.79%*
> Hmean     recv-3312     16898.65 (   0.00%)    16538.21 *  -2.13%*    16240.86 *  -3.89%*
> Hmean     recv-4096     19351.99 (   0.00%)    19183.17 (  -0.87%)    18598.33 *  -3.89%*
> Hmean     recv-8192     32238.74 (   0.00%)    32275.13 (   0.11%)    29850.39 *  -7.41%*
> Hmean     recv-16384    48146.59 (   0.00%)    49296.23 *   2.39%*    48295.03 (   0.31%)

That is a bit worse than I would like it to be TBH.

> SOCKPERF-TCP-UNDER-LOAD
> =======================
> NOTES: Test run in mode "under load" over TCP. Parameters are message size
>     and transmission rate.
> MEASURES: Round-trip time in microseconds
> LOWER is better
> 
> machine: 8x-SKYLAKE-UMA
>                                                  4.18.0                 4.18.0                 4.18.0
>                                                 vanilla                 teo-v1        teo-v2+backport
> -----------------------------------------------------------------------------------------------------
> Amean        size-14-rate-10000        36.43 (   0.00%)       36.86 (  -1.17%)       20.24 (  44.44%)
> Amean        size-14-rate-24000        17.78 (   0.00%)       17.71 (   0.36%)       18.54 (  -4.29%)
> Amean        size-14-rate-50000        20.53 (   0.00%)       22.29 (  -8.58%)       16.16 (  21.30%)
> Amean        size-100-rate-10000       21.22 (   0.00%)       23.41 ( -10.35%)       33.04 ( -55.73%)
> Amean        size-100-rate-24000       17.81 (   0.00%)       21.09 ( -18.40%)       14.39 (  19.18%)
> Amean        size-100-rate-50000       12.31 (   0.00%)       19.65 ( -59.64%)       15.11 ( -22.77%)
> Amean        size-300-rate-10000       34.21 (   0.00%)       35.30 (  -3.19%)       34.20 (   0.05%)
> Amean        size-300-rate-24000       24.52 (   0.00%)       26.00 (  -6.04%)       27.42 ( -11.81%)
> Amean        size-300-rate-50000       20.20 (   0.00%)       20.39 (  -0.95%)       17.83 (  11.73%)
> Amean        size-500-rate-10000       21.56 (   0.00%)       21.31 (   1.15%)       29.32 ( -35.98%)
> Amean        size-500-rate-24000       30.58 (   0.00%)       27.41 (  10.38%)       27.21 (  11.03%)
> Amean        size-500-rate-50000       19.46 (   0.00%)       22.48 ( -15.55%)       16.29 (  16.30%)
> Amean        size-850-rate-10000       35.89 (   0.00%)       35.56 (   0.91%)       23.84 (  33.57%)
> Amean        size-850-rate-24000       29.11 (   0.00%)       28.18 (   3.20%)       17.44 (  40.08%)
> Amean        size-850-rate-50000       13.55 (   0.00%)       18.05 ( -33.26%)       21.30 ( -57.20%)

IMO there is too much variation here to draw any meaningful conclusions from it.

> SOCKPERF-UDP-THROUGHPUT
> =======================
> NOTES: Test run in mode "throughput" over UDP. The varying parameter is the
>     message size.
> MEASURES: Throughput, in MBits/second
> HIGHER is better
> 
> machine: 48x-HASWELL-NUMA
>                               4.18.0                 4.18.0                 4.18.0
>                              vanilla                 teo-v1        teo-v2+backport
> ----------------------------------------------------------------------------------
> Hmean     14        48.16 (   0.00%)       50.94 *   5.77%*       42.50 * -11.77%*
> Hmean     100      346.77 (   0.00%)      358.74 *   3.45%*      303.31 * -12.53%*
> Hmean     300     1018.06 (   0.00%)     1053.75 *   3.51%*      895.55 * -12.03%*
> Hmean     500     1693.07 (   0.00%)     1754.62 *   3.64%*     1489.61 * -12.02%*
> Hmean     850     2853.04 (   0.00%)     2948.73 *   3.35%*     2473.50 * -13.30%*

Well, in this case the consistent improvement in v1 turned into a consistent decline
in the v2, and over 10% for that matter.  Needs improvement IMO.

> DBENCH4
> =======
> NOTES: asynchronous IO; varies the number of clients up to NUMCPUS*8.
> MEASURES: latency (millisecs)
> LOWER is better
> 
> machine: 48x-HASWELL-NUMA
>                               4.18.0                 4.18.0                 4.18.0
>                              vanilla                 teo-v1        teo-v2+backport
> ----------------------------------------------------------------------------------
> Amean      1        37.15 (   0.00%)       50.10 ( -34.86%)       39.02 (  -5.03%)
> Amean      2        43.75 (   0.00%)       45.50 (  -4.01%)       44.36 (  -1.39%)
> Amean      4        54.42 (   0.00%)       58.85 (  -8.15%)       58.17 (  -6.89%)
> Amean      8        75.72 (   0.00%)       74.25 (   1.94%)       82.76 (  -9.30%)
> Amean      16      116.56 (   0.00%)      119.88 (  -2.85%)      164.14 ( -40.82%)
> Amean      32      570.02 (   0.00%)      561.92 (   1.42%)      681.94 ( -19.63%)
> Amean      64     3185.20 (   0.00%)     3291.80 (  -3.35%)     4337.43 ( -36.17%)

This one too.

> TBENCH4
> =======
> NOTES: networking counterpart of dbench. Varies the number of clients up to NUMCPUS*4
> MEASURES: Throughput, MB/sec
> HIGHER is better
> 
> machine: 8x-SKYLAKE-UMA
>                                     4.18.0                 4.18.0                 4.18.0
>                                    vanilla                 teo-v1        teo-v2+backport
> ----------------------------------------------------------------------------------------
> Hmean     mb/sec-1       620.52 (   0.00%)      613.98 *  -1.05%*      502.47 * -19.03%*
> Hmean     mb/sec-2      1179.05 (   0.00%)     1112.84 *  -5.62%*      820.57 * -30.40%*
> Hmean     mb/sec-4      2072.29 (   0.00%)     2040.55 *  -1.53%*     2036.11 *  -1.75%*
> Hmean     mb/sec-8      4238.96 (   0.00%)     4205.01 *  -0.80%*     4124.59 *  -2.70%*
> Hmean     mb/sec-16     3515.96 (   0.00%)     3536.23 *   0.58%*     3500.02 *  -0.45%*
> Hmean     mb/sec-32     3452.92 (   0.00%)     3448.94 *  -0.12%*     3428.08 *  -0.72%*
> 

And same here.

> [1] https://github.com/gormanm/mmtests
> 
> 
> Happy to answer any questions on the benchmarks or the methods used to
> collect/report data.
> 
> Something I'd like to do now is verify that "teo"'s predictions are better
> than "menu"'s; I'll probably use systemtap to make some histograms of idle
> times versus what idle state was chosen -- that'd be enough to compare the
> two.

You can use the cpu_idle trace point to correlate the selected state index
with the observed idle duration (that's what Doug did IIUC).
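
As a rough sketch of that correlation in Python, assuming the cpu_idle events
have already been parsed into (timestamp_us, cpu, state) tuples in time order
and that the idle-exit event carries state 4294967295 (PWR_EVENT_EXIT, that is
(u32)-1).  The function name and the tuple layout are made up for illustration:

```python
PWR_EVENT_EXIT = 0xFFFFFFFF  # assumed "state" value logged when a CPU leaves idle


def idle_intervals(events):
    """Pair idle entry and exit events per CPU.

    events: iterable of (timestamp_us, cpu, state) tuples in time order
            (hypothetical layout, produced by whatever parses the trace).
    Returns a list of (cpu, selected_state, observed_duration_us).
    """
    entered = {}  # cpu -> (timestamp_us, state) of the pending idle entry
    intervals = []
    for ts, cpu, state in events:
        if state == PWR_EVENT_EXIT:
            # Ignore an exit with no recorded entry (trace started mid-idle).
            if cpu in entered:
                t0, selected = entered.pop(cpu)
                intervals.append((cpu, selected, ts - t0))
        else:
            entered[cpu] = (ts, state)
    return intervals
```

Grouping the resulting durations by selected state gives the histogram data
directly.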

Then, if the observed idle duration is between the target residency of the
selected state and the target residency of the next one, the selected state
is adequate and that's what we care about really.

If the observed idle duration is below the target residency of the selected
state, the selected state is too deep, and if it is above (or equal to) the
target residency of the next state, it is too shallow.
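
The check described above can be sketched in Python roughly as follows.  The
function name and the residency values are hypothetical (real values come from
the idle driver), and the deepest state is assumed to have no "next" state, so
it can never be too shallow:

```python
def classify(state, duration_us, target_residency_us):
    """Classify one idle period against the selected state.

    state:               index of the selected idle state
    duration_us:         observed idle duration in microseconds
    target_residency_us: target residencies indexed by state, ascending
    """
    if duration_us < target_residency_us[state]:
        return "too deep"
    nxt = state + 1
    if nxt < len(target_residency_us) and duration_us >= target_residency_us[nxt]:
        return "too shallow"
    return "adequate"


# Made-up target residencies for four idle states, in microseconds.
RESIDENCY_US = [0, 2, 20, 200]

for state, duration in [(2, 10), (2, 20), (2, 200), (3, 1_000_000)]:
    print(state, duration, classify(state, duration, RESIDENCY_US))
```

Counting the three outcomes per governor over the same workload would be
enough to compare "teo" with "menu".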

> After that it would be nice to somehow know where timers came from; i.e. if
> I see that residencies in a given state are consistently shorter than
> they're supposed to be, it would be interesting to see who set the timer
> that causes the wakeup. But... I'm not sure how to do that :) Do
> you have a strategy to track down the origin of timers/interrupts? Is there
> any script you're using to evaluate teo that you can share?

I need to think about that TBH.

The information that we can get readily should give us quite a good idea of
what happens on average, though, so let's first do that and then try to dig
deeper if need be.

I think that the difference between the v1 and v2 of the TEO governor comes
mostly from the way in which they handle patterns of "early" wakeups.  The
method used in v1 is very crude (and arguably invalid in general) and it
will cause shallow states to be selected more often, while the v2 tries to
be more "intelligent", but it may be overly conservative with that.

I'm working on a v3 that will try to address the above ATM, but I'd like to run
it on my systems first (I'm going back home from a conference right now).

Cheers,
Rafael