Message-ID: <000e01d568b5$87de9be0$979bd3a0$@net>
Date: Wed, 11 Sep 2019 08:28:07 -0700
From: "Doug Smythies" <dsmythies@...us.net>
To: "'Giovanni Gherdovich'" <ggherdovich@...e.cz>
Cc: <x86@...nel.org>, <linux-pm@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <mgorman@...hsingularity.net>,
<matt@...eblueprint.co.uk>, <viresh.kumar@...aro.org>,
<juri.lelli@...hat.com>, <pjt@...gle.com>,
<vincent.guittot@...aro.org>, <qperret@...rret.net>,
<dietmar.eggemann@....com>, <srinivas.pandruvada@...ux.intel.com>,
<tglx@...utronix.de>, <mingo@...hat.com>, <peterz@...radead.org>,
<bp@...e.de>, <lenb@...nel.org>, <rjw@...ysocki.net>,
"Doug Smythies" <dsmythies@...us.net>
Subject: RE: [PATCH 1/2] x86,sched: Add support for frequency invariance
Hi Giovanni,
Thank you for the great detail and test results you provided.
On 2019.09.08.07:42 Giovanni Gherdovich wrote:
... [snip]...
> The test we call "gitsource" (running the git unit test suite, a long-running
> single-threaded shell script) appears rather spectacular in this table (gains
> of 30-50% depending on the machine). It is to be noted, however, that
> gitsource has no adjustable parameters (such as the number of jobs in
> kernbench, which we average over in order to get a single-number summary
> score) and is exactly the kind of low-parallelism workload that benefits the
> most from this patch. When looking at the detailed tables of kernbench or
> tbench4, at low process or client counts one can see similar numbers.
I think the "gitsource" test, is the one I learned about here two years
ago, [1]. It is an extremely good (best I know of) example of single
threaded, high PID consumption (about 400 / second average, my computer
[3]), performance issues on a multi CPU computer. I.E., this:
Dountil the list of tasks is finished:
Start the next task in the list of stuff to do.
Wait for it to finish
Enduntil
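In shell terms, that pattern is roughly this (just a sketch; the
task_list file and the tasks in it stand in for whatever the real
script runs):

  #!/bin/sh
  # Sketch of the serial workflow above: start each short-lived task,
  # then wait for it before starting the next one.
  while read -r task; do
          $task &      # start the next task in the list
          wait         # wait for it to finish
  done < task_list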
The problem with the test is its run-to-run variability, which, as far
as I could determine, came from all the disk I/O. At the time,
I studied this to death [2], and made a more repeatable test without
any disk I/O.
While the challenges with this workflow have tended to be focused
on the CPU frequency scaling driver, I have always considered
the root issue here to be a scheduling issue. Excerpt from my notes
[2]:
> The issue is that performance is much much better if the system is
> forced to use only 1 CPU rather than relying on the defaults where
> the CPU scheduler decides what to do.
> The scheduler seems to not realize that the current CPU has just
> become free, and assigns the new task to a new CPU. Thus the load
> on any one CPU is so low that it doesn't ramp up the CPU frequency.
> It would be better if somehow the scheduler knew that the current
> active CPU was now able to take on the new task, overall resulting
> in one fully loaded CPU at the highest CPU frequency.
I do not know if such is practical, and I didn't re-visit the issue.
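For what it is worth, the migrations described above can be counted
directly. A sketch, assuming perf is available and using the same
test_script as the reference run below:

  perf stat -e cpu-migrations,context-switches -- test_script
  perf stat -e cpu-migrations,context-switches -- taskset -c 3 test_script

The pinned run should show essentially no cpu-migrations, while the
default run should show many.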
Anyway, these are my results:
Kernel: 5.3-rc8, and 5.3-rc8 + these patches
Processor: i7-2600K
This is important, at least for the performance governor numbers:
cpu6: MSR_TURBO_RATIO_LIMIT: 0x23242526
35 * 100.0 = 3500.0 MHz max turbo 4 active cores
36 * 100.0 = 3600.0 MHz max turbo 3 active cores
37 * 100.0 = 3700.0 MHz max turbo 2 active cores
38 * 100.0 = 3800.0 MHz max turbo 1 active cores
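The per-active-core limits come from the individual bytes of that MSR
value. A sketch of the decode (assuming, as the turbostat output above
suggests, that byte N-1 holds the max turbo ratio for N active cores,
in units of the 100 MHz bus clock):

  msr=$((0x23242526))   # MSR_TURBO_RATIO_LIMIT value from above
  for cores in 1 2 3 4; do
          ratio=$(( (msr >> ((cores - 1) * 8)) & 0xff ))
          printf '%d * 100.0 = %d.0 MHz max turbo %d active cores\n' \
                  "$ratio" "$((ratio * 100))" "$cores"
  done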
The reference, against which all other results are compared,
is the forced CPU affinity test run, i.e.:
taskset -c 3 test_script
Mode       Governor       degradation   Power       Bzy_MHz
Reference  perf 1 CPU     1.00          reference   3798
-          performance    1.2           6% worse    3618
passive    ondemand       2.3
active     powersave      2.6
passive    schedutil      2.7                       1600
passive    schedutil-4C   1.68                      2515
Where the degradation ratio is the time to execute divided by the
reference time under the same conditions. The test runs over a wide
range of processes per second, and the worst ratio over that range has
been selected for the above table.
I have yet to write up this experiment, but the graphs that will
eventually be used are at [4] and [5] (same data presented two
different ways).
The energy for the performance governor cases is worth more detail, as
it is being wasted by CPUs waking up and going to sleep, which can be
observed in the IRQ column of turbostat output:
$ sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 60
Busy% Bzy_MHz IRQ PkgTmp PkgWatt GFXWatt
12.52 3798 81407 49 22.17 0.12 <<< Forced to CPU 3 only
12.52 3798 81139 51 22.18 0.12
12.52 3798 81036 51 22.20 0.12
11.43 3704 267644 48 21.16 0.12 <<< Change over
12.56 3618 490994 48 23.43 0.12 <<< Let the scheduler decide
12.56 3620 491336 47 23.50 0.12
12.56 3619 491607 47 23.50 0.12
12.56 3619 491512 48 23.52 0.12
12.56 3619 490806 47 23.51 0.12
12.56 3618 491356 49 23.48 0.12
12.56 3618 491035 48 23.51 0.12
12.56 3618 491121 48 23.46 0.12
Note also the busy megahertz (Bzy_MHz) column: because other cores are
constantly waking and sleeping as the work rotates through the CPUs,
more than one core is often active, which limits the maximum turbo
frequency (see the MSR_TURBO_RATIO_LIMIT decode above).
... Doug
[1] https://marc.info/?l=linux-kernel&m=149181369622980&w=2
[2] http://www.smythies.com/~doug/linux/single-threaded/index.html
[3] http://www.smythies.com/~doug/linux/single-threaded/pids_per_second2.png
[4] http://www.smythies.com/~doug/linux/single-threaded/gg-pidps.png
[5] http://www.smythies.com/~doug/linux/single-threaded/gg-loops.png