linux-kernel - RE: [PATCH 1/2] x86,sched: Add support for frequency invariance

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <001a01d56ef8$7abb07c0$70311740$@net>
Date:   Thu, 19 Sep 2019 07:42:29 -0700
From:   "Doug Smythies" <dsmythies@...us.net>
To:     "'Giovanni Gherdovich'" <ggherdovich@...e.cz>
Cc:     <x86@...nel.org>, <linux-pm@...r.kernel.org>,
        <linux-kernel@...r.kernel.org>, <mgorman@...hsingularity.net>,
        <matt@...eblueprint.co.uk>, <viresh.kumar@...aro.org>,
        <juri.lelli@...hat.com>, <pjt@...gle.com>,
        <vincent.guittot@...aro.org>, <qperret@...rret.net>,
        <dietmar.eggemann@....com>, <srinivas.pandruvada@...ux.intel.com>,
        <tglx@...utronix.de>, <mingo@...hat.com>, <peterz@...radead.org>,
        <bp@...e.de>, <lenb@...nel.org>, <rjw@...ysocki.net>
Subject: RE: [PATCH 1/2] x86,sched: Add support for frequency invariance

Hi Giovanni,

Thank you for your detailed reply.

On 2019.09.17 07:25 Giovanni Gherdovich wrote:
>On Wed, 2019-09-11 at 08:28 -0700, Doug Smythies wrote:
> [...]

>> The problem with the test is its run to run variability, which was from
>> all the disk I/O, as far as I could determine. At the time,
>> I studied this to death [2], and made a more repeatable test, without
>> any disk I/O.
>> 
>> While the challenges with this work flow have tended to be focused
>> on the CPU frequency scaling driver, I have always considered
>> the root issue here to be a scheduling issue. Excerpt from my notes
>> [2]:
>> 
>>> The issue is that performance is much much better if the system is
>>> forced to use only 1 CPU rather than relying on the defaults where
>>> the CPU scheduler decides what to do.
>>> The scheduler seems to not realize that the current CPU has just
>>> become free, and assigns the new task to a new CPU. Thus the load
>>> on any one CPU is so low that it doesn't ramp up the CPU frequency.
>>> It would be better if somehow the scheduler knew that the current
>>> active CPU was now able to take on the new task, overall resulting
>>> on one fully loaded CPU at the highest CPU frequency.
>> 
>> I do not know if such is practical, and I didn't re-visit the issue.
>>
>
> You're absolutely right, pinning a serialized, fork-intensive workload such as
> gitsource gives you as good of a performance as you can get, because it removes
> the scheduler out of the picture.
>
> So one might be tempted to flag this test as non-representative of a
> real-world scenario;

Disagree. I consider this test to be very representative of real-world
scenarios. However, and I do not know for certain, the relatively high
average fork rate of the gitsource "make test" is less common.

> the reasons we keep looking at it are:
> 1. pinning may not always practical, as you mention
> 2. it's an adversary, worst-case sort of test for some scheduler code paths

Agree.

>> For reference against which all other results are compared
>> is the forced CPU affinity test run. i.e.:
>> 
>> taskset -c 3 test_script.
>> 
>> Mode          Governor                degradation     Power           Bzy_MHz
>> Reference     perf 1 CPU              1.00            reference       3798
>> -             performance             1.2             6% worse        3618
>> passive       ondemand                2.3
>> active        powersave               2.6
>> passive       schedutil               2.7                             1600
>> passive       schedutil-4C            1.68                            2515
>> 
>> Where degradation ratio is the time to execute / the reference time for
>> the same conditions. The test runs over a wide range of processes per
>> second, and the worst ratio has been selected for the above table.
>> I have yet to write up this experiment, but the graphs that will
>> eventually be used are at [4] and [5] (same data presented two
>> different ways).
>
> Your table is interesting; I'd say that the one to beat there (from the
> schedutil point of view) is intel_pstate(active)/performance. I'm slightly
> surprised that intel_pstate(passive)/ondemand is worse than
> intel_pstate(active)/powersave, I'd have guessed the other way around but it's
> also true that the latter lost some grip on iowait_boost in of the recent
> dev cycles.

??
intel_pstate(passive)/ondemand is better than intel_pstate(active)/powersave,
not worse, over the entire range of PIDs (forks) per second and by quite a lot.

>> I did the "make test" method and, presenting the numbers your way,
>> got that 4C took 0.69 times as long as the unpatched schedutil.
>> Your numbers were same or better (copied below, lower is better):
>> 80x-BROADWELL-NUMA:   0.49
>> 8x-SKYLAKE-UMA:               0.55
>> 48x-HASWELL-NUMA:             0.69

> I think your 0.69 and my three values tell the same story: schedutil really
> needs to use the frequency invariant formula otherwise it's out of the
> race. Enabling scale-invariance gives multple tens of percent point in
> advantage.

Agreed. This frequency invariant addition is great. However, if
schedutil is "out of the race" without it, as you say, then isn't
intel_pstate(passive)/ondemand out of the race also? It performs
just as poorly for this test, until very low PIDs per second.

> Now, is it 0.69 or 0.49? There are many factors to it; that's why I'm happy I
> can test on multiple machines and get a somehow more varied picture.
>
> Also, didn't you mention you made several runs and selected the worst one for
> the final score? I was less adventurous and took the average of 5 runs for my
> gitsource executions :) that might contribute to a slightly higher final mark.

No, I did the exact same as you for the gitsource "make test" method, except
that I do 6 runs and throw out the first one and average the next 5.

Yes, I said I picked the worse ratio, but that was for my version of this test,
with the disk I/O and its related non-repeatability eliminated, only to provide
something for readers that did not want to go to my web site to look at the
related graph [1]. I'll send you the graph in a separate e-mail, in case you didn't
go to the web site.

>>>> 
>>>> Compare it to the update formula of intel_pstate/powersave:
>>> 
>>>    freq_next = 1.25 * freq_max * Busy%
>>> 
>>> where again freq_max is 1C turbo and Busy% is the percentage of time not spent
>>> idling (calculated with delta_MPERF / delta_TSC);
>> 
>> Note that the delta_MPERF / delta_TSC method includes idle state 0 and the old
>> method of utilization does not (at least not last time I investigated, which was
>> awhile ago (and I can not find my notes)).
>
> I think that depends on whether or not TSC stops at idle. As understand from
> the Intel Software Developer manual (SDM) a TSC that stops at idle is called
> "invariant TSC", and makes delta_MPERF / delta_TSC interesting. Otherwise the
> two counters behaves exactly the same and the ratio is always 1, modulo the
> delays in actually reading the two values. But all I know comes from
> turbostat's man page and the SDM, so don't quote me on that :)

I was only talking about idle state 0 (polling), where TSC does not stop.

By the way, I have now done some tests with this patch set and multi-threaded
stuff. Nothing to report, it all looks great.

[1] http://www.smythies.com/~doug/linux/single-threaded/gg-pidps2.png

... Doug