Message-ID: <87lf0grx38.fsf@riseup.net>
Date: Sun, 19 Dec 2021 15:31:07 -0800
From: Francisco Jerez <currojerez@...eup.net>
To: Julia Lawall <julia.lawall@...ia.fr>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>,
Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
Len Brown <lenb@...nel.org>,
Viresh Kumar <viresh.kumar@...aro.org>,
Linux PM <linux-pm@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range
Julia Lawall <julia.lawall@...ia.fr> writes:
> On Sun, 19 Dec 2021, Francisco Jerez wrote:
>
>> Julia Lawall <julia.lawall@...ia.fr> writes:
>>
>> > On Sat, 18 Dec 2021, Francisco Jerez wrote:
>> >
>> >> Julia Lawall <julia.lawall@...ia.fr> writes:
>> >>
>> >> > On Sat, 18 Dec 2021, Francisco Jerez wrote:
>> >> >
>> >> >> Julia Lawall <julia.lawall@...ia.fr> writes:
>> >> >>
>> >> >> >> As you can see in intel_pstate.c, min_pstate is initialized on core
>> >> >> >> platforms from MSR_PLATFORM_INFO[47:40], which is "Maximum Efficiency
>> >> >> >> Ratio (R/O)". However that seems to deviate massively from the most
>> >> >> >> efficient ratio on your system, which may indicate a firmware bug, some
>> >> >> >> sort of clock gating problem, or an issue with the way that
>> >> >> >> intel_pstate.c processes this information.
>> >> >> >
>> >> >> > I'm not sure I understand the bug part. min_pstate gives the frequency
>> >> >> > that I find as the minimum frequency when I look for the specifications of
>> >> >> > the CPU. Should one expect that it should be something different?
>> >> >> >
>> >> >>
>> >> >> I'd expect the minimum frequency on your processor specification to
>> >> >> roughly match the "Maximum Efficiency Ratio (R/O)" value from that MSR,
>> >> >> since there's little reason to claim your processor can be clocked down
>> >> >> to a frequency which is inherently inefficient /and/ slower than the
>> >> >> maximum efficiency ratio. In fact they both seem to match on your
>> >> >> system; they're just nowhere close to the frequency which is actually
>> >> >> most efficient. That smells like a bug: either your processor is
>> >> >> misreporting what the most efficient frequency is, or the real one
>> >> >> deviates from the expected one because your CPU's static power
>> >> >> consumption is greater than it would be under ideal conditions. The
>> >> >> latter could be due to some sort of clock gating issue, possibly a
>> >> >> software bug, or to our scheduling of such workloads with a large
>> >> >> number of lightly loaded threads being unnecessarily inefficient,
>> >> >> which could also be preventing most of your CPU cores from ever being
>> >> >> clock-gated even though your processor may be sitting idle for a large
>> >> >> fraction of its runtime.
>> >> >
>> >> > The original mail has results from two different machines: Intel 6130
>> >> > (skylake) and Intel 5218 (cascade lake). I have access to another cluster
>> >> > of 6130s and 5218s. I can try them.
>> >> >
>> >> > I tried 5.9 in which I just commented out the schedutil code to make
>> >> > frequency requests. I only tested avrora (tiny pauses) and h2 (longer
>> >> > pauses) and in both cases the execution is almost entirely in the turbo
>> >> > frequencies.
>> >> >
>> >> > I'm not sure I understand the term "clock-gated". What C state does that
>> >> > correspond to? The turbostat output for one run of avrora is below.
>> >> >
>> >>
>> >> I didn't have any specific C1+ state in mind; most of the deeper ones
>> >> implement some sort of clock gating among other optimizations, I was
>> >> just wondering whether some sort of software bug and/or the highly
>> >> intermittent CPU utilization pattern of these workloads are preventing
>> >> most of your CPU cores from entering deep sleep states. See below.
>> >>
>> >> > julia
>> >> >
>> >> > 78.062895 sec
>> >> > Package Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C6 POLL% C1% C1E% C6% CPU%c1 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc6 Pkg_J RAM_J PKG_% RAM_%
>> >> > - - - 31 2.95 1065 2096 156134 0 1971 155458 2956270 657130 0.00 0.20 4.78 92.26 14.75 82.31 40 41 45.14 0.04 4747.52 2509.05 0.00 0.00
>> >> > 0 0 0 13 1.15 1132 2095 11360 0 0 2 39 19209 0.00 0.00 0.01 99.01 8.02 90.83 39 41 90.24 0.04 2266.04 1346.09 0.00 0.00
>> >>
>> >> This seems suspicious: Pkg%pc6 (0.04) and Pkg_J (2266.04) on the
>> >> package #0 row above.
>> >>
>> >> I hadn't understood that you're running this on a dual-socket system
>> >> until I looked at these results.
>> >
>> > Sorry not to have mentioned that.
>> >
>> >> It seems like package #0 is doing
>> >> pretty much nothing according to the stats below, but it's still
>> >> consuming nearly half of your energy, apparently because the idle
>> >> package #0 isn't entering deep sleep states (Pkg%pc6 above is close to
>> >> 0%). That could explain your unexpectedly high static power consumption
>> >> and the deviation of the real maximum efficiency frequency from the one
>> >> reported by your processor, since the reported maximum efficiency ratio
>> >> cannot possibly take into account the existence of a second CPU package
>> >> with dysfunctional idle management.
>> >
>> > Our assumption was that if anything happens on any core, all of the
>> > packages remain in a state that allows them to react in a reasonable
>> > amount of time to any memory request.
>>
>> I can see how that might be helpful for workloads that need to be able
>> to unleash the whole processing power of your multi-socket system with
>> minimal latency, but the majority of multi-socket systems out there with
>> completely idle CPU packages are unlikely to notice any performance
>> difference as long as their idle CPU packages are idle, so the
>> environmentalist in me tells me that this is a bad idea. ;)
>
> Certainly it sounds like a bad idea from the point of view of anyone who
> wants to save energy, but it's how the machine seems to work (at least in
> its current configuration, which is not entirely under my control).
>
Yes, that seems to be how it works right now, but honestly it seems like
an idle management bug to me.

> Note also that of the benchmarks, only avrora has the property of often
> using only one of the sockets. The others let their threads drift around
> more.
>
>>
>> >
>> >> I'm guessing that if you fully disable one of your CPU packages and
>> >> repeat the previous experiment forcing various P-states between 10 and
>> >> 37 you should get a maximum efficiency ratio closer to the theoretical
>> >> one for this CPU?
>> >
>> > OK, but that's not really a natural usage context... I do have a
>> > one-socket Intel 5220. I'll see what happens there.
>> >
>>
>> Fair, I didn't intend to suggest you take it offline manually every time
>> you don't plan to use it, my suggestion was just intended as an
>> experiment to help us confirm or disprove the theory that the reason for
>> the deviation from reality of your reported maximum efficiency ratio is
>> the presence of that second CPU package with broken idle management. If
>> that's the case the P-state vs. energy usage plot should show a minimum
>> closer to the ideal maximum efficiency ratio after disabling the second
>> CPU package.
>
> More numbers are attached. Pages 1-3 have two socket machines. Page 4
> has a one socket machine. The values for p state 20 are highlighted.
> For avrora (the one-socket application) on page 2, 20 is not the pstate
> with the lowest CPU energy consumption. 35 and 37 do better. Also for
> xalan on page 4 (one-socket machine) 15 does slightly better than 20.
> Otherwise, 20 always seems to be the best.
>
Your results suggest that the presence of a second CPU package cannot be
the only factor leading to this deviation. However, it's hard to tell
how much influence it has, since your single- and dual-socket samples
come from machines with different CPUs. It's therefore unclear whether
moving to a single CPU shifted the maximum efficiency frequency at all,
and if it did, the shift may have been smaller than the ~5 P-state
granularity of your samples.

Either way it seems like we're greatly underestimating the maximum
efficiency frequency even on your single-socket system. The reason may
still be suboptimal idle management. I hope it is, since the
alternative, that your processor is lying about its maximum efficiency
ratio, seems far more difficult to deal with through some generic
software change...

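To illustrate why extra static power would push the most efficient
frequency upwards, here is a rough toy model in Python. Everything in it
is an assumption made for illustration, not something measured in this
thread: runtime is taken to scale as 1/freq, dynamic power as
k * freq**3, and the function names, units and static power values are
hypothetical.

```python
def energy_per_work(freq, p_static, k=1.0):
    """Toy model: energy to finish one unit of work at a fixed frequency.

    Assumptions (not from the thread): runtime scales as 1/freq and
    dynamic power scales as k * freq**3. Units are arbitrary.
    """
    if freq <= 0:
        raise ValueError("freq must be positive")
    runtime = 1.0 / freq
    return (p_static + k * freq ** 3) * runtime


def most_efficient_freq(p_static, k=1.0):
    """Frequency minimizing energy_per_work under the toy model.

    Setting d/df [(p_static + k*f**3) / f] = 0 gives
    f**3 = p_static / (2*k).
    """
    return (p_static / (2.0 * k)) ** (1.0 / 3.0)


# Hypothetical static power levels: well-behaved idle management vs. a
# system whose idle package never reaches deep sleep states.
f_good_idle = most_efficient_freq(p_static=2.0)
f_bad_idle = most_efficient_freq(p_static=16.0)
```

In this model the minimum of the energy curve moves to a higher
frequency as p_static grows, which is the qualitative effect I'd expect
from broken idle management. The real power curve of these CPUs will
of course look different.
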
>> > I did some experiments with forcing different frequencies. I haven't
>> > finished processing the results, but I notice that as the frequency goes
>> > up, the utilization (specifically the value of
>> > map_util_perf(sg_cpu->util) at the point of the call to
>> > cpufreq_driver_adjust_perf in sugov_update_single_perf) goes up as well.
>> > Is this expected?
>> >
>>
>> Actually, it *is* expected based on our previous hypothesis that these
>> workloads are largely latency-bound: In cases where a given burst of CPU
>> work is not parallelizable with any other tasks the thread needs to
>> complete subsequently, its overall runtime will decrease monotonically
>> with increasing frequency, therefore the number of instructions executed
>> per unit of time will increase monotonically with increasing frequency,
>> and with it its frequency-invariant utilization.
>
> I'm not sure. If you have two tasks, each one alternately waiting for the
> other, if the frequency doubles, they will each run faster and wait less,
> but as long as one is computing the utilization in a small interval, ie
> before the application ends, the utilization will always be 50%.

Not really, because we're talking about frequency-invariant utilization
rather than just the CPU's duty cycle (which may indeed remain at 50%
regardless). If the frequency doubles and the thread is still active
50% of the time, its frequency-invariant utilization will also double,
since the thread would be utilizing twice as many computational
resources per unit of time as before. As you can see in the definition
in [1], the frequency-invariant utilization is the duty cycle scaled by
the ratio of the CPU's current frequency to its maximum frequency.

[1] Documentation/scheduler/sched-capacity.rst

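
To put numbers on the two-task example, here is a minimal Python sketch
of that scaling: duty cycle multiplied by current frequency over maximum
frequency. The helper name and the frequencies are made up for
illustration, and the kernel's actual utilization signal is a decaying
average rather than this instantaneous product.

```python
def freq_invariant_util(duty_cycle, curr_freq, max_freq):
    """Duty cycle scaled by curr_freq / max_freq (hypothetical helper).

    Follows the shape of the frequency-invariance formula in
    Documentation/scheduler/sched-capacity.rst; this is not kernel code.
    """
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be within [0, 1]")
    if max_freq <= 0 or not 0 <= curr_freq <= max_freq:
        raise ValueError("need max_freq > 0 and 0 <= curr_freq <= max_freq")
    return duty_cycle * curr_freq / max_freq


# Illustrative numbers only: a thread that is runnable 50% of the time
# on a CPU with an assumed 4000 MHz maximum frequency.
low = freq_invariant_util(0.5, 1000, 4000)   # running at 1000 MHz
high = freq_invariant_util(0.5, 2000, 4000)  # running at 2000 MHz
```

With the duty cycle held at 50%, doubling the running frequency from
1000 to 2000 MHz doubles the result, from 0.125 to 0.25.
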
> The applications, however, are probably not as simple as this.
>
> julia
>
>> > thanks,
>> > julia
>> >
>> >> > 0 0 32 1 0.09 1001 2095 37 0 0 0 0 42 0.00 0.00 0.00 100.00 9.08
>> >> > 0 1 4 0 0.04 1000 2095 57 0 0 0 1 133 0.00 0.00 0.00 99.96 0.08 99.88 38
>> >> > 0 1 36 0 0.00 1000 2095 35 0 0 0 0 40 0.00 0.00 0.00 100.00 0.12
>> >> > 0 2 8 0 0.03 1000 2095 64 0 0 0 1 124 0.00 0.00 0.00 99.97 0.08 99.89 38
>> >> > 0 2 40 0 0.00 1000 2095 36 0 0 0 0 40 0.00 0.00 0.00 100.00 0.10
>> >> > 0 3 12 0 0.00 1000 2095 42 0 0 0 0 71 0.00 0.00 0.00 100.00 0.14 99.86 38
>> >> > 0 3 44 1 0.09 1000 2095 63 0 0 0 0 65 0.00 0.00 0.00 99.91 0.05
>> >> > 0 4 14 0 0.00 1010 2095 38 0 0 0 1 41 0.00 0.00 0.00 100.00 0.04 99.96 39
>> >> > 0 4 46 0 0.00 1011 2095 36 0 0 0 1 41 0.00 0.00 0.00 100.00 0.04
>> >> > 0 5 10 0 0.01 1084 2095 39 0 0 0 0 58 0.00 0.00 0.00 99.99 0.04 99.95 38
>> >> > 0 5 42 0 0.00 1114 2095 35 0 0 0 0 39 0.00 0.00 0.00 100.00 0.05
>> >> > 0 6 6 0 0.03 1005 2095 89 0 0 0 1 116 0.00 0.00 0.00 99.97 0.07 99.90 39
>> >> > 0 6 38 0 0.00 1000 2095 38 0 0 0 0 41 0.00 0.00 0.00 100.00 0.10
>> >> > 0 7 2 0 0.05 1001 2095 59 0 0 0 1 133 0.00 0.00 0.00 99.95 0.09 99.86 40
>> >> > 0 7 34 0 0.00 1000 2095 39 0 0 0 0 65 0.00 0.00 0.00 100.00 0.13
>> >> > 0 8 16 0 0.00 1000 2095 43 0 0 0 0 47 0.00 0.00 0.00 100.00 0.04 99.96 38
>> >> > 0 8 48 0 0.00 1000 2095 37 0 0 0 0 41 0.00 0.00 0.00 100.00 0.04
>> >> > 0 9 20 0 0.00 1000 2095 33 0 0 0 0 37 0.00 0.00 0.00 100.00 0.03 99.97 38
>> >> > 0 9 52 0 0.00 1000 2095 33 0 0 0 0 36 0.00 0.00 0.00 100.00 0.03
>> >> > 0 10 24 0 0.00 1000 2095 36 0 0 0 1 40 0.00 0.00 0.00 100.00 0.03 99.96 39
>> >> > 0 10 56 0 0.00 1000 2095 37 0 0 0 1 38 0.00 0.00 0.00 100.00 0.03
>> >> > 0 11 28 0 0.00 1002 2095 35 0 0 0 1 37 0.00 0.00 0.00 100.00 0.03 99.97 38
>> >> > 0 11 60 0 0.00 1004 2095 34 0 0 0 0 36 0.00 0.00 0.00 100.00 0.03
>> >> > 0 12 30 0 0.00 1001 2095 35 0 0 0 0 40 0.00 0.00 0.00 100.00 0.11 99.88 38
>> >> > 0 12 62 0 0.01 1000 2095 197 0 0 0 0 197 0.00 0.00 0.00 99.99 0.10
>> >> > 0 13 26 0 0.00 1000 2095 37 0 0 0 0 41 0.00 0.00 0.00 100.00 0.03 99.97 39
>> >> > 0 13 58 0 0.00 1000 2095 38 0 0 0 0 40 0.00 0.00 0.00 100.00 0.03
>> >> > 0 14 22 0 0.01 1000 2095 149 0 1 2 0 142 0.00 0.01 0.00 99.99 0.07 99.92 39
>> >> > 0 14 54 0 0.00 1000 2095 35 0 0 0 0 38 0.00 0.00 0.00 100.00 0.07
>> >> > 0 15 18 0 0.00 1000 2095 33 0 0 0 0 36 0.00 0.00 0.00 100.00 0.03 99.97 39
>> >> > 0 15 50 0 0.00 1000 2095 34 0 0 0 0 38 0.00 0.00 0.00 100.00 0.03
>> >> > 1 0 1 32 3.23 1008 2095 2385 0 31 3190 45025 10144 0.00 0.28 4.68 91.99 11.21 85.56 32 35 0.04 0.04 2481.49 1162.96 0.00 0.00
>> >> > 1 0 33 9 0.63 1404 2095 12206 0 5 162 2480 10283 0.00 0.04 0.75 98.64 13.81
>> >> > 1 1 5 1 0.07 1384 2095 236 0 0 38 24 314 0.00 0.09 0.06 99.77 4.66 95.27 33
>> >> > 1 1 37 81 3.93 2060 2095 1254 0 5 40 59 683 0.00 0.01 0.02 96.05 0.80
>> >> > 1 2 9 37 3.46 1067 2095 2396 0 29 2256 55406 11731 0.00 0.17 6.02 90.54 54.10 42.44 31
>> >> > 1 2 41 151 14.51 1042 2095 10447 0 135 10494 248077 42327 0.01 0.87 26.57 58.84 43.05
>> >> > 1 3 13 110 10.47 1053 2095 7120 0 120 9218 168938 33884 0.01 0.77 16.63 72.68 42.58 46.95 32
>> >> > 1 3 45 69 6.76 1021 2095 4730 0 66 5598 115410 23447 0.00 0.44 12.06 81.12 46.29
>> >> > 1 4 15 112 10.64 1056 2095 7204 0 116 8831 171423 37754 0.01 0.70 17.56 71.67 28.01 61.35 33
>> >> > 1 4 47 18 1.80 1006 2095 1771 0 13 915 29315 6564 0.00 0.07 3.20 95.03 36.85
>> >> > 1 5 11 63 5.96 1065 2095 4090 0 58 6449 99015 18955 0.00 0.45 10.27 83.64 31.24 62.80 31
>> >> > 1 5 43 72 7.11 1016 2095 4794 0 73 6203 115361 26494 0.00 0.48 11.79 81.02 30.09
>> >> > 1 6 7 35 3.39 1022 2095 2328 0 45 3377 52721 13759 0.00 0.27 5.10 91.43 25.84 70.77 32
>> >> > 1 6 39 67 6.09 1096 2095 4483 0 52 3696 94964 19366 0.00 0.30 10.32 83.61 23.14
>> >> > 1 7 3 1 0.06 1395 2095 91 0 0 0 1 167 0.00 0.00 0.00 99.95 25.36 74.58 35
>> >> > 1 7 35 83 8.16 1024 2095 5785 0 100 7398 134640 27428 0.00 0.56 13.39 78.34 17.26
>> >> > 1 8 17 46 4.49 1016 2095 3229 0 52 3048 74914 16010 0.00 0.27 8.29 87.19 29.71 65.80 33
>> >> > 1 8 49 64 6.12 1052 2095 4210 0 89 5782 100570 21463 0.00 0.42 10.63 83.17 28.08
>> >> > 1 9 21 73 7.02 1036 2095 4917 0 64 5786 109887 21939 0.00 0.55 11.61 81.18 22.10 70.88 33
>> >> > 1 9 53 64 6.33 1012 2095 4074 0 69 5957 97596 20580 0.00 0.51 9.78 83.74 22.79
>> >> > 1 10 25 26 2.58 1013 2095 1825 0 22 2124 42630 8627 0.00 0.17 4.17 93.24 53.91 43.52 33
>> >> > 1 10 57 159 15.59 1022 2095 10951 0 175 14237 256828 56810 0.01 1.10 26.00 58.16 40.89
>> >> > 1 11 29 112 10.54 1065 2095 7462 0 126 9548 179206 39821 0.01 0.85 18.49 70.71 29.46 60.00 31
>> >> > 1 11 61 29 2.89 1011 2095 2002 0 24 2468 45558 10288 0.00 0.20 4.71 92.36 37.11
>> >> > 1 12 31 37 3.66 1011 2095 2596 0 79 3161 61027 13292 0.00 0.24 6.48 89.79 23.75 72.59 32
>> >> > 1 12 63 56 5.08 1107 2095 3789 0 62 4777 79133 17089 0.00 0.41 7.91 86.86 22.31
>> >> > 1 13 27 12 1.14 1045 2095 1477 0 16 888 18744 3250 0.00 0.06 2.18 96.70 21.23 77.64 32
>> >> > 1 13 59 60 5.81 1038 2095 5230 0 60 4936 87225 21402 0.00 0.41 8.95 85.14 16.55
>> >> > 1 14 23 28 2.75 1024 2095 2008 0 20 1839 47417 9177 0.00 0.13 5.08 92.21 34.18 63.07 32
>> >> > 1 14 55 106 9.58 1105 2095 6292 0 89 7182 141379 31354 0.00 0.63 14.45 75.81 27.36
>> >> > 1 15 19 118 11.65 1012 2095 7872 0 121 10014 193186 40448 0.01 0.80 19.53 68.68 37.53 50.82 32
>> >> > 1 15 51 59 5.58 1059 2095 3967 0 54 5842 88063 21138 0.00 0.39 9.12 85.23 43.60
>> >>
>>