Message-ID: <alpine.DEB.2.22.394.2112192312520.3181@hadrien>
Date: Sun, 19 Dec 2021 23:41:07 +0100 (CET)
From: Julia Lawall <julia.lawall@...ia.fr>
To: Francisco Jerez <currojerez@...eup.net>
cc: "Rafael J. Wysocki" <rafael@...nel.org>,
Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
Len Brown <lenb@...nel.org>,
Viresh Kumar <viresh.kumar@...aro.org>,
Linux PM <linux-pm@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range
On Sun, 19 Dec 2021, Francisco Jerez wrote:
> Julia Lawall <julia.lawall@...ia.fr> writes:
>
> > On Sat, 18 Dec 2021, Francisco Jerez wrote:
> >
> >> Julia Lawall <julia.lawall@...ia.fr> writes:
> >>
> >> > On Sat, 18 Dec 2021, Francisco Jerez wrote:
> >> >
> >> >> Julia Lawall <julia.lawall@...ia.fr> writes:
> >> >>
> >> >> >> As you can see in intel_pstate.c, min_pstate is initialized on core
> >> >> >> platforms from MSR_PLATFORM_INFO[47:40], which is "Maximum Efficiency
> >> >> >> Ratio (R/O)". However that seems to deviate massively from the most
> >> >> >> efficient ratio on your system, which may indicate a firmware bug, some
> >> >> >> sort of clock gating problem, or an issue with the way that
> >> >> >> intel_pstate.c processes this information.
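For reference, a minimal userspace sketch of reading that field, assuming
the msr module is loaded and /dev/cpu/0/msr is readable as root; on these
parts the ratio is multiplied by the 100 MHz bus clock to get a frequency:

/* Read MSR_PLATFORM_INFO (0xce): bits 15:8 are the maximum non-turbo
 * ratio, bits 47:40 the "Maximum Efficiency Ratio" that intel_pstate
 * uses for min_pstate. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        uint64_t val;
        int fd = open("/dev/cpu/0/msr", O_RDONLY);

        if (fd < 0 || pread(fd, &val, sizeof(val), 0xce) != sizeof(val)) {
                perror("MSR_PLATFORM_INFO");
                return 1;
        }
        printf("max non-turbo ratio:   %llu\n",
               (unsigned long long)((val >> 8) & 0xff));
        printf("max efficiency ratio:  %llu\n",
               (unsigned long long)((val >> 40) & 0xff));
        close(fd);
        return 0;
}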
> >> >> >
> >> >> > I'm not sure I understand the bug part. min_pstate corresponds to the
> >> >> > minimum frequency that I find when I look up the specifications of
> >> >> > the CPU. Should one expect it to be something different?
> >> >> >
> >> >>
> >> >> I'd expect the minimum frequency on your processor specification to
> >> >> roughly match the "Maximum Efficiency Ratio (R/O)" value from that MSR,
> >> >> since there's little reason to claim your processor can be clocked down
> >> >> to a frequency which is inherently inefficient /and/ slower than the
> >> >> maximum efficiency ratio. In fact the two do seem to match on your
> >> >> system; they're just nowhere near the frequency which is actually
> >> >> most efficient. That smells like a bug: either your processor is
> >> >> misreporting what the most efficient frequency is, or the real
> >> >> value deviates from the expected one because your CPU's static
> >> >> power consumption is greater than it would be under ideal
> >> >> conditions -- e.g. due to some sort of clock gating issue, possibly
> >> >> caused by a software bug, or because our scheduling of such
> >> >> workloads across a large number of lightly loaded threads is
> >> >> unnecessarily inefficient, which could also be preventing most of
> >> >> your CPU cores from ever being clock-gated even though they may be
> >> >> sitting idle for a large fraction of their runtime.
> >> >
> >> > The original mail has results from two different machines: Intel 6130
> >> > (skylake) and Intel 5218 (cascade lake). I have access to another cluster
> >> > of 6130s and 5218s. I can try them.
> >> >
> >> > I tried 5.9, in which I just commented out the schedutil code that makes
> >> > frequency requests. I only tested avrora (tiny pauses) and h2 (longer
> >> > pauses), and in both cases the execution is almost entirely in the turbo
> >> > frequencies.
> >> >
> >> > I'm not sure I understand the term "clock-gated". What C-state does that
> >> > correspond to? The turbostat output for one run of avrora is below.
> >> >
> >>
> >> I didn't have any specific C1+ state in mind; most of the deeper ones
> >> implement some sort of clock gating among other optimizations. I was
> >> just wondering whether some sort of software bug and/or the highly
> >> intermittent CPU utilization pattern of these workloads is preventing
> >> most of your CPU cores from entering deep sleep states. See below.
> >>
> >> > julia
> >> >
> >> > 78.062895 sec
> >> > Package Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C6 POLL% C1% C1E% C6% CPU%c1 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc6 Pkg_J RAM_J PKG_% RAM_%
> >> > - - - 31 2.95 1065 2096 156134 0 1971 155458 2956270 657130 0.00 0.20 4.78 92.26 14.75 82.31 40 41 45.14 0.04 4747.52 2509.05 0.00 0.00
> >> > 0 0 0 13 1.15 1132 2095 11360 0 0 2 39 19209 0.00 0.00 0.01 99.01 8.02 90.83 39 41 90.24 0.04 2266.04 1346.09 0.00 0.00
> >>
> >> This seems suspicious: a near-zero Pkg%pc6 (0.04) next to a Pkg_J of 2266.04 for package 0.
> >>
> >> I hadn't understood that you're running this on a dual-socket system
> >> until I looked at these results.
> >
> > Sorry not to have mentioned that.
> >
> >> It seems like package #0 is doing
> >> pretty much nothing according to the stats below, but it's still
> >> consuming nearly half of your energy, apparently because the idle
> >> package #0 isn't entering deep sleep states (Pkg%pc6 above is close to
> >> 0%). That could explain your unexpectedly high static power consumption
> >> and the deviation of the real maximum efficiency frequency from the one
> >> reported by your processor, since the reported maximum efficiency ratio
> >> cannot possibly take into account the existence of a second CPU package
> >> with dysfunctional idle management.
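For what it's worth, a small sketch of how one can sample the package-C6
residency that turbostat reports as Pkg%pc6, assuming /dev/cpu/0/msr is
readable and that MSR_PKG_C6_RESIDENCY (0x3f9) ticks at TSC rate on these
parts, which is what turbostat assumes:

/* Sample MSR_PKG_C6_RESIDENCY and the TSC twice and take the ratio of
 * the deltas over a 5 s window, turbostat-style. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static uint64_t rdmsr(int fd, uint32_t reg)
{
        uint64_t val = 0;

        pread(fd, &val, sizeof(val), reg);
        return val;
}

int main(void)
{
        int fd = open("/dev/cpu/0/msr", O_RDONLY);
        uint64_t c6_0, tsc_0, c6_1, tsc_1;

        if (fd < 0) {
                perror("/dev/cpu/0/msr");
                return 1;
        }
        c6_0 = rdmsr(fd, 0x3f9);        /* MSR_PKG_C6_RESIDENCY */
        tsc_0 = rdmsr(fd, 0x10);        /* IA32_TIME_STAMP_COUNTER */
        sleep(5);
        c6_1 = rdmsr(fd, 0x3f9);
        tsc_1 = rdmsr(fd, 0x10);
        printf("Pkg%%pc6 ~ %.2f%%\n",
               100.0 * (c6_1 - c6_0) / (tsc_1 - tsc_0));
        close(fd);
        return 0;
}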
> >
> > Our assumption was that if anything happens on any core, all of the
> > packages remain in a state that allows them to react in a reasonable
> > amount of time to any memory request.
>
> I can see how that might be helpful for workloads that need to be able
> to unleash the whole processing power of your multi-socket system with
> minimal latency, but the majority of multi-socket systems out there with
> completely idle CPU packages are unlikely to notice any performance
> difference as long as those packages remain idle, so the
> environmentalist in me tells me that this is a bad idea. ;)
Certainly it sounds like a bad idea from the point of view of anyone who
wants to save energy, but it's how the machine seems to work (at least in
its current configuration, which is not entirely under my control).
Note also that of the benchmarks, only avrora has the property of often
using only one of the sockets. The others let their threads drift around
more.
>
> >
> >> I'm guessing that if you fully disable one of your CPU packages and
> >> repeat the previous experiment forcing various P-states between 10 and
> >> 37 you should get a maximum efficiency ratio closer to the theoretical
> >> one for this CPU?
> >
> > OK, but that's not really a natural usage context... I do have a
> > one-socket Intel 5220. I'll see what happens there.
> >
>
> Fair, I didn't intend to suggest that you take it offline manually every
> time you don't plan to use it; my suggestion was just intended as an
> experiment to help us confirm or disprove the theory that the reason your
> reported maximum efficiency ratio deviates from reality is the presence of
> that second CPU package with broken idle management. If that's the case,
> the P-state vs. energy usage plot should show a minimum closer to the
> ideal maximum efficiency ratio after the second CPU package is disabled.
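For the record, a rough sketch of what "disabling the second CPU package"
could look like with the CPU hotplug sysfs interface; the package id to
drop (1 here) and the assumption that all CPUs start online are purely
illustrative:

/* Offline every CPU whose physical_package_id is 1. */
#include <stdio.h>

int main(void)
{
        char path[128];
        int cpu, pkg;

        for (cpu = 0; cpu < 4096; cpu++) {
                FILE *f;

                snprintf(path, sizeof(path),
                         "/sys/devices/system/cpu/cpu%d/topology/physical_package_id",
                         cpu);
                f = fopen(path, "r");
                if (!f)
                        break;  /* no more (online) CPUs */
                if (fscanf(f, "%d", &pkg) != 1)
                        pkg = -1;
                fclose(f);
                if (pkg != 1)
                        continue;
                snprintf(path, sizeof(path),
                         "/sys/devices/system/cpu/cpu%d/online", cpu);
                f = fopen(path, "w");
                if (f) {
                        fputs("0\n", f);
                        fclose(f);
                        printf("offlined cpu%d\n", cpu);
                }
        }
        return 0;
}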
More numbers are attached. Pages 1-3 have two-socket machines. Page 4
has a one-socket machine. The values for P-state 20 are highlighted.
For avrora (the one-socket application) on page 2, 20 is not the P-state
with the lowest CPU energy consumption; 35 and 37 do better. Also, for
xalan on page 4 (the one-socket machine), 15 does slightly better than 20.
Otherwise, 20 always seems to be the best.
> > I did some experiments with forcing different frequencies. I haven't
> > finished processing the results, but I notice that as the frequency goes
> > up, the utilization (specifically the value of
> > map_util_perf(sg_cpu->util) at the point of the call to
> > cpufreq_driver_adjust_perf in sugov_update_single_perf) goes up as well.
> > Is this expected?
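For reference, in kernels of this vintage the mapping itself only adds a
constant 25% headroom (see include/linux/sched/cpufreq.h), so any growth
with frequency has to come from sg_cpu->util, i.e. from the
frequency-invariant utilization signal itself:

/* Sketch of the helper schedutil applies before calling
 * cpufreq_driver_adjust_perf(): a fixed 1.25x margin on top of the
 * frequency-invariant utilization. */
static inline unsigned long map_util_perf(unsigned long util)
{
        return util + (util >> 2);      /* util * 1.25 */
}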
> >
>
> Actually, it *is* expected, based on our previous hypothesis that these
> workloads are largely latency-bound: in cases where a given burst of CPU
> work cannot run in parallel with the other tasks the thread needs to
> complete subsequently, its overall runtime will decrease monotonically
> with increasing frequency; therefore the number of instructions executed
> per unit of time will increase monotonically with increasing frequency,
> and with it the thread's frequency-invariant utilization.
I'm not sure. Take two tasks, each one alternately waiting for the
other: if the frequency doubles, they will each run faster and wait less,
but as long as the utilization is computed over a small interval, i.e.
before the application ends, it will always be 50%. The applications,
however, are probably not as simple as this.
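To make the two-task picture concrete, a back-of-the-envelope sketch (the
work per round and the maximum frequency are made-up numbers): the raw
busy fraction stays at 50% at any frequency, although a signal that is
additionally scaled by f/f_max, as the frequency-invariant utilization
is, would still grow with f.

/* Two tasks ping-ponging: each burns W cycles, then waits for its
 * peer's W cycles.  Busy fraction = (W/f) / (2*W/f) = 50% regardless
 * of f; the f/f_max-scaled version grows with f. */
#include <stdio.h>

int main(void)
{
        const double W = 1e9;           /* cycles of work per round */
        const double f_max = 3.7e9;     /* assumed maximum frequency, Hz */
        const double freqs[] = { 1.0e9, 2.0e9, 3.0e9, 3.7e9 };
        unsigned int i;

        for (i = 0; i < sizeof(freqs) / sizeof(freqs[0]); i++) {
                double f = freqs[i];
                double run = W / f;     /* time spent computing */
                double wait = W / f;    /* time waiting for the peer */
                double busy = run / (run + wait);

                printf("f=%.1f GHz  busy=%.0f%%  busy*f/f_max=%.0f%%\n",
                       f / 1e9, 100.0 * busy, 100.0 * busy * f / f_max);
        }
        return 0;
}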
julia
> > thanks,
> > julia
> >
> >> > 0 0 32 1 0.09 1001 2095 37 0 0 0 0 42 0.00 0.00 0.00 100.00 9.08
> >> > 0 1 4 0 0.04 1000 2095 57 0 0 0 1 133 0.00 0.00 0.00 99.96 0.08 99.88 38
> >> > 0 1 36 0 0.00 1000 2095 35 0 0 0 0 40 0.00 0.00 0.00 100.00 0.12
> >> > 0 2 8 0 0.03 1000 2095 64 0 0 0 1 124 0.00 0.00 0.00 99.97 0.08 99.89 38
> >> > 0 2 40 0 0.00 1000 2095 36 0 0 0 0 40 0.00 0.00 0.00 100.00 0.10
> >> > 0 3 12 0 0.00 1000 2095 42 0 0 0 0 71 0.00 0.00 0.00 100.00 0.14 99.86 38
> >> > 0 3 44 1 0.09 1000 2095 63 0 0 0 0 65 0.00 0.00 0.00 99.91 0.05
> >> > 0 4 14 0 0.00 1010 2095 38 0 0 0 1 41 0.00 0.00 0.00 100.00 0.04 99.96 39
> >> > 0 4 46 0 0.00 1011 2095 36 0 0 0 1 41 0.00 0.00 0.00 100.00 0.04
> >> > 0 5 10 0 0.01 1084 2095 39 0 0 0 0 58 0.00 0.00 0.00 99.99 0.04 99.95 38
> >> > 0 5 42 0 0.00 1114 2095 35 0 0 0 0 39 0.00 0.00 0.00 100.00 0.05
> >> > 0 6 6 0 0.03 1005 2095 89 0 0 0 1 116 0.00 0.00 0.00 99.97 0.07 99.90 39
> >> > 0 6 38 0 0.00 1000 2095 38 0 0 0 0 41 0.00 0.00 0.00 100.00 0.10
> >> > 0 7 2 0 0.05 1001 2095 59 0 0 0 1 133 0.00 0.00 0.00 99.95 0.09 99.86 40
> >> > 0 7 34 0 0.00 1000 2095 39 0 0 0 0 65 0.00 0.00 0.00 100.00 0.13
> >> > 0 8 16 0 0.00 1000 2095 43 0 0 0 0 47 0.00 0.00 0.00 100.00 0.04 99.96 38
> >> > 0 8 48 0 0.00 1000 2095 37 0 0 0 0 41 0.00 0.00 0.00 100.00 0.04
> >> > 0 9 20 0 0.00 1000 2095 33 0 0 0 0 37 0.00 0.00 0.00 100.00 0.03 99.97 38
> >> > 0 9 52 0 0.00 1000 2095 33 0 0 0 0 36 0.00 0.00 0.00 100.00 0.03
> >> > 0 10 24 0 0.00 1000 2095 36 0 0 0 1 40 0.00 0.00 0.00 100.00 0.03 99.96 39
> >> > 0 10 56 0 0.00 1000 2095 37 0 0 0 1 38 0.00 0.00 0.00 100.00 0.03
> >> > 0 11 28 0 0.00 1002 2095 35 0 0 0 1 37 0.00 0.00 0.00 100.00 0.03 99.97 38
> >> > 0 11 60 0 0.00 1004 2095 34 0 0 0 0 36 0.00 0.00 0.00 100.00 0.03
> >> > 0 12 30 0 0.00 1001 2095 35 0 0 0 0 40 0.00 0.00 0.00 100.00 0.11 99.88 38
> >> > 0 12 62 0 0.01 1000 2095 197 0 0 0 0 197 0.00 0.00 0.00 99.99 0.10
> >> > 0 13 26 0 0.00 1000 2095 37 0 0 0 0 41 0.00 0.00 0.00 100.00 0.03 99.97 39
> >> > 0 13 58 0 0.00 1000 2095 38 0 0 0 0 40 0.00 0.00 0.00 100.00 0.03
> >> > 0 14 22 0 0.01 1000 2095 149 0 1 2 0 142 0.00 0.01 0.00 99.99 0.07 99.92 39
> >> > 0 14 54 0 0.00 1000 2095 35 0 0 0 0 38 0.00 0.00 0.00 100.00 0.07
> >> > 0 15 18 0 0.00 1000 2095 33 0 0 0 0 36 0.00 0.00 0.00 100.00 0.03 99.97 39
> >> > 0 15 50 0 0.00 1000 2095 34 0 0 0 0 38 0.00 0.00 0.00 100.00 0.03
> >> > 1 0 1 32 3.23 1008 2095 2385 0 31 3190 45025 10144 0.00 0.28 4.68 91.99 11.21 85.56 32 35 0.04 0.04 2481.49 1162.96 0.00 0.00
> >> > 1 0 33 9 0.63 1404 2095 12206 0 5 162 2480 10283 0.00 0.04 0.75 98.64 13.81
> >> > 1 1 5 1 0.07 1384 2095 236 0 0 38 24 314 0.00 0.09 0.06 99.77 4.66 95.27 33
> >> > 1 1 37 81 3.93 2060 2095 1254 0 5 40 59 683 0.00 0.01 0.02 96.05 0.80
> >> > 1 2 9 37 3.46 1067 2095 2396 0 29 2256 55406 11731 0.00 0.17 6.02 90.54 54.10 42.44 31
> >> > 1 2 41 151 14.51 1042 2095 10447 0 135 10494 248077 42327 0.01 0.87 26.57 58.84 43.05
> >> > 1 3 13 110 10.47 1053 2095 7120 0 120 9218 168938 33884 0.01 0.77 16.63 72.68 42.58 46.95 32
> >> > 1 3 45 69 6.76 1021 2095 4730 0 66 5598 115410 23447 0.00 0.44 12.06 81.12 46.29
> >> > 1 4 15 112 10.64 1056 2095 7204 0 116 8831 171423 37754 0.01 0.70 17.56 71.67 28.01 61.35 33
> >> > 1 4 47 18 1.80 1006 2095 1771 0 13 915 29315 6564 0.00 0.07 3.20 95.03 36.85
> >> > 1 5 11 63 5.96 1065 2095 4090 0 58 6449 99015 18955 0.00 0.45 10.27 83.64 31.24 62.80 31
> >> > 1 5 43 72 7.11 1016 2095 4794 0 73 6203 115361 26494 0.00 0.48 11.79 81.02 30.09
> >> > 1 6 7 35 3.39 1022 2095 2328 0 45 3377 52721 13759 0.00 0.27 5.10 91.43 25.84 70.77 32
> >> > 1 6 39 67 6.09 1096 2095 4483 0 52 3696 94964 19366 0.00 0.30 10.32 83.61 23.14
> >> > 1 7 3 1 0.06 1395 2095 91 0 0 0 1 167 0.00 0.00 0.00 99.95 25.36 74.58 35
> >> > 1 7 35 83 8.16 1024 2095 5785 0 100 7398 134640 27428 0.00 0.56 13.39 78.34 17.26
> >> > 1 8 17 46 4.49 1016 2095 3229 0 52 3048 74914 16010 0.00 0.27 8.29 87.19 29.71 65.80 33
> >> > 1 8 49 64 6.12 1052 2095 4210 0 89 5782 100570 21463 0.00 0.42 10.63 83.17 28.08
> >> > 1 9 21 73 7.02 1036 2095 4917 0 64 5786 109887 21939 0.00 0.55 11.61 81.18 22.10 70.88 33
> >> > 1 9 53 64 6.33 1012 2095 4074 0 69 5957 97596 20580 0.00 0.51 9.78 83.74 22.79
> >> > 1 10 25 26 2.58 1013 2095 1825 0 22 2124 42630 8627 0.00 0.17 4.17 93.24 53.91 43.52 33
> >> > 1 10 57 159 15.59 1022 2095 10951 0 175 14237 256828 56810 0.01 1.10 26.00 58.16 40.89
> >> > 1 11 29 112 10.54 1065 2095 7462 0 126 9548 179206 39821 0.01 0.85 18.49 70.71 29.46 60.00 31
> >> > 1 11 61 29 2.89 1011 2095 2002 0 24 2468 45558 10288 0.00 0.20 4.71 92.36 37.11
> >> > 1 12 31 37 3.66 1011 2095 2596 0 79 3161 61027 13292 0.00 0.24 6.48 89.79 23.75 72.59 32
> >> > 1 12 63 56 5.08 1107 2095 3789 0 62 4777 79133 17089 0.00 0.41 7.91 86.86 22.31
> >> > 1 13 27 12 1.14 1045 2095 1477 0 16 888 18744 3250 0.00 0.06 2.18 96.70 21.23 77.64 32
> >> > 1 13 59 60 5.81 1038 2095 5230 0 60 4936 87225 21402 0.00 0.41 8.95 85.14 16.55
> >> > 1 14 23 28 2.75 1024 2095 2008 0 20 1839 47417 9177 0.00 0.13 5.08 92.21 34.18 63.07 32
> >> > 1 14 55 106 9.58 1105 2095 6292 0 89 7182 141379 31354 0.00 0.63 14.45 75.81 27.36
> >> > 1 15 19 118 11.65 1012 2095 7872 0 121 10014 193186 40448 0.01 0.80 19.53 68.68 37.53 50.82 32
> >> > 1 15 51 59 5.58 1059 2095 3967 0 54 5842 88063 21138 0.00 0.39 9.12 85.23 43.60
> >>
>
Download attachment "h2.pdf" of type "application/pdf" (70930 bytes)