linux-kernel - Re: cpufreq: intel_pstate: map utilization into the pstate range

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJZ5v0he+_p5qVkx+fGUg7BCBYmm5yRh4q-_9jgJoZLwDf1c2Q@mail.gmail.com>
Date:   Sun, 19 Dec 2021 15:19:07 +0100
From:   "Rafael J. Wysocki" <rafael@...nel.org>
To:     Julia Lawall <julia.lawall@...ia.fr>
Cc:     Francisco Jerez <currojerez@...eup.net>,
        "Rafael J. Wysocki" <rafael@...nel.org>,
        Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
        Len Brown <lenb@...nel.org>,
        Viresh Kumar <viresh.kumar@...aro.org>,
        Linux PM <linux-pm@...r.kernel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

On Sun, Dec 19, 2021 at 7:42 AM Julia Lawall <julia.lawall@...ia.fr> wrote:
>
>
>
> On Sat, 18 Dec 2021, Francisco Jerez wrote:
>
> > Julia Lawall <julia.lawall@...ia.fr> writes:
> >
> > > On Sat, 18 Dec 2021, Francisco Jerez wrote:
> > >
> > >> Julia Lawall <julia.lawall@...ia.fr> writes:
> > >>
> > >> >> As you can see in intel_pstate.c, min_pstate is initialized on core
> > >> >> platforms from MSR_PLATFORM_INFO[47:40], which is "Maximum Efficiency
> > >> >> Ratio (R/O)".  However that seems to deviate massively from the most
> > >> >> efficient ratio on your system, which may indicate a firmware bug, some
> > >> >> sort of clock gating problem, or an issue with the way that
> > >> >> intel_pstate.c processes this information.
> > >> >
> > >> > I'm not sure to understand the bug part.  min_pstate gives the frequency
> > >> > that I find as the minimum frequency when I look for the specifications of
> > >> > the CPU.  Should one expect that it should be something different?
> > >> >
> > >>
> > >> I'd expect the minimum frequency on your processor specification to
> > >> roughly match the "Maximum Efficiency Ratio (R/O)" value from that MSR,
> > >> since there's little reason to claim your processor can be clocked down
> > >> to a frequency which is inherently inefficient /and/ slower than the
> > >> maximum efficiency ratio -- In fact they both seem to match in your
> > >> system, they're just nowhere close to the frequency which is actually
> > >> most efficient, which smells like a bug, like your processor
> > >> misreporting what the most efficient frequency is, or it deviating from
> > >> the expected one due to your CPU static power consumption being greater
> > >> than it would be expected to be under ideal conditions -- E.g. due to
> > >> some sort of clock gating issue, possibly due to a software bug, or due
> > >> to our scheduling of such workloads with a large amount of lightly
> > >> loaded threads being unnecessarily inefficient which could also be
> > >> preventing most of your CPU cores from ever being clock-gated even
> > >> though your processor may be sitting idle for a large fraction of their
> > >> runtime.
> > >
> > > The original mail has results from two different machines: Intel 6130
> > > (skylake) and Intel 5218 (cascade lake).  I have access to another cluster
> > > of 6130s and 5218s.  I can try them.
> > >
> > > I tried 5.9 in which I just commented out the schedutil code to make
> > > frequency requests.  I only tested avrora (tiny pauses) and h2 (longer
> > > pauses) and in both case the execution is almost entirely in the turbo
> > > frequencies.
> > >
> > > I'm not sure to understand the term "clock-gated".  What C state does that
> > > correspond to?  The turbostat output for one run of avrora is below.
> > >
> >
> > I didn't have any specific C1+ state in mind, most of the deeper ones
> > implement some sort of clock gating among other optimizations, I was
> > just wondering whether some sort of software bug and/or the highly
> > intermittent CPU utilization pattern of these workloads are preventing
> > most of your CPU cores from entering deep sleep states.  See below.
> >
> > > julia
> > >
> > > 78.062895 sec
> > > Package Core  CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     POLL    C1      C1E     C6      POLL%   C1%     C1E%    C6%     CPU%c1  CPU%c6  CoreTmp PkgTmp  Pkg%pc2 Pkg%pc6 Pkg_J   RAM_J   PKG_%   RAM_%
> > > -     -       -       31      2.95    1065    2096    156134  0       1971    155458  2956270 657130  0.00    0.20    4.78    92.26   14.75   82.31   40      41      45.14   0.04    4747.52 2509.05 0.00    0.00
> > > 0     0       0       13      1.15    1132    2095    11360   0       0       2       39      19209   0.00    0.00    0.01    99.01   8.02    90.83   39      41      90.24   0.04    2266.04 1346.09 0.00    0.00
> >
> > This seems suspicious:                                                                                                                                                          ^^^^    ^^^^^^^
> >
> > I hadn't understood that you're running this on a dual-socket system
> > until I looked at these results.
>
> Sorry not to have mentioned that.
>
> > It seems like package #0 is doing
> > pretty much nothing according to the stats below, but it's still
> > consuming nearly half of your energy, apparently because the idle
> > package #0 isn't entering deep sleep states (Pkg%pc6 above is close to
> > 0%).  That could explain your unexpectedly high static power consumption
> > and the deviation of the real maximum efficiency frequency from the one
> > reported by your processor, since the reported maximum efficiency ratio
> > cannot possibly take into account the existence of a second CPU package
> > with dysfunctional idle management.
>
> Our assumption was that if anything happens on any core, all of the
> packages remain in a state that allows them to react in a reasonable
> amount of time ot any memory request.
>
> > I'm guessing that if you fully disable one of your CPU packages and
> > repeat the previous experiment forcing various P-states between 10 and
> > 37 you should get a maximum efficiency ratio closer to the theoretical
> > one for this CPU?
>
> OK, but that's not really a natural usage context...  I do have a
> one-socket Intel 5220.  I'll see what happens there.
>
> I did some experiements with forcing different frequencies.  I haven't
> finished processing the results, but I notice that as the frequency goes
> up, the utilization (specifically the value of
> map_util_perf(sg_cpu->util) at the point of the call to
> cpufreq_driver_adjust_perf in sugov_update_single_perf) goes up as well.
> Is this expected?

It isn't, as long as the scale-invariance mechanism mentioned in my
previous message works properly.