Message-ID: <CAJZ5v0g5wDxYXA-V=Ex_Md82hgnj5K6Vr0tavFFVz=uBqo8wag@mail.gmail.com>
Date:   Thu, 30 Dec 2021 18:03:33 +0100
From:   "Rafael J. Wysocki" <rafael@...nel.org>
To:     Julia Lawall <julia.lawall@...ia.fr>
Cc:     "Rafael J. Wysocki" <rafael@...nel.org>,
        Francisco Jerez <currojerez@...eup.net>,
        Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
        Len Brown <lenb@...nel.org>,
        Viresh Kumar <viresh.kumar@...aro.org>,
        Linux PM <linux-pm@...r.kernel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

On Wed, Dec 29, 2021 at 10:13 AM Julia Lawall <julia.lawall@...ia.fr> wrote:
>
>
>
> On Tue, 28 Dec 2021, Rafael J. Wysocki wrote:
>
> > On Tue, Dec 28, 2021 at 6:46 PM Julia Lawall <julia.lawall@...ia.fr> wrote:
> > >
> > >
> > >
> > > On Tue, 28 Dec 2021, Rafael J. Wysocki wrote:
> > >
> > > > On Tue, Dec 28, 2021 at 5:58 PM Julia Lawall <julia.lawall@...ia.fr> wrote:
> > > > >
> > > > > I looked a bit more into why pstate 20 is always using the least energy. I
> > > > > have just one thread spinning for 10 seconds, I use a fixed value for the
> > > > > pstate, and I measure the energy usage with turbostat.
> > > >
> > > > How exactly do you fix the pstate?
> > >
> > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > index e7af18857371..19440b15454c 100644
> > > --- a/kernel/sched/cpufreq_schedutil.c
> > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > @@ -400,7 +402,7 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
> > >                 sg_cpu->util = prev_util;
> > >
> > >         cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
> > > -                                  map_util_perf(sg_cpu->util), sg_cpu->max);
> > > +                                  sysctl_sched_fixedfreq, sg_cpu->max);
> >
> > This is just changing the "target" hint given to the processor which
> > may very well ignore it, though.
> >
> > >
> > >         sg_cpu->sg_policy->last_freq_update_time = time;
> > >  }
> > >
> > > ------------------------------
> > >
> > > sysctl_sched_fixedfreq is a variable that I added to sysfs.
> >
> > If I were trying to fix a pstate, I would set scaling_max_freq and
> > scaling_min_freq in sysfs for all CPUs to the same value.
> >
> > That would cause intel_pstate to set HWP min and max to the same value
> > which should really cause the pstate to be fixed, at least outside the
> > turbo range of pstates.
>
> The effect is the same.  But that approach is indeed simpler than patching
> the kernel.

It is also applicable when intel_pstate runs in the active mode.
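Pinning the P-state through scaling_max_freq and scaling_min_freq, as suggested above, means writing the same value in kHz to both attributes of every CPU. The following is a minimal C sketch under stated assumptions. The helper names write_khz() and pin_all_cpus() are hypothetical. The sysfs root is a parameter so the code can be tried without root. Writing max before min assumes the target is not below the current scaling_min_freq.

```c
#include <stdio.h>

/* Write one value (in kHz) to a sysfs-style attribute file.
 * Returns 0 on success and -1 on failure; fclose() is checked because
 * buffered stdio only reports a rejected sysfs write when it flushes. */
int write_khz(const char *path, unsigned int khz)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fprintf(f, "%u\n", khz) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Hypothetical helper: set scaling_max_freq and scaling_min_freq of CPUs
 * 0..ncpus-1 to the same value.  sysfs_root is normally
 * "/sys/devices/system/cpu"; it is a parameter only so that the function
 * can be exercised without root.  Writing max before min assumes khz is not
 * below the current scaling_min_freq, so min never exceeds max in between. */
int pin_all_cpus(const char *sysfs_root, int ncpus, unsigned int khz)
{
    char path[512];

    for (int cpu = 0; cpu < ncpus; cpu++) {
        snprintf(path, sizeof path, "%s/cpu%d/cpufreq/scaling_max_freq",
                 sysfs_root, cpu);
        if (write_khz(path, khz) != 0)
            return -1;
        snprintf(path, sizeof path, "%s/cpu%d/cpufreq/scaling_min_freq",
                 sysfs_root, cpu);
        if (write_khz(path, khz) != 0)
            return -1;
    }
    return 0;
}
```

Run as root, pin_all_cpus("/sys/devices/system/cpu", ncpus, 2000000) would request 2.0 GHz on every CPU. As noted above, intel_pstate then sets HWP min and max to the same value, but the pin only really holds outside the turbo range.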

As for the results that you have reported, it looks like the package
power on these systems is dominated by package voltage and going from
P-state 20 to P-state 21 causes that voltage to increase significantly
(the observed RAM energy usage pattern is consistent with that).  This
means that running at P-states above 20 is only really justified if
there is a strict performance requirement that can't be met otherwise.
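Why a voltage step dominates can be seen from the usual first-order CMOS model. This is an assumption for illustration, not something derived from the measurements in this thread. Let C_eff be the effective switched capacitance, V the voltage, f the frequency, P_s a voltage-independent static/uncore power, and N the number of cycles of work:

```latex
P = C_{\mathrm{eff}} V^{2} f + P_{\mathrm{s}}, \qquad t_{\mathrm{busy}} = \frac{N}{f}

E = P \, t_{\mathrm{busy}} = C_{\mathrm{eff}} V^{2} N + \frac{P_{\mathrm{s}} N}{f}
```

At a fixed voltage, raising f only shrinks the static term. A voltage step from V_1 to V_2 scales the dynamic term by (V_2/V_1)^2, so a 10% voltage increase costs about 21% more dynamic energy for the same work.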

Can you please check what value is there in the base_frequency sysfs
attribute under cpuX/cpufreq/?
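The base_frequency value is in kHz. Turning it into a P-state number shows whether P-state 20 coincides with the base frequency. Below is a minimal C sketch. It assumes that intel_pstate on Intel Core processors uses 100 MHz per P-state step; other CPU families may differ. The helper names are hypothetical.

```c
#include <stdio.h>

/* Assumption: on Intel Core processors intel_pstate treats one P-state
 * step as 100 MHz (100000 kHz); other CPU families may use another scale. */
#define KHZ_PER_PSTATE 100000L

/* Read a frequency in kHz from a sysfs-style file; returns -1 on error. */
long read_khz(const char *path)
{
    FILE *f = fopen(path, "r");
    long khz;
    int n;

    if (!f)
        return -1;
    n = fscanf(f, "%ld", &khz);
    fclose(f);
    return n == 1 ? khz : -1;
}

/* Convert kHz to a P-state ratio under the 100 MHz-per-step assumption. */
long khz_to_pstate(long khz)
{
    return khz < 0 ? -1 : khz / KHZ_PER_PSTATE;
}

/* P-state ratio corresponding to cpuN/cpufreq/base_frequency, or -1. */
long base_pstate(int cpu)
{
    char path[128];

    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/base_frequency", cpu);
    return khz_to_pstate(read_khz(path));
}
```

If base_pstate(0) returns 20 on these machines, the least-energy P-state would coincide with the base (guaranteed) frequency.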

I'm guessing that the package voltage level for P-states 10 and 20 is
the same, so the power difference between them is not significant
relative to the difference between P-state 20 and 21.  If increasing
the P-state causes some extra idle time to appear in the workload
(even though there is not enough of it to prevent the overall
utilization from increasing), then the overall power draw when running
at P-state 10 may be greater than for P-state 20.
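The argument above can be illustrated with a toy model. All parameters are hypothetical, and frequency is assumed to be P-state × 100 MHz. A fixed amount of work runs inside a fixed window. While busy, the package draws dynamic power C_eff·V²·f plus a constant static power. For the rest of the window it draws a lower idle power. P-states 10 and 20 share one voltage and P-state 21 has a higher one. This is a sketch of the reasoning, not a model of the measured systems.

```c
#include <stdio.h>

/* Toy model; every number fed into it is hypothetical. */
struct pstate_point {
    int pstate;   /* ratio; frequency assumed to be pstate * 100 MHz */
    double volts; /* hypothetical package voltage at this P-state */
};

/* Energy (J) to execute `cycles` of work inside a window of `window_s`
 * seconds: while busy the package draws c_eff * V^2 * f + p_static,
 * and for the remaining (idle) part of the window it draws p_idle.
 * Returns -1 if the work does not fit in the window at this P-state. */
double window_energy(struct pstate_point p, double cycles, double window_s,
                     double c_eff, double p_static, double p_idle)
{
    double f = p.pstate * 100e6;
    double busy_s = cycles / f;

    if (busy_s > window_s)
        return -1.0;
    double p_busy = c_eff * p.volts * p.volts * f + p_static;
    return p_busy * busy_s + p_idle * (window_s - busy_s);
}
```

With 2e10 cycles in a 25 s window, C_eff = 1 nF, 2 W static power and 0.5 W idle power, the toy numbers come out at about 55 J for P-state 10 (0.80 V), 40 J for P-state 20 (0.80 V) and 45 J for P-state 21 (0.95 V). P-state 10 loses because static power accrues for twice as long at the same voltage. P-state 21 loses because of the voltage step.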

You can check if there is any C-state residency difference between
these two cases by running the workload under turbostat in each of
them.
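Besides turbostat, the kernel's own idle accounting can be sampled before and after each run. The cpuidle sysfs attribute cpuN/cpuidle/stateK/time holds the cumulative time, in microseconds, spent in that idle state. This is the kernel's view and may differ from turbostat's hardware residency counters. The following minimal C sketch uses hypothetical helper names.

```c
#include <stdio.h>

/* Read one unsigned decimal value from a sysfs-style file; -1 on error. */
long long read_ull(const char *path)
{
    FILE *f = fopen(path, "r");
    unsigned long long v;
    int n;

    if (!f)
        return -1;
    n = fscanf(f, "%llu", &v);
    fclose(f);
    return n == 1 ? (long long)v : -1;
}

/* Cumulative time (us) that `cpu` has spent in cpuidle state `state`. */
long long idle_state_time_us(int cpu, int state)
{
    char path[128];

    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/time", cpu, state);
    return read_ull(path);
}

/* Fraction of a measurement window spent in the state, computed from two
 * snapshots of the cumulative counter; returns -1.0 on invalid input. */
double residency_fraction(long long before_us, long long after_us,
                          long long window_us)
{
    if (before_us < 0 || after_us < before_us || window_us <= 0)
        return -1.0;
    return (double)(after_us - before_us) / (double)window_us;
}
```

Take one snapshot per idle state before the workload and one after, then compare the residency fractions for the P-state 10 and P-state 20 runs.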

Anyway, this is a configuration in which the HWP scaling algorithm
used when intel_pstate runs in the active mode is likely to work
better, because it should take the processor design into account.
That's why it is the default configuration of intel_pstate on systems
with HWP.  There are cases in which schedutil helps, but that's mostly
when HWP without it tends to run the workload too fast, because it
lacks the utilization history provided by PELT.
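The utilization history mentioned above can be sketched as a decaying average. In this simplified floating-point model, each 1024 us period is folded in with a decay factor y chosen so that y^32 = 0.5. History from about 32 ms ago therefore counts half as much as the current period. The result is then given 25% headroom before becoming a performance target, which is our reading of map_util_perf() in the diff above (util + util/4). The real kernel code uses fixed-point arithmetic and partial periods. Treat this as an approximation, not the kernel implementation.

```c
#include <stdio.h>

/* Assumed PELT parameters: decay y with y^32 = 0.5, one update per
 * 1024 us period, full utilization normalized to 1024. */
#define PELT_Y 0.978572062     /* approximately 0.5^(1/32) */
#define SCHED_CAPACITY 1024.0

/* Fold one period into the running average: running = 1 if the task ran
 * for the whole period, 0 if the CPU was idle for the whole period. */
double pelt_update(double util, int running)
{
    return util * PELT_Y + (running ? SCHED_CAPACITY * (1.0 - PELT_Y) : 0.0);
}

/* 25% headroom before the value is used as a performance target. */
double map_util_perf_f(double util)
{
    return util + util / 4.0;
}
```

A task that has been running continuously drives the average toward 1024. After 32 idle periods it has decayed to about half. This history is what HWP alone does not see.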
