Message-ID: <000001d59bc9$aab4e010$001ea030$@net>
Date: Fri, 15 Nov 2019 07:30:47 -0800
From: "Doug Smythies" <dsmythies@...us.net>
To: "'Peter Zijlstra'" <peterz@...radead.org>
Cc: "'linux-kernel'" <linux-kernel@...r.kernel.org>,
"'Ingo Molnar'" <mingo@...hat.com>,
"'Dietmar Eggemann'" <dietmar.eggemann@....com>,
"'Juri Lelli'" <juri.lelli@...hat.com>,
"'Steven Rostedt'" <rostedt@...dmis.org>,
"'Mel Gorman'" <mgorman@...e.de>,
"'open list:THERMAL'" <linux-pm@...r.kernel.org>,
"'Linus Torvalds'" <torvalds@...ux-foundation.org>,
"'Thomas Gleixner'" <tglx@...utronix.de>,
"'Sargun Dhillon'" <sargun@...gun.me>,
"'Tejun Heo'" <tj@...nel.org>, "'Xie XiuQi'" <xiexiuqi@...wei.com>,
<xiezhipeng1@...wei.com>,
"'Srinivas Pandruvada'" <srinivas.pandruvada@...ux.intel.com>,
"'Vincent Guittot'" <vincent.guittot@...aro.org>
Subject: RE: [PATCH v4] sched/freq: move call to cpufreq_update_util
Hi Peter,
On 2019.11.15 05:02 Peter Zijlstra wrote:
> On Fri, Nov 15, 2019 at 12:03:31PM +0100, Vincent Guittot wrote:
>
>> This patch does 2 things:
>> - fix the spurious call to cpufreq just before attaching a task
>
> Right, so that one doesn't concern me too much.
>
>> - make sure cpufreq is still called when cfs is 0 but not irq/rt or dl
>
> But per the rq->has_blocked_load logic we would mostly stop sending
> events once we reach all 0s.
>
> Now, most of those updates will be through _nohz_idle_balance() ->
> update_nohz_stats(), which are remote, which means intel_pstate is
> ignoring them anyway.
>
> Now the _nohz_idle_balance() -> update_blocked_averages() thing runs
> local, and that will update the one random idle CPU we picked to run
> nohz balance, but all others will be left where they were.
>
> So why does intel_pstate care... Esp. on SKL+ with per-core P state this
> is of dubious value.
>
> Also, and maybe I should go read back, why do we care what the P state
> is when we're mostly in C states anyway? These are all idle CPUs,
> otherwise we wouldn't be running update_blocked_averages() on them
> anyway.
>
> Much confusion..
Background:
It is true that this is very likely a rare use case.
Apparently, I can make my test system considerably more "idle"
than most.
For many years, I had never seen the time between calls,
per CPU, to the intel_pstate driver exceed 4 seconds.
Then, as of:
  sched/fair: Fix O(nr_cgroups) in load balance path
and for an idle system, the time between calls could be as
much as a few hundred seconds. Not knowing much (anything)
about scheduler details, I found this odd, and so investigated.
And yes, so who cares if we are in deep C states anyhow?
If, for whatever reason, the system is running with
"intel_idle.max_cstate=1" my findings were that
the processor could end up consuming a lot more energy
for a long long time. Why? Because, at least for my
processor, and older i7-2600K (no HWP), in idle state 1, the
CPU does not relinquish its vote to the PLL, and with
no calls to the driver the requested p-state doesn't decay.
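As an aside on how I check this: a rough user-space sketch (not
anything from the kernel tree; the CPU number is just an example and
it assumes the msr module is loaded). Bits 15:8 of IA32_PERF_CTL
(MSR 0x199) hold the last requested P-state ratio, and on this
machine that value simply stays at whatever was last written until
the driver is called again:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	uint64_t val;
	int fd = open("/dev/cpu/3/msr", O_RDONLY);	/* CPU 3 is just an example */

	if (fd < 0 || pread(fd, &val, sizeof(val), 0x199) != sizeof(val)) {
		perror("IA32_PERF_CTL");
		return 1;
	}
	/* bits 15:8 = requested P-state ratio */
	printf("requested P-state: %llu\n",
	       (unsigned long long)((val >> 8) & 0xff));
	close(fd);
	return 0;
}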
Not previously mentioned: the situation is considerably
exacerbated by this piece of boost code within the intel_pstate
driver:
/*
* If the average P-state during the previous cycle was higher than the
* current target, add 50% of the difference to the target to reduce
* possible performance oscillations and offset possible performance
* loss related to moving the workload from one CPU to another within
* a package/module.
*/
avg_pstate = get_avg_pstate(cpu);
if (avg_pstate > target)
        target += (avg_pstate - target) >> 1;
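To make the effect concrete, here is a minimal user-space sketch (not
driver code) of what I believe happens in the max_cstate=1 case: the
load-based target has fallen to the minimum P-state, but because the
core keeps running at the old request while in C1, get_avg_pstate()
keeps reporting roughly that old request, so the boost only halves
the gap per driver call. The numbers (minimum P-state 16, stale
request 38) are hypothetical, roughly matching the i7-2600K:

#include <stdio.h>

int main(void)
{
	int min_pstate = 16;	/* assumed minimum P-state */
	int request = 38;	/* stale high request left over from a burst */

	for (int i = 1; i <= 6; i++) {
		int target = min_pstate;	/* load says: go to the minimum */
		int avg_pstate = request;	/* C1 keeps granting the old request */

		/* the boost logic quoted above */
		if (avg_pstate > target)
			target += (avg_pstate - target) >> 1;

		request = target;
		printf("call %d: requested P-state %d\n", i, request);
	}
	return 0;
}

It prints 27, 21, 18, 17, 16, 16: the request only converges back to
the minimum after several driver calls, and with those calls now a
few hundred seconds apart the stale request can persist for many
minutes.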
Hope this helps.
... Doug