Message-ID: <000001d59bc9$aab4e010$001ea030$@net>
Date: Fri, 15 Nov 2019 07:30:47 -0800
From: "Doug Smythies" <dsmythies@...us.net>
To: "'Peter Zijlstra'" <peterz@...radead.org>
Cc: "'linux-kernel'" <linux-kernel@...r.kernel.org>,
"'Ingo Molnar'" <mingo@...hat.com>,
"'Dietmar Eggemann'" <dietmar.eggemann@....com>,
"'Juri Lelli'" <juri.lelli@...hat.com>,
"'Steven Rostedt'" <rostedt@...dmis.org>,
"'Mel Gorman'" <mgorman@...e.de>,
"'open list:THERMAL'" <linux-pm@...r.kernel.org>,
"'Linus Torvalds'" <torvalds@...ux-foundation.org>,
"'Thomas Gleixner'" <tglx@...utronix.de>,
"'Sargun Dhillon'" <sargun@...gun.me>,
"'Tejun Heo'" <tj@...nel.org>, "'Xie XiuQi'" <xiexiuqi@...wei.com>,
<xiezhipeng1@...wei.com>,
"'Srinivas Pandruvada'" <srinivas.pandruvada@...ux.intel.com>,
"'Vincent Guittot'" <vincent.guittot@...aro.org>
Subject: RE: [PATCH v4] sched/freq: move call to cpufreq_update_util
Hi Peter,
On 2019.11.15 05:02 Peter Zijlstra wrote:
> On Fri, Nov 15, 2019 at 12:03:31PM +0100, Vincent Guittot wrote:
>
>> This patch does 2 things:
>> - fix the spurious call to cpufreq just before attaching a task
>
> Right, so that one doesn't concern me too much.
>
>> - make sure cpufreq is still called when cfs is 0 but not irq/rt or dl
>
> But per the rq->has_blocked_load logic we would mostly stop sending
> events once we reach all 0s.
>
> Now, most of those updates will be through _nohz_idle_balance() ->
> update_nohz_stats(), which are remote, which means intel_pstate is
> ignoring them anyway.
>
> Now the _nohz_idle_balance() -> update_blocked_averages() thing runs
> local, and that will update the one random idle CPU we picked to run
> nohz balance, but all others will be left where they were.
>
> So why does intel_pstate care... Esp. on SKL+ with per-core P state this
> is of dubious value.
>
> Also, and maybe I should go read back, why do we care what the P state
> is when we're mostly in C states anyway? These are all idle CPUs,
> otherwise we wouldn't be running update_blocked_averages() on them
> anyway.
>
> Much confusion..
Background:
It is true that this is very likely a rare use case.
Apparently, I can make my test system considerably more "idle"
than most.
For many years, I had never seen the time between calls,
per CPU, to the intel_pstate driver exceed 4 seconds.
Then, as of:
  sched/fair: Fix O(nr_cgroups) in load balance path
and for an idle system, the time between calls could be as
much as a few hundred seconds. Not knowing much (anything)
about scheduler details, I found this odd, and so investigated.
And yes, so who cares if we are in deep C states anyhow?
If, for whatever reason, the system is running with
"intel_idle.max_cstate=1" my findings were that
the processor could end up consuming a lot more energy
for a long long time. Why? Because, at least for my
processor, and older i7-2600K (no HWP), in idle state 1, the
CPU does not relinquish its vote to the PLL, and with
no calls to the driver the requested p-state doesn't decay.
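As an aside on how I check this: a rough user-space sketch (not
anything from the kernel tree; the CPU number is just an example and
it assumes the msr module is loaded). Bits 15:8 of IA32_PERF_CTL
(MSR 0x199) hold the last requested P-state ratio, and on this
machine that value simply stays at whatever was last written until
the driver is called again:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	uint64_t val;
	int fd = open("/dev/cpu/3/msr", O_RDONLY);	/* CPU 3 is just an example */

	if (fd < 0 || pread(fd, &val, sizeof(val), 0x199) != sizeof(val)) {
		perror("IA32_PERF_CTL");
		return 1;
	}
	/* bits 15:8 = requested P-state ratio */
	printf("requested P-state: %llu\n",
	       (unsigned long long)((val >> 8) & 0xff));
	close(fd);
	return 0;
}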
Not previously mentioned: the situation is considerably
exacerbated by this piece of boost code within the intel_pstate
driver:
/*
* If the average P-state during the previous cycle was higher than the
* current target, add 50% of the difference to the target to reduce
* possible performance oscillations and offset possible performance
* loss related to moving the workload from one CPU to another within
* a package/module.
*/
avg_pstate = get_avg_pstate(cpu);
if (avg_pstate > target)
        target += (avg_pstate - target) >> 1;
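To make the effect concrete, here is a minimal user-space sketch (not
driver code) of what I believe happens in the max_cstate=1 case: the
load-based target has fallen to the minimum P-state, but because the
core keeps running at the old request while in C1, get_avg_pstate()
keeps reporting roughly that old request, so the boost only halves
the gap per driver call. The numbers (minimum P-state 16, stale
request 38) are hypothetical, roughly matching the i7-2600K:

#include <stdio.h>

int main(void)
{
	int min_pstate = 16;	/* assumed minimum P-state */
	int request = 38;	/* stale high request left over from a burst */

	for (int i = 1; i <= 6; i++) {
		int target = min_pstate;	/* load says: go to the minimum */
		int avg_pstate = request;	/* C1 keeps granting the old request */

		/* the boost logic quoted above */
		if (avg_pstate > target)
			target += (avg_pstate - target) >> 1;

		request = target;
		printf("call %d: requested P-state %d\n", i, request);
	}
	return 0;
}

It prints 27, 21, 18, 17, 16, 16: the request only converges back to
the minimum after several driver calls, and with those calls now a
few hundred seconds apart the stale request can persist for many
minutes.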
Hope this helps.
... Doug