Message-ID: <dfddfd651256472fa1b7c9db2a4dcb54@MSX-L104.msx.ad.zih.tu-dresden.de>
Date: Sat, 17 Mar 2018 13:42:19 +0100
From: Thomas Ilsche <thomas.ilsche@...dresden.de>
To: "Rafael J. Wysocki" <rjw@...ysocki.net>,
Peter Zijlstra <peterz@...radead.org>,
Linux PM <linux-pm@...r.kernel.org>,
"Frederic Weisbecker" <fweisbec@...il.com>
CC: Thomas Gleixner <tglx@...utronix.de>,
Paul McKenney <paulmck@...ux.vnet.ibm.com>,
Doug Smythies <dsmythies@...us.net>,
"Rik van Riel" <riel@...riel.com>,
Aubrey Li <aubrey.li@...ux.intel.com>,
"Mike Galbraith" <mgalbraith@...e.de>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [RFT][PATCH v5 0/7] sched/cpuidle: Idle loop rework
Over the last week I tested v4+pollv2 and now v5+pollv3. With v5, I
observe a particular idle behavior that I have not seen before with
v4. On a dual-socket Skylake system the idle power increases from
74.1 W (system total) to 85.5 W with a 300 Hz build and even to
138.3 W with a 1000 Hz build. A similar Haswell-EP system is also
affected.
There are phases during which one core keeps switching to the
highest C-state without disabling the sched tick. Every 4th sched tick,
a kworker on that core is scheduled briefly. Every wakeup from C6 of a
single core more than doubles the package power consumption of
*both* sockets for ~500 us, resulting in significantly increased
sustained power consumption.
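As a rough back-of-the-envelope check (not a measurement), the sketch
below estimates what fraction of time the packages would spend in the
elevated-power window. It assumes that every sched tick on the affected
core counts as one wakeup from C6, that the ~500 us windows do not
overlap, and that the same window length applies to both builds. The
function and constant names are made up for illustration.

```python
# Rough duty-cycle estimate for the elevated-power window after C6 wakeups.
# Assumptions (not measured): one wakeup per sched tick on the affected
# core, non-overlapping windows, same ~500 us window for both builds.

ELEVATED_WINDOW_S = 500e-6  # ~500 us of elevated package power per wakeup


def elevated_fraction(tick_hz: int, window_s: float = ELEVATED_WINDOW_S) -> float:
    """Fraction of wall time spent inside the elevated-power window."""
    wakeups_per_second = tick_hz  # assumption: each tick wakes the core
    return min(1.0, wakeups_per_second * window_s)


if __name__ == "__main__":
    for hz in (300, 1000):
        print(f"{hz} Hz build: {elevated_fraction(hz):.0%} of time elevated")
```

Under these assumptions the 300 Hz build spends about 15% of the time
in the elevated window and the 1000 Hz build about 50%, which at least
points in the same direction as the power numbers above.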
This is illustrated in [1]. For a comparison of a "normal" phase
(same kernel), see [2]. For a global view of the effect on a 1000 Hz
build, see [3].
I have not yet found any particular triggers or the specific
interaction between the sched tick and the kworker. I'm not sure how
this was introduced in v5. I would guess it could be a feedback loop
that I was concerned about initially.
I have more findings from v4, but this seems much more impactful.
[1] https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/rjwv5_idle_300Hz.png
[2] https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/rjwv5_idle_300Hz_ok.png
[3] https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/rjwv5_idle_1000Hz.png
On 2018-03-15 22:59, Rafael J. Wysocki wrote:
> Hi All,
>
> Thanks a lot for the feedback so far!
>
> One more respin after the last batch of comments from Peter and Frederic.
>
> The previous summary that still applies:
>
> On Sunday, March 4, 2018 11:21:30 PM CET Rafael J. Wysocki wrote:
>>
>> The problem is that if we stop the sched tick in
>> tick_nohz_idle_enter() and then the idle governor predicts short idle
>> duration, we lose regardless of whether or not it is right.
>>
>> If it is right, we've lost already, because we stopped the tick
>> unnecessarily. If it is not right, we'll lose going forward, because
>> the idle state selected by the governor is going to be too shallow and
>> we'll draw too much power (that has been reported recently to actually
>> happen often enough for people to care).
>>
>> This patch series is an attempt to improve the situation and the idea
>> here is to make the decision whether or not to stop the tick deeper in
>> the idle loop and in particular after running the idle state selection
>> in the path where the idle governor is invoked. This way the problem
>> can be avoided, because the idle duration predicted by the idle governor
>> can be used to decide whether or not to stop the tick so that the tick
>> is only stopped if that value is large enough (and, consequently, the
>> idle state selected by the governor is deep enough).
>>
>> The series tries to avoid adding too much new code, rather reorder the
>> existing code and make it more fine-grained.
>>
>> Patch 1 prepares the tick-sched code for the subsequent modifications and it
>> doesn't change the code's functionality (at least not intentionally).
>>
>> Patch 2 starts pushing the tick stopping decision deeper into the idle
>> loop, but that is limited to do_idle() and tick_nohz_irq_exit().
>>
>> Patch 3 makes cpuidle_idle_call() decide whether or not to stop the tick
>> and sets the stage for the subsequent changes.
>>
>> Patch 4 adds a bool pointer argument to cpuidle_select() and the ->select
>> governor callback allowing them to return a "nohz" hint on whether or not to
>> stop the tick to the caller. It also adds code to decide what value to
>> return as "nohz" to the menu governor.
>>
>> Patch 5 reorders the idle state selection with respect to the stopping of
>> the tick and causes the additional "nohz" hint from cpuidle_select() to be
>> used for deciding whether or not to stop the tick.
>>
>> Patch 6 causes the menu governor to refine the state selection in case the
>> tick is not going to be stopped and the already selected state may not fit
>> before the next tick time.
>>
>> Patch 7 deals with the situation in which the tick was stopped previously,
>> but the idle governor still predicts short idle.
>
> This series is complementary to the poll_idle() patch at
>
> https://patchwork.kernel.org/patch/10282237/
>
> Thanks,
> Rafael
>
--
Dipl. Inf. Thomas Ilsche
Computer Scientist
Highly Adaptive Energy-Efficient Computing
CRC 912 HAEC: http://tu-dresden.de/sfb912
Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
01062 Dresden, Germany
Phone: +49 351 463-42168
Fax: +49 351 463-37773
E-Mail: thomas.ilsche@...dresden.de