linux-kernel - Re: [RFC PATCH V2 02/19] sched/power: Move idle state selection into the scheduler


Message-ID: <alpine.LFD.2.11.1408181400220.29347@knanqh.ubzr>
Date:	Mon, 18 Aug 2014 14:25:04 -0400 (EDT)
From:	Nicolas Pitre <nicolas.pitre@...aro.org>
To:	Preeti U Murthy <preeti@...ux.vnet.ibm.com>
cc:	alex.shi@...el.com, vincent.guittot@...aro.org,
	peterz@...radead.org, pjt@...gle.com, efault@....de,
	rjw@...ysocki.net, morten.rasmussen@....com,
	svaidy@...ux.vnet.ibm.com, arjan@...ux.intel.com, mingo@...nel.org,
	len.brown@...el.com, yuyang.du@...el.com,
	linaro-kernel@...ts.linaro.org, daniel.lezcano@...aro.org,
	corbet@....net, catalin.marinas@....com, markgross@...gnar.org,
	sundar.iyer@...el.com, linux-kernel@...r.kernel.org,
	dietmar.eggemann@....com, Lorenzo.Pieralisi@....com,
	mike.turquette@...aro.org, akpm@...ux-foundation.org,
	paulmck@...ux.vnet.ibm.com, tglx@...utronix.de
Subject: Re: [RFC PATCH V2 02/19] sched/power: Move idle state selection into
 the scheduler

On Mon, 18 Aug 2014, Preeti U Murthy wrote:

> On 08/18/2014 09:24 PM, Nicolas Pitre wrote:
> > On Mon, 11 Aug 2014, Preeti U Murthy wrote:
> > 
> >> The goal of the power aware scheduling design is to integrate all
> >> policy, metrics and averaging into the scheduler. Today the
> >> cpu power management is fragmented and hence inconsistent.
> >>
> >> As a first step towards this integration, rid the cpuidle state management
> >> of the governors. Retain only the cpuidle driver in the cpu idle
> >> subsystem which acts as an interface between the scheduler and low
> >> level platform specific cpuidle drivers. For all decision making around
> >> selection of idle states, the cpuidle driver falls back to the scheduler.
> >>
> >> The current algorithm for idle state selection is the same as the logic used
> >> by the menu governor. However going ahead the heuristics will be tuned and
> >> improved upon with metrics better known to the scheduler.
> > 
> > I'd strongly suggest a different approach here.  Instead of copying the 
> > menu governor code and tweaking it afterwards, it would be cleaner to 
> > literally start from scratch with a new governor.  Said new governor 
> > would grow inside the scheduler with more design freedom instead of 
> > being strapped on the side.
> > 
> > By copying existing code, the chance for cruft to remain for a long time 
> > is close to 100%. We already have one copy of it, let's keep it working 
> > and start afresh instead.
> > 
> > By starting clean it is way easier to explain and justify additions to a 
> > new design than convincing ourselves about the removal of no longer 
> > needed pieces from a legacy design.
> 
> Ok. The reason I did it this way was that I did not find anything
> grossly wrong in the current cpuidle governor algorithm. Of course this
> can be improved but I did not see strong reasons to completely wipe it
> away. I see good scope to improve upon the existing algorithm with
> additional knowledge of *the idle states being mapped to scheduling
> domains*. This will in itself give us a better algorithm and does not
> mandate significant changes from the current algorithm. So I really
> don't see why we need to start from scratch.

Sure, the current algorithm can be improved.  But it has its limitations 
by design.  And simply making it more topology aware wouldn't justify 
moving it into the scheduler.

What we're contemplating is something completely integrated with the 
scheduler where cpuidle and cpufreq (and eventually thermal management) 
together are part of the same "governor" to provide global decisions on 
all fronts.

Not only should the next wake-up event be predicted, but also the 
anticipated system load, etc.  The scheduler may know that a given CPU 
is unlikely to be used for a while and could call for the deepest 
C-state right away without waiting for the current menu heuristic to 
converge.
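
To make that concrete, here is a minimal userspace sketch, not kernel 
code.  It assumes the scheduler can hand over a predicted idle duration 
directly.  The struct, the function name and the demo table values are 
all hypothetical and only illustrate the idea of going straight to the 
deepest state that fits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified idle state description.
 * This is NOT the kernel's struct cpuidle_state. */
struct idle_state {
	const char *name;
	unsigned int exit_latency_us;     /* time needed to wake up */
	unsigned int target_residency_us; /* minimum stay to be worthwhile */
};

/* Illustrative table, ordered from shallowest to deepest (made-up values). */
static const struct idle_state demo_states[] = {
	{ "C1",    2,   10 },
	{ "C2",   50,  200 },
	{ "C3",  500, 2000 },
};

/*
 * Return the index of the deepest state whose target residency fits in
 * the predicted idle period and whose exit latency respects the limit.
 * Index 0 is the fallback when nothing deeper qualifies.
 */
static int pick_idle_state(const struct idle_state *states, size_t n,
			   unsigned int predicted_idle_us,
			   unsigned int latency_limit_us)
{
	int best = 0;

	assert(n > 0);
	for (size_t i = 1; i < n; i++) {
		if (states[i].target_residency_us > predicted_idle_us)
			continue; /* would not stay idle long enough */
		if (states[i].exit_latency_us > latency_limit_us)
			continue; /* wake-up would be too slow */
		best = (int)i;
	}
	return best;
}
```

With the demo table, a predicted idle period of 5000 us and a latency 
limit of 1000 us selects index 2 (the deepest state) on the first call, 
with no history needed.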

There is also Daniel's I/O latency tracking that could replace the menu 
governor latency guessing, the latter based on heuristics that could be 
described as black magic.

And all this has to eventually be policed by a global performance/power 
concern that should weigh C-states, P-states and task placement 
together and select the best combination (Morten's work).

Therefore the current menu algorithm won't do it. It simply wasn't 
designed for that.

We'll have the opportunity to discuss this further tomorrow anyway.

> The primary issue I found is that, with the goal being a power aware
> scheduler, we must ensure that it is no longer possible for a governor to
> register with cpuidle to choose idle states. The
> reason being there is just *one entity who will take this decision and
> there is no option about it*. This patch intends to bring the focus to
> this specific detail.

I think there is nothing wrong with having multiple governors being 
registered.  We simply decide at runtime via sysfs which one has control 
over the low-level cpuidle drivers.
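
As a rough illustration of that arrangement, here is a small userspace 
model, not the actual cpuidle code.  Several governors sit in a registry 
and exactly one is active at a time.  All names are hypothetical, and 
the switch that a sysfs write would trigger is modeled as a plain 
function call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_GOVERNORS 4

/* Hypothetical governor interface: a name plus a selection callback. */
struct governor {
	const char *name;
	int (*select)(unsigned int predicted_idle_us);
};

static const struct governor *registry[MAX_GOVERNORS];
static size_t nr_governors;
static const struct governor *current_gov;

/* Add a governor; the first one registered becomes the active one. */
static int register_governor(const struct governor *g)
{
	if (nr_governors == MAX_GOVERNORS)
		return -1;
	registry[nr_governors++] = g;
	if (!current_gov)
		current_gov = g;
	return 0;
}

/* Make the named governor active; unknown names leave things unchanged. */
static int switch_governor(const char *name)
{
	for (size_t i = 0; i < nr_governors; i++) {
		if (strcmp(registry[i]->name, name) == 0) {
			current_gov = registry[i];
			return 0;
		}
	}
	return -1;
}

/* Two toy policies standing in for a legacy and a scheduler-driven one. */
static int select_shallow(unsigned int us) { (void)us; return 0; }
static int select_by_threshold(unsigned int us) { return us >= 1000 ? 1 : 0; }

static const struct governor gov_legacy = { "legacy", select_shallow };
static const struct governor gov_sched  = { "sched",  select_by_threshold };
```

Both governors stay registered the whole time; only the current_gov 
pointer changes, so the low-level drivers always see a single decision 
maker.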


Nicolas