linux-kernel - Re: power-efficient scheduling design

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20130618190625.GA9065@MacBook-Pro.local>
Date:	Tue, 18 Jun 2013 20:06:25 +0100
From:	Catalin Marinas <catalin.marinas@....com>
To:	Arjan van de Ven <arjan@...ux.intel.com>
Cc:	Morten Rasmussen <Morten.Rasmussen@....com>,
	Ingo Molnar <mingo@...nel.org>,
	"alex.shi@...el.com" <alex.shi@...el.com>,
	"peterz@...radead.org" <peterz@...radead.org>,
	"preeti@...ux.vnet.ibm.com" <preeti@...ux.vnet.ibm.com>,
	"vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
	"efault@....de" <efault@....de>, "pjt@...gle.com" <pjt@...gle.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linaro-kernel@...ts.linaro.org" <linaro-kernel@...ts.linaro.org>,
	"len.brown@...el.com" <len.brown@...el.com>,
	"corbet@....net" <corbet@....net>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	"tglx@...utronix.de" <tglx@...utronix.de>
Subject: Re: power-efficient scheduling design

On Tue, Jun 18, 2013 at 04:20:28PM +0100, Arjan van de Ven wrote:
> On 6/14/2013 9:05 AM, Morten Rasmussen wrote:
> > Looking at the discussion it seems that people have slightly different
> > views, but most agree that the goal is an integrated scheduling,
> > frequency, and idle policy like you pointed out from the beginning.
> 
> ... except that such a solution does not really work for Intel hardware.

I think it can work (see below).

> The OS does not get to really pick the CPU "frequency" (never mind that
> frequency is not what gets controlled), the hardware picks the frequency.
> The OS can do some level of requests (best to think of this as a percentage
> more than frequency) but what you actually get is more often than not
> what you asked for.

Morten's proposal does not try to "pick" a frequency. The P-state change
is still done gradually based on the load (so we still have an adaptive
loop). The load (total or per-task) can be tracked in an arch-specific
way (using aperf/mperf on x86).

The difference from what intel_pstate.c does now is that it has a view
of the total load (across all CPUs) and the run-queue content. It can
"guide" the load balancer into favouring one or two CPUs and ignoring
the rest (using cpu_power).

If several CPUs have small aperf/mperf ratio, it can decide to use fewer
CPUs at a higher aperf/mperf by telling the load balancer not to use
them (cpu_power = 1). All of this is continuously re-adjusted to cope
with changes in the load and hardware variations like turbo boost.

Similarly, if a CPU has aperf/mperf >= 1, it keeps increasing the
P-state (depending on the policy). Once it got to the highest level,
depending on the number of threads in the run-queue (doesn't make sense
for only one), it can open up other CPUs and let the load balancer use
them.

> You can look in hindsight what kind of performance you got (from some basic
> counters in MSRs), and the scheduler can use that to account backwards to what some process
> got. But to predict what you will get in the future...... that's near impossible
> on any realistic system nowadays (and even more so in the future).

We don't need absolute figures matching load to P-states but we'll
continue with an adaptive system. What we have now is also an adaptive
system but with independent decisions taken by the load balancer and the
P-state driver. The load balancer can even get confused by the cpufreq
decisions and move tasks around unnecessarily. With Morten's proposal we
get the power scheduler to adjust the P-state while giving hints to the
load balancer at the same time (it adjusts both, it doesn't try to
re-adjust itself after the load balancer).

> Treating "frequency" (well "performance) and idle separately is also a false thing to do
> (yes I know in 3.9/3.10 we still do that for Intel hw, but we're working
> on fixing that). They are by no means separate things. One guy's idle state
> is the other guys power budget (and thus performance)!.

I agree.

-- 
Catalin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/