Date:	Thu,  4 Apr 2013 10:00:41 +0800
From:	Alex Shi <alex.shi@...el.com>
To:	mingo@...hat.com, peterz@...radead.org, tglx@...utronix.de,
	akpm@...ux-foundation.org, arjan@...ux.intel.com, bp@...en8.de,
	pjt@...gle.com, namhyung@...nel.org, efault@....de,
	morten.rasmussen@....com
Cc:	vincent.guittot@...aro.org, gregkh@...uxfoundation.org,
	preeti@...ux.vnet.ibm.com, viresh.kumar@...aro.org,
	linux-kernel@...r.kernel.org, alex.shi@...el.com,
	len.brown@...el.com, rafael.j.wysocki@...el.com, jkosina@...e.cz,
	clark.williams@...il.com, tony.luck@...el.com,
	keescook@...omium.org, mgorman@...e.de, riel@...hat.com
Subject: [patch v7 0/21] sched: power aware scheduling

Many thanks to Namhyung, PJT, Vincent and Preeti for their comments and suggestions!
This version includes the following changes:
a, remove the 3rd patch, to restore runnable load avg recording on rt
b, check avg_idle for wakeup bursts on each cpu, not only on the waking CPU.
c, fix the select_task_rq_fair return -1 bug reported by Preeti.

--------------

This patch set implements and completes the rough power aware scheduling
proposal: https://lkml.org/lkml/2012/8/13/139.

The code is also available in this git tree:
https://github.com/alexshi/power-scheduling.git power-scheduling

The patch set defines a new policy, 'powersaving', which tries to pack tasks
at each sched group level. This can save considerable power when the number
of tasks in the system is no more than the number of logical CPUs (LCPUs).

As mentioned in the power aware scheduling proposal, power aware
scheduling rests on 2 assumptions:
1, race to idle is helpful for power saving
2, fewer active sched groups reduce cpu power consumption

The first assumption makes the performance policy take over scheduling when
any group is busy.
The second assumption makes power aware scheduling try to pack dispersed
tasks into fewer groups.
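The packing behaviour behind the second assumption can be sketched as a toy
model (purely illustrative Python, not the kernel code; the group capacities
and the left-to-right fill order are my assumptions for illustration):

```python
def pack_tasks(num_tasks, groups):
    """Toy model of 'powersaving' packing: fill sched groups left to
    right up to their capacity, instead of spreading tasks evenly.

    groups: list of per-group capacities (e.g. logical CPUs per group).
    Returns the task count placed in each group."""
    placement = [0] * len(groups)
    remaining = num_tasks
    for i, cap in enumerate(groups):
        take = min(cap, remaining)
        placement[i] = take
        remaining -= take
        if remaining == 0:
            break
    return placement

# 4 groups of 4 LCPUs each; 6 tasks fill two groups and leave
# the other two groups fully idle.
print(pack_tasks(6, [4, 4, 4, 4]))  # [4, 2, 0, 0]
```

With fewer tasks than LCPUs, whole groups stay idle, which is where the
power saving comes from.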

This feature leaves more cpu cores idle, which gives the active cores more
chances for cpu freq boost. CPU freq boost gives both better performance and
better power efficiency. The following kbuild test results show this point.

Compared to the power balance code that was removed, this power balance has
the following advantages:
1, simpler sysfs interface
	only 2 sysfs knobs VS 2 knobs for each LCPU
2, covers all cpu topologies
	effective at all domain levels VS only working on SMT/MC domains
3, less task migration
	mutually exclusive perf/power LB VS balancing power on balanced performance
4, system load threshold considered
	yes VS no
5, transitory tasks considered
	yes VS no
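The sysfs interface in point 1 could be driven from userspace along these
lines. This is only a sketch: the knob path is an assumption based on the
patch subjects (patch 04/21 adds a sysfs interface for
sched_balance_policy) and should be verified against the applied tree:

```python
def set_balance_policy(policy,
                       path="/sys/devices/system/cpu/sched_balance_policy"):
    """Write the scheduler balance policy to its sysfs knob.

    The default path above is assumed from the patch descriptions,
    not confirmed; check the applied kernel tree for the real location."""
    valid = {"performance", "powersaving"}
    if policy not in valid:
        raise ValueError("policy must be one of %s" % sorted(valid))
    with open(path, "w") as f:
        f.write(policy)
```

Run as root, e.g. set_balance_policy("powersaving") before a throughput-
insensitive workload, and back to "performance" afterwards.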

BTW, like sched numa, power aware scheduling is also a kind of
cpu-locality-oriented scheduling.

Thanks for the comments/suggestions from PeterZ, Linus Torvalds, Andrew
Morton, Ingo, Len Brown, Arjan, Borislav Petkov, PJT, Namhyung Kim, Mike
Galbraith, Greg, Preeti, Morten Rasmussen, Rafael etc.

Since the patch set can pack tasks into fewer groups almost perfectly, I
just show some performance/power testing data here:
=========================================
$for ((i = 0; i < x; i++)) ; do while true; do :; done  &   done

On my SNB laptop with 4 cores * HT (the data is avg Watts):
         powersaving     performance
x = 8	 72.9482 	 72.6702
x = 4	 61.2737 	 66.7649
x = 2	 44.8491 	 59.0679
x = 1	 43.225 	 43.0638

on SNB EP machine with 2 sockets * 8 cores * HT:
         powersaving     performance
x = 32	 393.062 	 395.134
x = 16	 277.438 	 376.152
x = 8	 209.33 	 272.398
x = 4	 199 	         238.309
x = 2	 175.245 	 210.739
x = 1	 174.264 	 173.603


Benchmark with a fluctuating number of tasks: 'make -j <x> vmlinux'
on my SNB EP 2 sockets machine with 8 cores * HT:
         powersaving              performance
x = 2    189.416 /228 23          193.355 /209 24
x = 4    215.728 /132 35          219.69 /122 37
x = 8    244.31 /75 54            252.709 /68 58
x = 16   299.915 /43 77           259.127 /58 66
x = 32   341.221 /35 83           323.418 /38 81

Data explanation, taking '189.416 /228 23' as an example:
	189.416: average Watts during compilation
	228: seconds (compile time)
	23:  scaled performance/watt = 1000000 / seconds / watts
The kbuild performance value is better at 16/32 threads; that is because
lazy power balance reduces context switches and the CPU gets more boost
chances under powersaving balance.
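For reference, the scaled performance/watt values above can be recomputed
directly from the watts and seconds columns (the table appears to truncate
the result to an integer):

```python
def perf_per_watt(watts, seconds):
    # Scaled performance/watt, as defined above: 1000000 / seconds / watts
    return 1000000 / seconds / watts

# First kbuild row, powersaving: 189.416 W over 228 s
print(int(perf_per_watt(189.416, 228)))  # 23

# x = 16 row, powersaving: 299.915 W over 43 s
print(int(perf_per_watt(299.915, 43)))   # 77
```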

Some performance testing results:
---------------------------------

Tested benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
hackbench, fileio-cfq of sysbench, dbench, aiostress, and multi-threaded
loopback netperf, on my core2, nhm, wsm and snb platforms.

results:
A, no clear performance change found with the 'performance' policy.
B, specjbb2005 drops 5~7% with the powersaving policy, with either openjdk
   or jrockit.
C, hackbench drops 40% with the powersaving policy on snb 4-socket platforms.
Others show no clear change.

===
Changelog:
V7 change:
a, remove the 3rd patch, to restore runnable load avg recording on rt
b, check avg_idle for wakeup bursts on each cpu, not only on the waking CPU.
c, fix the select_task_rq_fair return -1 bug reported by Preeti.

V6 change:
a, remove 'balance' policy.
b, consider RT task effect in balancing
c, use avg_idle as burst wakeup indicator
d, balance on task utilization in fork/exec/wakeup.
e, no power balancing on SMT domain.
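Item c's burst detection can be illustrated with a toy check (the threshold
value below is hypothetical; the patches add a tunable named
sched_burst_threshold_ns, but its default is not stated here):

```python
# Hypothetical threshold for illustration only; the real tunable is
# sched_burst_threshold_ns from patches 11-12 of this series.
SCHED_BURST_THRESHOLD_NS = 1_000_000  # 1 ms, made up for this sketch

def is_bursty_wakeup(avg_idle_ns, threshold_ns=SCHED_BURST_THRESHOLD_NS):
    """A short recent average idle time on a cpu suggests wakeups are
    arriving in bursts; in that case packing is skipped and the
    performance path is used instead."""
    return avg_idle_ns < threshold_ns
```

E.g. a cpu whose avg_idle is 0.2 ms would be treated as bursty, while one
idling 5 ms on average would still be a packing candidate.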

V5 change:
a, change sched_policy to sched_balance_policy
b, split fork/exec/wake power balancing into 3 patches and refresh
commit logs
c, other minor cleanups

V4 change:
a, fix a few bugs and clean up code according to comments from Morten
Rasmussen, Mike Galbraith and Namhyung Kim. Thanks!
b, take Morten Rasmussen's suggestion to use different criteria for
different policies in transitory task packing.
c, shorter latency in power aware scheduling.

V3 change:
a, engage nr_running and utilisation in periodic power balancing.
b, try packing small exec/wake tasks on a running cpu, not an idle cpu.

V2 change:
a, add lazy power scheduling to deal with kbuild like benchmark.


-- Thanks Alex
[patch v7 01/21] Revert "sched: Introduce temporary FAIR_GROUP_SCHED
[patch v7 02/21] sched: set initial value of runnable avg for new
[patch v7 03/21] sched: add sched balance policies in kernel
[patch v7 04/21] sched: add sysfs interface for sched_balance_policy
[patch v7 05/21] sched: log the cpu utilization at rq
[patch v7 06/21] sched: add new sg/sd_lb_stats fields for incoming
[patch v7 07/21] sched: move sg/sd_lb_stats struct ahead
[patch v7 08/21] sched: scale_rt_power rename and meaning change
[patch v7 09/21] sched: get rq potential maximum utilization
[patch v7 10/21] sched: add power aware scheduling in fork/exec/wake
[patch v7 11/21] sched: add sched_burst_threshold_ns as wakeup burst
[patch v7 12/21] sched: using avg_idle to detect bursty wakeup
[patch v7 13/21] sched: packing transitory tasks in wakeup power
[patch v7 14/21] sched: add power/performance balance allow flag
[patch v7 15/21] sched: pull all tasks from source group
[patch v7 16/21] sched: no balance for prefer_sibling in power
[patch v7 17/21] sched: add new members of sd_lb_stats
[patch v7 18/21] sched: power aware load balance
[patch v7 19/21] sched: lazy power balance
[patch v7 20/21] sched: don't do power balance on share cpu power
[patch v7 21/21] sched: make sure select_tas_rq_fair get a cpu
--
