linux-kernel - [RFC v2 PATCH 0/8] CFS Hard limits

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20090930124919.GA19951@in.ibm.com>
Date:	Wed, 30 Sep 2009 18:19:20 +0530
From:	Bharata B Rao <bharata@...ux.vnet.ibm.com>
To:	linux-kernel@...r.kernel.org
Cc:	Dhaval Giani <dhaval@...ux.vnet.ibm.com>,
	Balbir Singh <balbir@...ux.vnet.ibm.com>,
	Vaidyanathan Srinivasan <svaidy@...ux.vnet.ibm.com>,
	Gautham R Shenoy <ego@...ibm.com>,
	Srivatsa Vaddagiri <vatsa@...ibm.com>,
	Ingo Molnar <mingo@...e.hu>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Pavel Emelyanov <xemul@...nvz.org>,
	Herbert Poetzl <herbert@...hfloor.at>,
	Avi Kivity <avi@...hat.com>,
	Chris Friesen <cfriesen@...tel.com>,
	Paul Menage <menage@...gle.com>,
	Mike Waychison <mikew@...gle.com>
Subject: [RFC v2 PATCH 0/8] CFS Hard limits - v2

Hi,

Here is the v2 post of hard limits feature for CFS group scheduler. This
RFC post mainly adds runtime borrowing feature and has a new locking scheme
to protect CFS runtime related fields.

It would be nice to have some comments on this set!

Changes
-------

RFC v2:
- Upgraded to 2.6.31.
- Added CFS runtime borrowing.
- New locking scheme
    The hard limit specific fields of cfs_rq (cfs_runtime, cfs_time and
    cfs_throttled) were being protected by rq->lock. This simple scheme will
    not work when runtime rebalancing is introduced where it will be required
    to look at these fields on other CPU's which requires us to acquire
    rq->lock of other CPUs. This will not be feasible from update_curr().
    Hence introduce a separate lock (rq->runtime_lock) to protect these
    fields of all cfs_rq under it.
- Handle the task wakeup in a throttled group correctly.
- Make CFS_HARD_LIMITS dependent on CGROUP_SCHED (Thanks to Andrea Righi)

RFC v1:
- First version of the patches with minimal features was posted at
  http://lkml.org/lkml/2009/8/25/128

RFC v0:
- The CFS hard limits proposal was first posted at
  http://lkml.org/lkml/2009/6/4/24

Testing and Benchmark numbers
-----------------------------
- This patchset has seen very minimal testing on 24way machine and is expected
  to have bugs. I need to test this under more test scenarios.
- I have run a few common benchmarks to see if my patches introduce any visible
  overhead. I am aware that the number of runs or the combinations I have
  used may not be ideal, but the intention in this early stage is to catch any
  serious regressions that the patches would have introduced.
- I plan to get numbers from more benchmarks in future releases. Any inputs
  on specific benchmarks to try would be helpful.

- hackbench (hackbench -pipe N)
  (hackbench was run as part of a group under root group)
  -----------------------------------------------------------------------
				Time
	-----------------------------------------------------------------
  N	CFS_HARD_LIMTS=n	CFS_HARD_LIMTS=y	CFS_HARD_LIMITS=y
				(infinite runtime)	(BW=450000/500000)
  -----------------------------------------------------------------------
  10	0.475			0.384			0.253
  20	0.610			0.670			0.692
  50	1.250			1.201			1.295
  100	1.981			2.174			1.583
  -----------------------------------------------------------------------
  - BW = Bandwidth = runtime/period
  - Infinite runtime means no hard limiting

- lmbench (lat_ctx -N 5 -s <size_in_kb> N)

  (i) size_in_kb = 1024
  -----------------------------------------------------------------------
				Context switch time (us)
	-----------------------------------------------------------------
  N	CFS_HARD_LIMTS=n	CFS_HARD_LIMTS=y	CFS_HARD_LIMITS=y
				(infinite runtime)	(BW=450000/500000)
  -----------------------------------------------------------------------
  10	315.87			330.19			317.04
  100	675.52			699.90			698.50
  500	775.01			772.86			772.30
  -----------------------------------------------------------------------

  (ii) size_in_kb = 2048
  -----------------------------------------------------------------------
				Context switch time (us)
	-----------------------------------------------------------------
  N	CFS_HARD_LIMTS=n	CFS_HARD_LIMTS=y	CFS_HARD_LIMITS=y
				(infinite runtime)	(BW=450000/500000)
  -----------------------------------------------------------------------
  10	1319.01			1332.16			1328.09
  100	1400.77			1372.67			1382.27
  500	1479.40			1524.57			1615.84
  -----------------------------------------------------------------------

- kernbench

Average Half load -j 12 Run (std deviation):
------------------------------------------------------------------------------
	CFS_HARD_LIMTS=n	CFS_HARD_LIMTS=y	CFS_HARD_LIMITS=y
				(infinite runtime)	(BW=450000/500000)
------------------------------------------------------------------------------
Elapsd	5.716 (0.278711)	6.06 (0.479322)		5.41 (0.360694)
User	20.464 (2.22087)	22.978 (3.43738)	18.486 (2.60754)
System	14.82 (1.52086)		16.68 (2.3438)		13.514 (1.77074)
% CPU	615.2 (41.1667)		651.6 (43.397)		588.4 (42.0214)
CtxSwt	2727.8 (243.19)		3030.6 (425.338)	2536 (302.498)
Sleeps	4981.4 (442.337)	5532.2 (847.27)		4554.6 (510.532)
------------------------------------------------------------------------------

Average Optimal load -j 96 Run (std deviation):
------------------------------------------------------------------------------
	CFS_HARD_LIMTS=n	CFS_HARD_LIMTS=y	CFS_HARD_LIMITS=y
				(infinite runtime)	(BW=450000/500000)
------------------------------------------------------------------------------
Elapsd	4.826 (0.276641)	4.776 (0.291599)	5.13 (0.50448)
User	21.278 (2.67999)	22.138 (3.2045)		21.988 (5.63116)
System	19.213 (5.38314)	19.796 (4.32574)	20.407 (8.53682)
% CPU	778.3 (184.522)		786.1 (154.295)		803.1 (244.865)
CtxSwt	2906.5 (387.799)	3052.1 (397.15)		3030.6 (765.418)
Sleeps	4576.6 (565.383)	4796 (990.278)		4576.9 (625.933)
------------------------------------------------------------------------------

Average Maximal load -j Run (std deviation):
------------------------------------------------------------------------------
	CFS_HARD_LIMTS=n	CFS_HARD_LIMTS=y	CFS_HARD_LIMITS=y
				(infinite runtime)	(BW=450000/500000)
------------------------------------------------------------------------------
Elapsd	5.13 (0.530236)		5.062 (0.0408656)	4.94 (0.229891)
User	22.7293 (4.37921)	22.9973 (2.86311)	22.5507 (4.78016)
System	21.966 (6.81872)	21.9713 (4.72952)	22.0287 (7.39655)
% CPU	860 (202.295)		859.8 (164.415)		864.467 (218.721)
CtxSwt	3154.27 (659.933)	3172.93 (370.439)	3127.2 (657.224)
Sleeps	4602.6 (662.155)	4676.67 (813.274)	4489.2 (542.859)
------------------------------------------------------------------------------

Features TODO
-------------
- CFS runtime borrowing still needs some work, especially need to handle
  runtime redistribution when a CPU goes offline.
- Bandwidth inheritance support (long term, not under consideration currently)
- This implementation doesn't work for user group scheduler. Since user group
  scheduler will eventually go away, I don't plan to work on this.

Implementation TODO
-------------------
- It is possible to share some of the bandwidth handling code with RT, but
  the intention of this post is to show the changes associated with hard limits.
  Hence the sharing/cleanup will be done down the line when this patchset
  itself becomes more accepatable.
- When a dequeued entity is enqueued back, I don't change its vruntime. The
  entity might get undue advantage due to its old (lower) vruntime. Need to
  address this.

Patches description
-------------------
This post has the following patches:

1/8 sched: Rename sched_rt_period_mask() and use it in CFS also
2/8 sched: Maintain aggregated tasks count in cfs_rq at each hierarchy level
3/8 sched: Bandwidth initialization for fair task groups
4/8 sched: Enforce hard limits by throttling
5/8 sched: Unthrottle the throttled tasks
6/8 sched: Add throttle time statistics to /proc/sched_debug
7/8 sched: CFS runtime borrowing
8/8 sched: Hard limits documentation

 Documentation/scheduler/sched-cfs-hard-limits.txt |   52 ++
 include/linux/sched.h                               |    9 
 init/Kconfig                                        |   13 
 kernel/sched.c                                      |  427 +++++++++++++++++++
 kernel/sched_debug.c                                |   21 
 kernel/sched_fair.c                                 |  432 +++++++++++++++++++-
 kernel/sched_rt.c                                   |   22 -
 7 files changed, 932 insertions(+), 44 deletions(-)

Regards,
Bharata.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/