linux-kernel - Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede

Message-ID: <20110908151433.GB6587@linux.vnet.ibm.com>
Date:	Thu, 8 Sep 2011 20:45:07 +0530
From:	Srivatsa Vaddagiri <vatsa@...ux.vnet.ibm.com>
To:	Peter Zijlstra <a.p.zijlstra@...llo.nl>
Cc:	Paul Turner <pjt@...gle.com>,
	Kamalesh Babulal <kamalesh@...ux.vnet.ibm.com>,
	Vladimir Davydov <vdavydov@...allels.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Bharata B Rao <bharata@...ux.vnet.ibm.com>,
	Dhaval Giani <dhaval.giani@...il.com>,
	Vaidyanathan Srinivasan <svaidy@...ux.vnet.ibm.com>,
	Ingo Molnar <mingo@...e.hu>,
	Pavel Emelianov <xemul@...allels.com>
Subject: Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs
 unpinnede

* Peter Zijlstra <a.p.zijlstra@...llo.nl> [2011-09-07 21:22:22]:

> On Wed, 2011-09-07 at 20:50 +0530, Srivatsa Vaddagiri wrote:
> > 
> > Fix excessive idle time reported when cgroups are capped. 
> 
> Where from? The whole idea of bandwidth caps is to introduce idle time,
> so what's excessive and where does it come from?

We have set up cgroups and their hard limits so that, in theory, they
should consume the entire capacity available on the machine, leading to
0% idle time. That's not what we see. A more detailed description of the
setup and the problem is here:

https://lkml.org/lkml/2011/6/7/352

but to quickly summarize it, the machine and the test case are as below:

Machine : 16 CPUs (2 quad-core sockets w/ HT enabled)
Cgroups : 5 in number (C1-C5), each having {2, 2, 4, 8, 16} tasks respectively.
	  Further, each task is placed in its own (sub-)cgroup with 
	  a capped usage of 50% CPU.

	/C1/C1_1/Task1	-> capped at 50% cpu usage
	/C1/C1_2/Task2	-> capped at 50% cpu usage
	/C2/C2_1/Task3	-> capped at 50% cpu usage
	/C2/C2_2/Task4	-> capped at 50% cpu usage
	/C3/C3_1/Task5	-> capped at 50% cpu usage
	/C3/C3_2/Task6	-> capped at 50% cpu usage
	/C3/C3_3/Task7	-> capped at 50% cpu usage
	/C3/C3_4/Task8	-> capped at 50% cpu usage
	...
	/C5/C5_16/Task32 -> capped at 50% cpu usage

So we have 32 tasks, each capped at 50% CPU usage, running on a 16-CPU
system. One would expect 0% idle time in this scenario, which was found
not to be the case. With early versions of cfs hardlimits, up to ~20%
idle time was seen, though with the current version in tip, we see up to
~10% idle time (when cfs.period = 100ms), which goes down to ~5% when
cfs.period is set to 500ms.
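
To make the "0% idle" expectation concrete, here is a minimal sketch of the
arithmetic. The function names are made up for illustration and are not
kernel interfaces. It assumes every task is CPU-bound and would consume
its full cap if it were never left waiting.

```python
# Back-of-the-envelope check of the test setup described above.
# All names here are hypothetical helpers, not kernel or cgroup APIs.

TASKS_PER_GROUP = [2, 2, 4, 8, 16]   # C1..C5
NUM_CPUS = 16
CAP_FRACTION = 0.5                    # each task capped at 50% of one CPU


def total_demand_cpus(num_tasks, cap_fraction):
    """CPUs' worth of time the tasks may consume if all run up to their cap."""
    return num_tasks * cap_fraction


def expected_idle_fraction(num_tasks, cap_fraction, num_cpus):
    """Idle fraction of the machine under perfect task placement."""
    demand = min(total_demand_cpus(num_tasks, cap_fraction), num_cpus)
    return 1.0 - demand / num_cpus


def quota_us(period_us, cap_fraction):
    """Runtime allowed per period for a given cap, in microseconds."""
    return int(period_us * cap_fraction)


num_tasks = sum(TASKS_PER_GROUP)
ideal_idle = expected_idle_fraction(num_tasks, CAP_FRACTION, NUM_CPUS)
quota_100ms = quota_us(100_000, CAP_FRACTION)
quota_500ms = quota_us(500_000, CAP_FRACTION)
```

With these numbers the 32 tasks can together consume exactly 16 CPUs'
worth of time, so any idle time observed comes from placement, not from
the caps themselves.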

From what I could find out, the "excess" idle time crops up because
the load balancer is not perfect. For example, there are instances when
a CPU has just 1 task on its runqueue (rather than the ideal number of 2
tasks/cpu). When that lone task exceeds its 50% limit, the cpu is forced
to become idle.
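
The effect can be illustrated with a toy model (a sketch only, not how the
scheduler accounts time). It assumes task placement stays fixed for a
whole period and that each task is CPU-bound, so a cpu with n tasks is
busy for min(1, n * 0.5) of the period. The skewed placement below is a
made-up example, not measured data.

```python
# Toy model: idle time caused purely by uneven task placement.
# Assumes static placement over a period; names are hypothetical.

def cpu_busy_fraction(tasks_on_cpu, cap_fraction=0.5):
    """Fraction of a period a cpu stays busy with capped, CPU-bound tasks."""
    return min(1.0, tasks_on_cpu * cap_fraction)


def machine_idle_fraction(tasks_per_cpu, cap_fraction=0.5):
    """Idle fraction of the whole machine for a given placement."""
    busy = sum(cpu_busy_fraction(n, cap_fraction) for n in tasks_per_cpu)
    return 1.0 - busy / len(tasks_per_cpu)


# Ideal placement: 2 tasks on each of the 16 cpus.
balanced = [2] * 16

# Hypothetical imbalance: two cpus hold 1 task, two cpus hold 3 tasks.
skewed = [1, 1, 3, 3] + [2] * 12

idle_balanced = machine_idle_fraction(balanced)
idle_skewed = machine_idle_fraction(skewed)
```

In this model the cpus with 3 tasks cannot make up for the cpus with a
lone task, because a cpu saturates at 100% while the third task's
allowance goes unused. Two misplaced tasks out of 32 already cost 6.25%
of machine capacity.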

> >  The patch introduces the notion of "steal" 
> 
> The virt folks already claimed steal-time and have it mean something
> entirely different. You get to pick a new name.

grace time?

> > (or "grace") time which is the surplus
> > time/bandwidth each cgroup is allowed to consume, subject to a maximum
> > steal time (sched_cfs_max_steal_time_us). Cgroups are allowed this "steal"
> > or "grace" time when the lone task running on a cpu is about to be throttled.
> 
> Ok, so this is a solution to an unstated problem. Why is it a good
> solution?

I am not sure if there are any "good" solutions to this problem! One
possibility is to make the idle load balancer more aggressive in
pulling tasks across sched-domain boundaries, i.e. when a CPU becomes
idle (after a task got throttled) and invokes the idle load balancer, it
should try "harder" at pulling a task from far-off cpus (across
package/node boundaries)?

> Also, another tunable, yay!

- vatsa