linux-kernel - Re: [RFC v5 PATCH 0/8] CFS Hard limits

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20100201082150.GB686@in.ibm.com>
Date:	Mon, 1 Feb 2010 13:51:50 +0530
From:	Bharata B Rao <bharata@...ux.vnet.ibm.com>
To:	Paul Turner <pjt@...gle.com>
Cc:	Bharata B Rao <bharata.rao@...il.com>,
	linux-kernel@...r.kernel.org,
	Dhaval Giani <dhaval.giani@...il.com>,
	Balbir Singh <balbir@...ux.vnet.ibm.com>,
	Vaidyanathan Srinivasan <svaidy@...ux.vnet.ibm.com>,
	Gautham R Shenoy <ego@...ibm.com>,
	Srivatsa Vaddagiri <vatsa@...ibm.com>,
	Kamalesh Babulal <kamalesh@...ux.vnet.ibm.com>,
	Ingo Molnar <mingo@...e.hu>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Pavel Emelyanov <xemul@...nvz.org>,
	Herbert Poetzl <herbert@...hfloor.at>,
	Avi Kivity <avi@...hat.com>,
	Chris Friesen <cfriesen@...tel.com>,
	Paul Menage <menage@...gle.com>,
	Mike Waychison <mikew@...gle.com>
Subject: Re: [RFC v5 PATCH 0/8] CFS Hard limits - v5

On Thu, Jan 28, 2010 at 08:26:08PM -0800, Paul Turner wrote:
> On Thu, Jan 28, 2010 at 7:49 PM, Bharata B Rao <bharata.rao@...il.com> wrote:
> > On Sat, Jan 9, 2010 at 2:15 AM, Paul Turner <pjt@...gle.com> wrote:
> >>
> >> What are your thoughts on using a separate mechanism for the general case.  A
> >> draft proposal follows:
> >>
> >> - Maintain a global run-time pool for each tg.  The runtime specified by the
> >>  user represents the value that this pool will be refilled to each period.
> >> - We continue to maintain the local notion of runtime/period in each cfs_rq,
> >>  continue to accumulate locally here.
> >>
> >> Upon locally exceeding the period acquire new credit from the global pool
> >> (either under lock or more likely using atomic ops).  This can either be in
> >> fixed steppings (e.g. 10ms, could be tunable) or following some quasi-curve
> >> variant with historical demand.
> >>
> >> One caveat here is that there is some over-commit in the system, the local
> >> differences of runtime vs period represent additional over the global pool.
> >> However it should not be possible to consistently exceed limits since the rate
> >> of refill is gated by the runtime being input into the system via the per-tg
> >> pool.
> >>
> >
> > We borrow from what is actually available as spare (spare = unused or
> > remaining). With global pool, I see that would be difficult.
> > Inability/difficulty in keeping the global pool in sync with the
> > actual available spare time is the reason for over-commit ?
> >
> 
> We maintain two pools, a global pool (new) and a per-cfs_rq pool
> (similar to existing rt_bw).
> 
> When consuming time you charge vs your local bandwidth until it is
> expired, at this point you must either refill from the global pool, or
> throttle.
> 
> The "slack" in the system is the sum of unconsumed time in local pools
> from the *previous* global pool refill.  This is bounded above by the
> size of time you refill a local pool at each expiry.  We call the size
> of refill a 'slice'.
> 
> e.g.
> 
> Task limit of 50ms, slice=10ms, 4cpus, period of 500ms
> 
> Task A runs on cpus 0 and 1 for 5ms each, then blocks.
> 
> When A first executes on each cpu we take slice=10ms from the global
> pool of 50ms and apply it to the local rq.  Execution then proceeds vs
> local pool.
> 
> Current state is: 5 ms in local pools on {0,1}, 30ms remaining in global pool
> 
> Upon period expiration we issue a global pool refill.  At this point we have:
> 5 ms in local pools on {0,1}, 50ms remaining in global pool.
> 
> That 10ms of slack time is over-commit in the system.  However it
> should be clear that this can only be a local effect since over any
> period of time the rate of input into the system is limited by global
> pool refill rate.

With the same setup as above consider 5 such tasks which block after
consuming 5ms each. So now we have 25ms slack time. In the next bandwidth
period if 5 cpu hogs start running and they would consume this 25ms and the
50ms from this period. So we gave 50% extra to a group in a bandwidth period.
Just wondering how common such scenarious could be.

> 
> There are also some strategies that we are exploring to improve
> behavior further here.  One idea is that if we maintain a generation
> counter then on voluntary dequeue (e.g. tasks block) we can return
> local time to the global period pool or expire it (if generations
> don't match), this greatly reduces the noise (via slack vs ideal
> limit) that a busty application can induce.

Why not clear the remaining runtime during bandwidth refresh ?

> 
> >> This would also naturally associate with an interface change that would mean the
> >> runtime limit for a group would be the effective cpurate within the period.
> >>
> >> e.g. by setting a runtime of 200000us on a 100000us period it would effectively
> >> allow you to use 2 cpus worth of wall-time on a multicore system.
> >>
> >> I feel this is slightly more natural than the current definition which due to
> >> being local means that values set will not result in consistent behavior across
> >> machines of different core counts.  It also has the benefit of being consistent
> >> with observed exports of time consumed, e.g. rusage, (indirectly) time, etc.
> >
> > Though runtimes are enforced locally per-cpu, that's only the
> > implementation. The definition of runtime and period is still
> > system-wide/global. A runtime/period=0.25/0.5 will mean 0.25s of
> > system wide runtime within a period of 0.5s. Talking about consistent
> > definition, I would say this consistently defines half of system wide
> > wall-time on all configurations :)
> 
> This feels non-intuitive suppose you have a non-homogeneous fleet of
> systems.  It is also difficult to express limits in terms of cores,
> suppose I'm an admin trying to jail my users (maybe I rent out virtual
> time ala EC2 for example).  The fractions I have to use to represent
> integer core amounts are going to become quite small on large systems.
>  For example, 1 core on a 64 core system would mean 3.905ms/250ms
> period.  What's the dependency here between your time and the current
> cpuset topology also, if I'm only allowed on half the system does this
> fraction then refer to global resources or what I'm constrained to?
> This seems a difficult data dependency to manage.
> 
> My (personal) ideology is that we are metering at the cpu level as
> opposed to the system level -- which means N seconds of cpu-time makes
> more sense to me.  I feel it has advantages in that it can be
> specified more directly relative to the period and is independent of
> system partitioning.
> 
> I'd be interested to hear other opinions on this.

We need a consensus here, will wait to see what others think about this.

> 
> > If it means 2 CPUs worth wall-time
> > in 4 core machine, it would mean 4 CPUs on a 8 CPU machine.  At this
> > point, I am inclined to go with this and let the admins/tools work out
> > the actual CPUs part of it. However I would like to hear what others
> > think about this interface.
> >
> >>
> >> For future scalability as machine size grows this could potentially be
> >> partitioned below the tg level along the boundaries of sched_domains (or
> >> something similar).  However for an initial draft given current machine sizes
> >> the contention on the global pool should hopefully be fairly low.
> >
> > One of the alternatives I have in mind is to be more aggressive while
> > borrowing. While keeping the current algorithm (of iterating thro' all
> > CPUs when borrowing) intact, we could potentially borrow more from
> > those CPUs which don't have any running task from the given group. I
> > just experimented with borrowing half of the available runtime from
> > such CPUs and found that number of iterations are greatly reduced and
> > the source runtime quickly converges to its max possible value. Do you
> > see any issues with this ?
> >
> 
> I strongly believe that this is going to induce significant lock
> contention and is not a scalable solution over time.  While using a
> faster converging series for time may help I think there are
> confounding factors that limit effect here.  Consider the 1 core on a
> 64 core system example above.  With only 3.905ms per pool we are going
> to quickly hit the case where we are borrowing not-useful periods of
> time while thrashing locks.
> 
> We are in the midst of an implementation for proposal above which
> we'll have ready post to here for consideration next week.  We have
> maintained your existing approach with respect to handling throttled
> entities and layered on top of that the proposed alternate
> local/global bandwidth scheme.  Initial tests show very promising
> results!

Nice. Look forward to your patches.

Regards,
Bharata.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/