Message-ID: <1298637080.2428.1751.camel@twins>
Date: Fri, 25 Feb 2011 13:31:20 +0100
From: Peter Zijlstra <a.p.zijlstra@...llo.nl>
To: Paul Turner <pjt@...gle.com>
Cc: linux-kernel@...r.kernel.org,
Bharata B Rao <bharata@...ux.vnet.ibm.com>,
Dhaval Giani <dhaval@...ux.vnet.ibm.com>,
Balbir Singh <balbir@...ux.vnet.ibm.com>,
Vaidyanathan Srinivasan <svaidy@...ux.vnet.ibm.com>,
Gautham R Shenoy <ego@...ibm.com>,
Srivatsa Vaddagiri <vatsa@...ibm.com>,
Kamalesh Babulal <kamalesh@...ux.vnet.ibm.com>,
Ingo Molnar <mingo@...e.hu>,
Pavel Emelyanov <xemul@...nvz.org>,
Herbert Poetzl <herbert@...hfloor.at>,
Avi Kivity <avi@...hat.com>,
Chris Friesen <cfriesen@...tel.com>,
Nikhil Rao <ncrao@...gle.com>
Subject: Re: [CFS Bandwidth Control v4 2/7] sched: accumulate per-cfs_rq cpu usage
On Thu, 2011-02-24 at 19:33 -0800, Paul Turner wrote:
> >> #ifdef CONFIG_CFS_BANDWIDTH
> >> +static u64 tg_request_cfs_quota(struct task_group *tg)
> >> +{
> >> +	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> >> +	u64 delta = 0;
> >> +
> >> +	if (cfs_b->runtime > 0 || cfs_b->quota == RUNTIME_INF) {
> >> +		raw_spin_lock(&cfs_b->lock);
> >> +		/*
> >> +		 * it's possible a bandwidth update has changed the global
> >> +		 * pool.
> >> +		 */
> >> +		if (cfs_b->quota == RUNTIME_INF)
> >> +			delta = sched_cfs_bandwidth_slice();
> >
> > Why do we bother at all when there's infinite time? Shouldn't the action
> > that sets it to infinite also set cfs_rq->quota_assigned to
> > RUNTIME_INF, in which case the below check will make it all go away?
> But more importantly we don't want to decrement the value doled out
> FROM cfs_b->runtime since that would change it from the magic
> RUNTIME_INF. That's why the check exists.
Ah, quite so.
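(For reference, RUNTIME_INF is the all-ones u64 sentinel, so decrementing it
would silently turn "infinite" into a large-but-finite budget. A tiny
standalone illustration, with the slice value assumed rather than taken from
the patch:)

/* Standalone illustration (not kernel code) of why the RUNTIME_INF
 * check has to come before any decrement of the global pool. */
#include <stdio.h>
#include <stdint.h>

#define RUNTIME_INF ((uint64_t)~0ULL)	/* "no limit" sentinel */

int main(void)
{
	uint64_t runtime = RUNTIME_INF;
	uint64_t slice   = 5000000ULL;	/* assume a 5ms slice, in ns */

	/* Decrementing anyway would quietly turn "infinite" into a huge
	 * but finite budget that would eventually run out: */
	printf("broken pool: %llu\n", (unsigned long long)(runtime - slice));

	/* Hence: hand out a slice but leave the pool untouched. */
	uint64_t delta = (runtime == RUNTIME_INF) ? slice :
			 (runtime < slice ? runtime : slice);
	printf("granted: %llu, pool still infinite: %d\n",
	       (unsigned long long)delta, runtime == RUNTIME_INF);
	return 0;
}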
> >> +		else {
> >> +			delta = min(cfs_b->runtime,
> >> +				    sched_cfs_bandwidth_slice());
> >> +			cfs_b->runtime -= delta;
> >> +		}
> >> +		raw_spin_unlock(&cfs_b->lock);
> >> +	}
> >> +	return delta;
> >> +}
> >
> > Also, shouldn't this all try and steal time from other cpus when the
> > global limit ran out? Suppose you have say 24 cpus, and you had a short
> > burst where all 24 cpus had some runtime, so you distribute 240ms. But
> > 23 of those cpus only ran for 0.5ms, leaving 23.5ms of unused time on 23
> > cpus while your one active cpu will then throttle.
> >
>
> In practice this only affects the first period since that slightly
> stale bandwidth is then available on those other 23 cpus the next time
> a micro-burst occurs. In testing this has resulted in very stable
> performance and "smooth" perturbations that resemble hardcapping by
> affinity (for integer points).
>
> The notion of stealing could certainly be introduced, the juncture of
> reaching the zero point here would be a possible place to consider
> that (although we would need to do a steal that avoids the asymptotic
> convergence problem of RT).
>
> I think returning (most) of the bandwidth to the global pool on
> (voluntary) dequeue is a more scalable solution.
>
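As a rough illustration of that return-on-dequeue idea (a toy userspace
model that ignores the staleness/generation problem discussed below; the
struct and helper names are made up, not from the patch):

/* Toy model: return most of the locally cached quota to the global
 * pool on voluntary dequeue, keeping only a small reserve. */
#include <stdio.h>
#include <stdint.h>

struct pool  { uint64_t runtime; };		/* global cfs_b->runtime */
struct local { uint64_t assigned, used; };	/* per-cpu cached quota  */

#define MIN_KEEP 1000000ULL			/* keep 1ms locally (assumed) */

static void return_quota_on_dequeue(struct pool *p, struct local *l)
{
	uint64_t left = l->assigned - l->used;

	if (left > MIN_KEEP) {
		p->runtime  += left - MIN_KEEP;	/* hand the surplus back */
		l->assigned -= left - MIN_KEEP;
	}
}

int main(void)
{
	struct pool  p = { .runtime = 0 };	/* global pool exhausted */
	struct local l = { .assigned = 10000000ULL, .used = 500000ULL };

	return_quota_on_dequeue(&p, &l);
	printf("pool regained %llu ns, cpu keeps %llu ns\n",
	       (unsigned long long)p.runtime,
	       (unsigned long long)(l.assigned - l.used));
	return 0;
}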
> > I would much rather see all the accounting tight first and optimize
> > later than start with leaky stuff and try and plug holes later.
> >
>
> The complexity this introduces is non-trivial since the idea of
> returning quota to the pool means you need to introduce the notion of
> when that quota came to life (otherwise you get leaks in the reverse
> direction!) -- as well as reversing some of the lock ordering.
Right, nasty that, RT doesn't suffer this because of the lack of
over-commit.
> While generations do this, they don't greatly increase the efficacy, and
> I think there is value in performing the detailed review we are doing
> now in isolation of that.
>
> It's also still consistent regarding leakage, in that in any N
> consecutive periods the maximum additional quota (by a user abusing
> this) that can be received is N+1. Does the complexity trade-off
> merit improving this bound at this point?
Well, something, yes; with N being potentially very large indeed these
days we need some feedback.
One idea would be to keep a cpu mask in the bandwidth thing, set the cpu's
bit when it claims bandwidth from the global pool, and iterate over and
clear the complete mask on tick.
That also limits the scope on where to look for stealing time.
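Roughly along these lines, as a toy userspace model (a real patch would use
the kernel cpumask API; the field and function names here are invented):

/* Toy illustration of the cpumask idea: remember which cpus pulled
 * quota this period, clear the mask when the period timer fires. */
#include <stdio.h>
#include <stdint.h>

struct bandwidth {
	uint64_t runtime;		/* global pool */
	uint64_t claimed_mask;		/* bit per cpu that took a slice */
};

static uint64_t claim_slice(struct bandwidth *b, int cpu, uint64_t slice)
{
	uint64_t delta = b->runtime < slice ? b->runtime : slice;

	b->runtime      -= delta;
	b->claimed_mask |= 1ULL << cpu;	/* candidates to steal back from */
	return delta;
}

static void period_tick(struct bandwidth *b, uint64_t quota)
{
	b->runtime      = quota;	/* refill the global pool */
	b->claimed_mask = 0;		/* forget last period's claims */
}

int main(void)
{
	struct bandwidth b = { .runtime = 240000000ULL };	/* 240ms */

	claim_slice(&b, 3, 10000000ULL);
	claim_slice(&b, 17, 10000000ULL);
	printf("claimed cpus: %#llx, pool left: %llu\n",
	       (unsigned long long)b.claimed_mask,
	       (unsigned long long)b.runtime);

	period_tick(&b, 240000000ULL);
	printf("after tick, mask: %#llx\n",
	       (unsigned long long)b.claimed_mask);
	return 0;
}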
> >> +static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
> >> +				     unsigned long delta_exec)
> >> +{
> >> +	if (cfs_rq->quota_assigned == RUNTIME_INF)
> >> +		return;
> >> +
> >> +	cfs_rq->quota_used += delta_exec;
> >> +
> >> +	if (cfs_rq->quota_used < cfs_rq->quota_assigned)
> >> +		return;
> >> +
> >> +	cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
> >> +}
> >
> > So why isn't this hierarchical?
>
> It is naturally (since charging occurs within the existing hierarchical
> accounting).
D'oh yes.. somehow I totally missed that.
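For readers skimming the thread: the charging is hierarchical for free
because update_curr() (and hence the accounting hook) already runs for every
cfs_rq the running task sits under, so each level is charged the same
delta_exec. A toy model of that effect (simplified; names invented):

/* Toy model of why no explicit hierarchy walk is needed in the hook:
 * the existing per-level accounting already charges each level. */
#include <stdio.h>
#include <stdint.h>

struct group {
	const char   *name;
	uint64_t      quota_used;
	struct group *parent;
};

/* Stand-in for the existing per-level update_curr() calls that each
 * invoke account_cfs_rq_quota() for their own cfs_rq. */
static void charge_exec_time(struct group *g, uint64_t delta_exec)
{
	for (; g; g = g->parent) {
		g->quota_used += delta_exec;
		printf("charged %llu ns to %s (used=%llu)\n",
		       (unsigned long long)delta_exec, g->name,
		       (unsigned long long)g->quota_used);
	}
}

int main(void)
{
	struct group root  = { "root",       0, NULL  };
	struct group child = { "root/child", 0, &root };

	charge_exec_time(&child, 1000000ULL);	/* 1ms of runtime */
	return 0;
}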