linux-kernel - Re: [RFC v5 PATCH 0/8] CFS Hard limits

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ed628a921002010304p40397558lf45f53daca97e879@mail.gmail.com>
Date:	Mon, 1 Feb 2010 03:04:00 -0800
From:	Paul Turner <pjt@...gle.com>
To:	bharata@...ux.vnet.ibm.com
Cc:	Bharata B Rao <bharata.rao@...il.com>,
	linux-kernel@...r.kernel.org,
	Dhaval Giani <dhaval.giani@...il.com>,
	Balbir Singh <balbir@...ux.vnet.ibm.com>,
	Vaidyanathan Srinivasan <svaidy@...ux.vnet.ibm.com>,
	Gautham R Shenoy <ego@...ibm.com>,
	Srivatsa Vaddagiri <vatsa@...ibm.com>,
	Kamalesh Babulal <kamalesh@...ux.vnet.ibm.com>,
	Ingo Molnar <mingo@...e.hu>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Pavel Emelyanov <xemul@...nvz.org>,
	Herbert Poetzl <herbert@...hfloor.at>,
	Avi Kivity <avi@...hat.com>,
	Chris Friesen <cfriesen@...tel.com>,
	Paul Menage <menage@...gle.com>,
	Mike Waychison <mikew@...gle.com>
Subject: Re: [RFC v5 PATCH 0/8] CFS Hard limits - v5

On Mon, Feb 1, 2010 at 12:21 AM, Bharata B Rao
<bharata@...ux.vnet.ibm.com> wrote:
> On Thu, Jan 28, 2010 at 08:26:08PM -0800, Paul Turner wrote:
>> On Thu, Jan 28, 2010 at 7:49 PM, Bharata B Rao <bharata.rao@...il.com> wrote:
>> > On Sat, Jan 9, 2010 at 2:15 AM, Paul Turner <pjt@...gle.com> wrote:
>> >>
>> >> What are your thoughts on using a separate mechanism for the general case.  A
>> >> draft proposal follows:
>> >>
>> >> - Maintain a global run-time pool for each tg.  The runtime specified by the
>> >>  user represents the value that this pool will be refilled to each period.
>> >> - We continue to maintain the local notion of runtime/period in each cfs_rq,
>> >>  continue to accumulate locally here.
>> >>
>> >> Upon locally exceeding the period acquire new credit from the global pool
>> >> (either under lock or more likely using atomic ops).  This can either be in
>> >> fixed steppings (e.g. 10ms, could be tunable) or following some quasi-curve
>> >> variant with historical demand.
>> >>
>> >> One caveat here is that there is some over-commit in the system, the local
>> >> differences of runtime vs period represent additional over the global pool.
>> >> However it should not be possible to consistently exceed limits since the rate
>> >> of refill is gated by the runtime being input into the system via the per-tg
>> >> pool.
>> >>
>> >
>> > We borrow from what is actually available as spare (spare = unused or
>> > remaining). With global pool, I see that would be difficult.
>> > Inability/difficulty in keeping the global pool in sync with the
>> > actual available spare time is the reason for over-commit ?
>> >
>>
>> We maintain two pools, a global pool (new) and a per-cfs_rq pool
>> (similar to existing rt_bw).
>>
>> When consuming time you charge vs your local bandwidth until it is
>> expired, at this point you must either refill from the global pool, or
>> throttle.
>>
>> The "slack" in the system is the sum of unconsumed time in local pools
>> from the *previous* global pool refill.  This is bounded above by the
>> size of time you refill a local pool at each expiry.  We call the size
>> of refill a 'slice'.
>>
>> e.g.
>>
>> Task limit of 50ms, slice=10ms, 4cpus, period of 500ms
>>
>> Task A runs on cpus 0 and 1 for 5ms each, then blocks.
>>
>> When A first executes on each cpu we take slice=10ms from the global
>> pool of 50ms and apply it to the local rq.  Execution then proceeds vs
>> local pool.
>>
>> Current state is: 5 ms in local pools on {0,1}, 30ms remaining in global pool
>>
>> Upon period expiration we issue a global pool refill.  At this point we have:
>> 5 ms in local pools on {0,1}, 50ms remaining in global pool.
>>
>> That 10ms of slack time is over-commit in the system.  However it
>> should be clear that this can only be a local effect since over any
>> period of time the rate of input into the system is limited by global
>> pool refill rate.
>
> With the same setup as above consider 5 such tasks which block after
> consuming 5ms each. So now we have 25ms slack time. In the next bandwidth
> period if 5 cpu hogs start running and they would consume this 25ms and the
> 50ms from this period. So we gave 50% extra to a group in a bandwidth period.
> Just wondering how common such scenarious could be.
>

Yes within a single given period you may exceed your reservation due
to slack.  However, of note is that across any 2 successive periods
you are guaranteed to be within your reservation, i.e. 2*usage <=
2*period, as slack available means that you under-consumed your
previous period.

For those needing a hard guarantee (independent of amelioration
strategies) halving the period provided would then provide this across
their target period with the basic v1 implementation.

>>
>> There are also some strategies that we are exploring to improve
>> behavior further here.  One idea is that if we maintain a generation
>> counter then on voluntary dequeue (e.g. tasks block) we can return
>> local time to the global period pool or expire it (if generations
>> don't match), this greatly reduces the noise (via slack vs ideal
>> limit) that a busty application can induce.
>
> Why not clear the remaining runtime during bandwidth refresh ?
>

This doesn't scale with (many cpus) * (many bandwidth limits).

It is still my intuition that introducing a generation counter for
quota would allow us to handle this gracefully as that would allow
both invalidation of expired slack time and for the (potentially
fractional) returning of local quota to the global pool within a
period (upon block/voluntary yield).

Such tactics would probably seem better suited for a v2 as there are
both other possible approaches and the matter of added complexity to
the basic design.  However, if it works well it might sneak into v1 ;)

>>
>> >> This would also naturally associate with an interface change that would mean the
>> >> runtime limit for a group would be the effective cpurate within the period.
>> >>
>> >> e.g. by setting a runtime of 200000us on a 100000us period it would effectively
>> >> allow you to use 2 cpus worth of wall-time on a multicore system.
>> >>
>> >> I feel this is slightly more natural than the current definition which due to
>> >> being local means that values set will not result in consistent behavior across
>> >> machines of different core counts.  It also has the benefit of being consistent
>> >> with observed exports of time consumed, e.g. rusage, (indirectly) time, etc.
>> >
>> > Though runtimes are enforced locally per-cpu, that's only the
>> > implementation. The definition of runtime and period is still
>> > system-wide/global. A runtime/period=0.25/0.5 will mean 0.25s of
>> > system wide runtime within a period of 0.5s. Talking about consistent
>> > definition, I would say this consistently defines half of system wide
>> > wall-time on all configurations :)
>>
>> This feels non-intuitive suppose you have a non-homogeneous fleet of
>> systems.  It is also difficult to express limits in terms of cores,
>> suppose I'm an admin trying to jail my users (maybe I rent out virtual
>> time ala EC2 for example).  The fractions I have to use to represent
>> integer core amounts are going to become quite small on large systems.
>>  For example, 1 core on a 64 core system would mean 3.905ms/250ms
>> period.  What's the dependency here between your time and the current
>> cpuset topology also, if I'm only allowed on half the system does this
>> fraction then refer to global resources or what I'm constrained to?
>> This seems a difficult data dependency to manage.
>>
>> My (personal) ideology is that we are metering at the cpu level as
>> opposed to the system level -- which means N seconds of cpu-time makes
>> more sense to me.  I feel it has advantages in that it can be
>> specified more directly relative to the period and is independent of
>> system partitioning.
>>
>> I'd be interested to hear other opinions on this.
>
> We need a consensus here, will wait to see what others think about this.
>
>>
>> > If it means 2 CPUs worth wall-time
>> > in 4 core machine, it would mean 4 CPUs on a 8 CPU machine.  At this
>> > point, I am inclined to go with this and let the admins/tools work out
>> > the actual CPUs part of it. However I would like to hear what others
>> > think about this interface.
>> >
>> >>
>> >> For future scalability as machine size grows this could potentially be
>> >> partitioned below the tg level along the boundaries of sched_domains (or
>> >> something similar).  However for an initial draft given current machine sizes
>> >> the contention on the global pool should hopefully be fairly low.
>> >
>> > One of the alternatives I have in mind is to be more aggressive while
>> > borrowing. While keeping the current algorithm (of iterating thro' all
>> > CPUs when borrowing) intact, we could potentially borrow more from
>> > those CPUs which don't have any running task from the given group. I
>> > just experimented with borrowing half of the available runtime from
>> > such CPUs and found that number of iterations are greatly reduced and
>> > the source runtime quickly converges to its max possible value. Do you
>> > see any issues with this ?
>> >
>>
>> I strongly believe that this is going to induce significant lock
>> contention and is not a scalable solution over time.  While using a
>> faster converging series for time may help I think there are
>> confounding factors that limit effect here.  Consider the 1 core on a
>> 64 core system example above.  With only 3.905ms per pool we are going
>> to quickly hit the case where we are borrowing not-useful periods of
>> time while thrashing locks.
>>
>> We are in the midst of an implementation for proposal above which
>> we'll have ready post to here for consideration next week.  We have
>> maintained your existing approach with respect to handling throttled
>> entities and layered on top of that the proposed alternate
>> local/global bandwidth scheme.  Initial tests show very promising
>> results!
>
> Nice. Look forward to your patches.
>
> Regards,
> Bharata.
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/