Message-ID: <xm26fsx9afrk.fsf@google.com>
Date: Tue, 22 Jun 2021 11:57:51 -0700
From: Benjamin Segall <bsegall@...gle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Huaixin Chang <changhuaixin@...ux.alibaba.com>,
luca.abeni@...tannapisa.it, anderson@...unc.edu, baruah@...tl.edu,
dietmar.eggemann@....com, dtcccc@...ux.alibaba.com,
juri.lelli@...hat.com, khlebnikov@...dex-team.ru,
linux-kernel@...r.kernel.org, mgorman@...e.de, mingo@...hat.com,
odin@...d.al, odin@...dal.com, pauld@...head.com, pjt@...gle.com,
rostedt@...dmis.org, shanpeic@...ux.alibaba.com, tj@...nel.org,
tommaso.cucinotta@...tannapisa.it, vincent.guittot@...aro.org,
xiyou.wangcong@...il.com
Subject: Re: [PATCH v6 1/3] sched/fair: Introduce the burstable CFS controller
Peter Zijlstra <peterz@...radead.org> writes:
> On Mon, Jun 21, 2021 at 05:27:58PM +0800, Huaixin Chang wrote:
>> The CFS bandwidth controller limits CPU requests of a task group to
>> quota during each period. However, parallel workloads might be bursty
>> so that they get throttled even when their average utilization is under
>> quota. And they are latency sensitive at the same time so that
>> throttling them is undesired.
>>
>> We borrow time now against our future underrun, at the cost of increased
>> interference against the other system users. All nicely bounded.
>>
>> Traditional (UP-EDF) bandwidth control is something like:
>>
>> (U = \Sum u_i) <= 1
>>
>> This guarantees both that every deadline is met and that the system is
>> stable. After all, if U were > 1, then for every second of walltime,
>> we'd have to run more than a second of program time, and obviously miss
>> our deadline, but the next deadline will be further out still, there is
>> never time to catch up, unbounded fail.
>>
>> This work observes that a workload doesn't always execute the full
>> quota; this enables one to describe u_i as a statistical distribution.
>>
>> For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
>> (the traditional WCET). This effectively allows u to be smaller,
>> increasing the efficiency (we can pack more tasks in the system), but at
>> the cost of missing deadlines when all the odds line up. However, it
>> does maintain stability, since every overrun must be paired with an
>> underrun as long as our x is above the average.
>>
>> That is, suppose we have 2 tasks, both specify a p(95) value, then we
>> have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
>> everything is good. At the same time we have a p(5)*p(5) = 0.25% chance
>> both tasks will exceed their quota at the same time (guaranteed deadline
>> fail). Somewhere in between there's a threshold where one exceeds and
>> the other doesn't underrun enough to compensate; this depends on the
>> specific CDFs.
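
The arithmetic in the paragraph above can be reproduced with a short sketch. It assumes the tasks' overruns are statistically independent, which the p(95)*p(95) product already implies; the function name `joint_quota_odds` is hypothetical and not part of the patch.

```python
from typing import Tuple


def joint_quota_odds(percentile: float, n_tasks: int) -> Tuple[float, float]:
    """Return (P(all tasks within quota), P(all tasks exceed quota)).

    Assumes each task independently stays within its quota with
    probability `percentile` (e.g. 0.95 for a p(95) quota).
    """
    if not 0.0 <= percentile <= 1.0:
        raise ValueError("percentile must be in [0, 1]")
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    p_all_within = percentile ** n_tasks
    p_all_exceed = (1.0 - percentile) ** n_tasks
    return p_all_within, p_all_exceed


# Two tasks, each with a p(95) quota, as in the example above.
all_within, all_exceed = joint_quota_odds(0.95, 2)
print(f"all within quota: {all_within:.2%}, all exceed quota: {all_exceed:.2%}")
```

The remaining probability mass (one task over, the other under) is where the outcome depends on the specific CDFs, so this sketch deliberately says nothing about it.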
>>
>> At the same time, we can say that the worst case deadline miss will be
>> \Sum e_i; that is, there is a bounded tardiness (under the assumption
>> that x+e is indeed WCET).
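
As a minimal sketch of the two bounds discussed above, the snippet below models each task as a hypothetical (x, e) pair in microseconds per period, checks the admission condition \Sum u_i <= 1 using the x values, and computes the worst-case tardiness \Sum e_i. The helper names, the sample numbers, and the single shared period are assumptions made for illustration only.

```python
from typing import Iterable, Tuple

Task = Tuple[int, int]  # (x, e): p(95) budget and extra headroom up to WCET, in usec


def is_admissible(tasks: Iterable[Task], period_us: int) -> bool:
    """Check (U = sum of x_i / period) <= 1 using integer arithmetic."""
    return sum(x for x, _e in tasks) <= period_us


def worst_case_tardiness_us(tasks: Iterable[Task]) -> int:
    """Upper bound on a deadline miss: the sum of all e_i (assumes x+e is WCET)."""
    return sum(e for _x, e in tasks)


# Hypothetical workload: two tasks sharing one CPU with a 100ms period.
tasks = [(40_000, 10_000), (50_000, 20_000)]
print(is_admissible(tasks, 100_000))   # admission test on the x values
print(worst_case_tardiness_us(tasks))  # bound on tardiness in usec
```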
>>
>> The benefit of burst is seen when testing with schbench. The default values of
>> kernel.sched_cfs_bandwidth_slice_us (5ms) and CONFIG_HZ (1000) are used.
>>
>> mkdir /sys/fs/cgroup/cpu/test
>> echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
>> echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
>> echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
>>
>> ./schbench -m 1 -t 3 -r 20 -c 80000 -R 10
>>
>> The average CPU usage is at 80%. I ran this 10 times, and got long tail
>> latency 6 times and got throttled 8 times.
>>
>> Tail latencies from one run are shown below, and this wasn't the worst case.
>>
>> Latency percentiles (usec)
>> 50.0000th: 19872
>> 75.0000th: 21344
>> 90.0000th: 22176
>> 95.0000th: 22496
>> *99.0000th: 22752
>> 99.5000th: 22752
>> 99.9000th: 22752
>> min=0, max=22727
>> rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%
>>
>> The interference when using burst is measured by the probability of
>> missing the deadline and the average WCET. Test results showed that when
>> there are many cgroups or the CPU is underutilized, the interference is
>> limited. More details are shown in:
>> https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/
>>
>> Co-developed-by: Shanpei Chen <shanpeic@...ux.alibaba.com>
>> Signed-off-by: Shanpei Chen <shanpeic@...ux.alibaba.com>
>> Co-developed-by: Tianchen Ding <dtcccc@...ux.alibaba.com>
>> Signed-off-by: Tianchen Ding <dtcccc@...ux.alibaba.com>
>> Signed-off-by: Huaixin Chang <changhuaixin@...ux.alibaba.com>
>> ---
>
> Ben, what say you? I'm tempted to pick up at least this first patch.

Yeah, I'm fine with it; I know internally we've thought about adding
something like this.

Reviewed-by: Ben Segall <bsegall@...gle.com>