[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <67876463-727c-3381-abec-196f9d8462c7@yandex-team.ru>
Date: Wed, 13 Nov 2019 11:33:14 +0300
From: Konstantin Khlebnikov <khlebnikov@...dex-team.ru>
To: Dave Chiluk <chiluk+linux@...eed.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Ben Segall <bsegall@...gle.com>,
Ingo Molnar <mingo@...hat.com>, Tejun Heo <tj@...nel.org>,
cgroups@...r.kernel.org, Cong Wang <xiyou.wangcong@...il.com>
Subject: Re: [PATCH RFC] sched/fair: add burst to cgroup cpu bandwidth
controller
On 13/11/2019 10.35, Dave Chiluk wrote:
> I was going to write this exact patch, but you beat me to it. I like
> that it's sequestered into a separate setting. This would be very
> useful for interactive applications that have cfs bandwidth control
> enabled *(cough* kubernetes). Are you planning on submitting the
> plumbing for this feature back into docker/containerd/lxc/other
> container orchestrators?
I haven't thought about that yet.
Burst size should have reasonable default, like quota for 4 periods.
So, per-container configuration might be redundant.
>
> Also this is very similar in desired behavior to a patchset submitted
> by Cong Wang https://lore.kernel.org/patchwork/patch/907450/
> .Although, I like this implementation better due to it's simplicity.
> I guess a lot of the simplification comes from not having to deal with
> the expiration logic.
>
> Additional comments inlined. Documentation edits, and my preference
> for a full burst bank on setting.
>
> Otherwise looks good to me.
> + Reviewed-by: Dave Chiluk <chiluk+linux@...eed.com>
>
> I also tested this with my simple test application I use for
> validating runtime accuracy https://github.com/indeedeng/fibtest. And
> everything looks reasonable here. I'll try to do a bit more testing
> but so far I'm happy. Hopefully Ben will give a +1 to this as well.
I've loaded it with fio, something like:
fio --name=test --ioengine=cpuio --time_based=1 --runtime=600 --cpuload=10 --cpuchunks=100000 --numjobs=10
Statistics in cpu.stat allows to see behaviour even without feedback from workload.
Precise testing isn't trivial though - per-cpu slices adds much noise.
>
> Thanks!
>
> On Thu, Nov 7, 2019 at 6:12 AM Konstantin Khlebnikov
> <khlebnikov@...dex-team.ru> wrote:
>>
>> Currently CFS bandwidth controller assigns cpu.cfs_quota_us runtime into
>> global pool every cpu.cfs_period_us. All unused runtime is expired.
>>
>> Since commit de53fd7aedb1 ("sched/fair: Fix low cpu usage with high
>> throttling by removing expiration of cpu-local slices") slice assigned
>> to cpu does not expire. This allows to serve tiny bursts (upto 1ms),
>> but this runtime pool is cpu-bound and not transferred between cpus.
>>
>> Setup for interactive workload with irregular cpu consumption have to set
>> quota according to relatively short spikes of cpu usage. This eliminates
>> possibility of control for average cpu usage. Increasing period and quota
>> proportionally for getting bigger runtime chunks is not an option because
>> if even bigger spike deplete global pool then execution will stuck until
>> end of period and next refill.
>>
>> This patch adds limited accumulation of unused runtime from past periods.
>> Accumulated runtime does not expire. It stays in global pool and could
>> be used by any cpu. Average cpu usage stays limited with quota / period,
>> but spiky workload could use more cpu power for a short period of time.
>>
>> Size of pool for burst runtime is set in attribute cpu.cfs_burst_us.
>> Default is 0, which reflects current behavior.
>>
>> Statistics for used bust runtime is shown in cpu.stat as "burst_time".
>>
>> Example setup:
>> cpu.cfs_period_us = 100ms
>> cpu.cfs_quota_us = 200ms
>> cpu.cfs_burst_us = 300ms
>>
>> Average cpu usage stays limited with 2 cpus (quota / period), but cgroup
>> could accumulate runtime (burst) and for 100ms could utilize up to 5 cpus
>> (quota / period + burst / 100ms), or 3 cpus for 300ms, an so on.
>>
>> Implementation is simple. All logic is in __refill_cfs_bandwidth_runtime().
>> The rest changes are interface for cgroup and cgroup2.
>>
>> For cgroup2 burst is set as third number in attribute cpu.max:
>> cpu.max = $QUOTA $PERIOD $BURST
>>
>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@...dex-team.ru>
>> ---
>> Documentation/admin-guide/cgroup-v2.rst | 15 +++--
>> Documentation/scheduler/sched-bwc.rst | 8 ++-
>> kernel/sched/core.c | 88 +++++++++++++++++++++++++------
>> kernel/sched/fair.c | 34 ++++++++++--
>> kernel/sched/sched.h | 4 +
>> 5 files changed, 119 insertions(+), 30 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
>> index 5361ebec3361..8c3cc3d882ba 100644
>> --- a/Documentation/admin-guide/cgroup-v2.rst
>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>> @@ -981,11 +981,12 @@ All time durations are in microseconds.
>> - user_usec
>> - system_usec
>>
>> - and the following three when the controller is enabled:
>> + and the following four when the controller is enabled:
>>
>> - nr_periods
>> - nr_throttled
>> - throttled_usec
>> + - burst_usec
>>
>> cpu.weight
>> A read-write single value file which exists on non-root
>> @@ -1006,16 +1007,18 @@ All time durations are in microseconds.
>> the closest approximation of the current weight.
>>
>> cpu.max
>> - A read-write two value file which exists on non-root cgroups.
>> - The default is "max 100000".
>> + A read-write 1..3 values file which exists on non-root cgroups.
>> + The default is "max 100000 0".
>>
>> The maximum bandwidth limit. It's in the following format::
>>
>> - $MAX $PERIOD
>> + $MAX $PERIOD $BURST
>>
>> which indicates that the group may consume upto $MAX in each
>> - $PERIOD duration. "max" for $MAX indicates no limit. If only
>> - one number is written, $MAX is updated.
>> + $PERIOD duration and accumulates upto $BURST time for bursts.
>> +
>> + "max" for $MAX indicates no limit.
>> + If only one number is written, $MAX is updated.
>>
>> cpu.pressure
>> A read-only nested-key file which exists on non-root cgroups.
>> diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
>> index 9801d6b284b1..5f1cac7b2d44 100644
>> --- a/Documentation/scheduler/sched-bwc.rst
>> +++ b/Documentation/scheduler/sched-bwc.rst
>> @@ -27,12 +27,14 @@ Quota and period are managed within the cpu subsystem via cgroupfs.
>>
>> cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
>> cpu.cfs_period_us: the length of a period (in microseconds)
>> +cpu.cfs_burst_us: the maxumum size of burst run-time pool (in microseconds)
>
> maxumum -> maximum
>
>> cpu.stat: exports throttling statistics [explained further below]
>>
>> The default values are::
>>
>> cpu.cfs_period_us=100ms
>> - cpu.cfs_quota=-1
>> + cpu.cfs_quota_us=-1
>> + cpu.cfs_burst_us=0
>>
>> A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
>> bandwidth restriction in place, such a group is described as an unconstrained
>> @@ -51,6 +53,9 @@ and return the group to an unconstrained state once more.
>> Any updates to a group's bandwidth specification will result in it becoming
>> unthrottled if it is in a constrained state.
>>
>> +Writing positive value into cpu.cfs_burst_us allows to allcumulate unused
>> +run-time (upto this value) and use it later in addition to assigned quota.
> Writing positive value into cpu.cfs_burst_us allows unused quota to
> accumulate up to this value and be used later in addition to assigned
> quota.
>
>> +
>> System wide settings
>> --------------------
>> For efficiency run-time is transferred between the global pool and CPU local
>> @@ -75,6 +80,7 @@ cpu.stat:
>> - nr_throttled: Number of times the group has been throttled/limited.
>> - throttled_time: The total time duration (in nanoseconds) for which entities
>> of the group have been throttled.
>> +- burst_time: The total running time consumed from burst pool.
>>
>> This interface is read-only.
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index dd05a378631a..010b79c45200 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7354,7 +7354,8 @@ static const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
>>
>> static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
>>
>> -static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>> +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
>> + u64 burst)
>> {
>> int i, ret = 0, runtime_enabled, runtime_was_enabled;
>> struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
>> @@ -7399,12 +7400,13 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>> raw_spin_lock_irq(&cfs_b->lock);
>> cfs_b->period = ns_to_ktime(period);
>> cfs_b->quota = quota;
>> -
>> - __refill_cfs_bandwidth_runtime(cfs_b);
>> + cfs_b->burst = burst;
>
> + cfs_b->burst_runtime = burst;
> I think we should default to a full burst bank. After all, during
> startup applications very likely to need additional CPU. This is a
> regular pain point for our applications.. By providing burst bank
> early we help to mitigate slow start-up times for cpu limited
> applications.
Makes sense.
>
>>
>> /* Restart the period timer (if active) to handle new period expiry: */
>> - if (runtime_enabled)
>> + if (runtime_enabled) {
>> + __refill_cfs_bandwidth_runtime(cfs_b);
>> start_cfs_bandwidth(cfs_b);
>> + }
>>
>> raw_spin_unlock_irq(&cfs_b->lock);
>>
>> @@ -7432,9 +7434,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>>
>> static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
>> {
>> - u64 quota, period;
>> + u64 quota, period, burst;
>>
>> period = ktime_to_ns(tg->cfs_bandwidth.period);
>> + burst = tg->cfs_bandwidth.burst;
>> if (cfs_quota_us < 0)
>> quota = RUNTIME_INF;
>> else if ((u64)cfs_quota_us <= U64_MAX / NSEC_PER_USEC)
>> @@ -7442,7 +7445,7 @@ static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
>> else
>> return -EINVAL;
>>
>> - return tg_set_cfs_bandwidth(tg, period, quota);
>> + return tg_set_cfs_bandwidth(tg, period, quota, burst);
>> }
>>
>> static long tg_get_cfs_quota(struct task_group *tg)
>> @@ -7460,15 +7463,16 @@ static long tg_get_cfs_quota(struct task_group *tg)
>>
>> static int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
>> {
>> - u64 quota, period;
>> + u64 quota, period, burst;
>>
>> if ((u64)cfs_period_us > U64_MAX / NSEC_PER_USEC)
>> return -EINVAL;
>>
>> period = (u64)cfs_period_us * NSEC_PER_USEC;
>> quota = tg->cfs_bandwidth.quota;
>> + burst = tg->cfs_bandwidth.burst;
>>
>> - return tg_set_cfs_bandwidth(tg, period, quota);
>> + return tg_set_cfs_bandwidth(tg, period, quota, burst);
>> }
>>
>> static long tg_get_cfs_period(struct task_group *tg)
>> @@ -7481,6 +7485,28 @@ static long tg_get_cfs_period(struct task_group *tg)
>> return cfs_period_us;
>> }
>>
>> +static long tg_get_cfs_burst(struct task_group *tg)
>> +{
>> + u64 cfs_burst_us = tg->cfs_bandwidth.burst;
>> +
>> + do_div(cfs_burst_us, NSEC_PER_USEC);
>> + return cfs_burst_us;
>> +}
>> +
>> +static int tg_set_cfs_burst(struct task_group *tg, long cfs_burst_us)
>> +{
>> + u64 quota, period, burst;
>> +
>> + if ((u64)cfs_burst_us > U64_MAX / NSEC_PER_USEC)
>> + return -EINVAL;
>> +
>> + period = ktime_to_ns(tg->cfs_bandwidth.period);
>> + quota = tg->cfs_bandwidth.quota;
>> + burst = (u64)cfs_burst_us * NSEC_PER_USEC;
>> +
>> + return tg_set_cfs_bandwidth(tg, period, quota, burst);
>> +}
>> +
>> static s64 cpu_cfs_quota_read_s64(struct cgroup_subsys_state *css,
>> struct cftype *cft)
>> {
>> @@ -7505,6 +7531,18 @@ static int cpu_cfs_period_write_u64(struct cgroup_subsys_state *css,
>> return tg_set_cfs_period(css_tg(css), cfs_period_us);
>> }
>>
>> +static u64 cpu_cfs_burst_read_u64(struct cgroup_subsys_state *css,
>> + struct cftype *cft)
>> +{
>> + return tg_get_cfs_burst(css_tg(css));
>> +}
>> +
>> +static int cpu_cfs_burst_write_u64(struct cgroup_subsys_state *css,
>> + struct cftype *cftype, u64 cfs_burst_us)
>> +{
>> + return tg_set_cfs_burst(css_tg(css), cfs_burst_us);
>> +}
>> +
>> struct cfs_schedulable_data {
>> struct task_group *tg;
>> u64 period, quota;
>> @@ -7596,6 +7634,7 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
>> seq_printf(sf, "nr_periods %d\n", cfs_b->nr_periods);
>> seq_printf(sf, "nr_throttled %d\n", cfs_b->nr_throttled);
>> seq_printf(sf, "throttled_time %llu\n", cfs_b->throttled_time);
>> + seq_printf(sf, "burst_time %llu\n", cfs_b->burst_time);
>>
>> if (schedstat_enabled() && tg != &root_task_group) {
>> u64 ws = 0;
>> @@ -7657,6 +7696,11 @@ static struct cftype cpu_legacy_files[] = {
>> .read_u64 = cpu_cfs_period_read_u64,
>> .write_u64 = cpu_cfs_period_write_u64,
>> },
>> + {
>> + .name = "cfs_burst_us",
>> + .read_u64 = cpu_cfs_burst_read_u64,
>> + .write_u64 = cpu_cfs_burst_write_u64,
>> + },
>> {
>> .name = "stat",
>> .seq_show = cpu_cfs_stat_show,
>> @@ -7699,15 +7743,20 @@ static int cpu_extra_stat_show(struct seq_file *sf,
>> struct task_group *tg = css_tg(css);
>> struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
>> u64 throttled_usec;
>> + u64 burst_usec;
>>
>> throttled_usec = cfs_b->throttled_time;
>> do_div(throttled_usec, NSEC_PER_USEC);
>>
>> + burst_usec = cfs_b->burst_time;
>> + do_div(burst_usec, NSEC_PER_USEC);
>> +
>> seq_printf(sf, "nr_periods %d\n"
>> "nr_throttled %d\n"
>> - "throttled_usec %llu\n",
>> + "throttled_usec %llu\n"
>> + "burst_usec %llu\n",
>> cfs_b->nr_periods, cfs_b->nr_throttled,
>> - throttled_usec);
>> + throttled_usec, burst_usec);
>> }
>> #endif
>> return 0;
>> @@ -7777,26 +7826,29 @@ static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css,
>> #endif
>>
>> static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
>> - long period, long quota)
>> + long period, long quota,
>> + long burst)
>> {
>> if (quota < 0)
>> seq_puts(sf, "max");
>> else
>> seq_printf(sf, "%ld", quota);
>>
>> - seq_printf(sf, " %ld\n", period);
>> + seq_printf(sf, " %ld %ld\n", period, burst);
>> }
>>
>> /* caller should put the current value in *@...iodp before calling */
>> static int __maybe_unused cpu_period_quota_parse(char *buf,
>> - u64 *periodp, u64 *quotap)
>> + u64 *periodp, u64 *quotap,
>> + s64 *burstp)
>> {
>> char tok[21]; /* U64_MAX */
>>
>> - if (sscanf(buf, "%20s %llu", tok, periodp) < 1)
>> + if (sscanf(buf, "%20s %llu %llu", tok, periodp, burstp) < 1)
>> return -EINVAL;
>>
>> *periodp *= NSEC_PER_USEC;
>> + *burstp *= NSEC_PER_USEC;
>>
>> if (sscanf(tok, "%llu", quotap))
>> *quotap *= NSEC_PER_USEC;
>> @@ -7813,7 +7865,8 @@ static int cpu_max_show(struct seq_file *sf, void *v)
>> {
>> struct task_group *tg = css_tg(seq_css(sf));
>>
>> - cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
>> + cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg),
>> + tg_get_cfs_burst(tg));
>> return 0;
>> }
>>
>> @@ -7822,12 +7875,13 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
>> {
>> struct task_group *tg = css_tg(of_css(of));
>> u64 period = tg_get_cfs_period(tg);
>> + s64 burst = tg_get_cfs_burst(tg);
>> u64 quota;
>> int ret;
>>
>> - ret = cpu_period_quota_parse(buf, &period, "a);
>> + ret = cpu_period_quota_parse(buf, &period, "a, &burst);
>> if (!ret)
>> - ret = tg_set_cfs_bandwidth(tg, period, quota);
>> + ret = tg_set_cfs_bandwidth(tg, period, quota, burst);
>> return ret ?: nbytes;
>> }
>> #endif
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 682a754ea3e1..26b8dac67c79 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4353,16 +4353,26 @@ static inline u64 sched_cfs_bandwidth_slice(void)
>> }
>>
>> /*
>> - * Replenish runtime according to assigned quota. We use sched_clock_cpu
>> - * directly instead of rq->clock to avoid adding additional synchronization
>> - * around rq->lock.
>> + * Replenish runtime according to assigned quota.
>> + * Called only if quota != RUNTIME_INF.
>> *
>> * requires cfs_b->lock
>> */
>> void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>> {
>> - if (cfs_b->quota != RUNTIME_INF)
>> - cfs_b->runtime = cfs_b->quota;
>> + u64 runtime = cfs_b->runtime;
>> +
>> + /*
>> + * Preserve past runtime up to burst size. If remaining runtime lower
>> + * than previous burst runtime then account delta as used burst time.
>> + */
>> + if (runtime > cfs_b->burst)
>> + runtime = cfs_b->burst;
>> + else if (runtime < cfs_b->burst_runtime)
>> + cfs_b->burst_time += cfs_b->burst_runtime - runtime;
>> +
>> + cfs_b->burst_runtime = runtime;
>> + cfs_b->runtime = runtime + cfs_b->quota;
>> }
>>
>> static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
>> @@ -4968,6 +4978,9 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> cfs_b->runtime = 0;
>> cfs_b->quota = RUNTIME_INF;
>> cfs_b->period = ns_to_ktime(default_cfs_period());
>> + cfs_b->burst = 0;
>> + cfs_b->burst_runtime = 0;
>> + cfs_b->burst_time = 0;
>>
>> INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
>> hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
>> @@ -4986,14 +4999,23 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>
>> void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> {
>> + u64 overrun;
>> +
>> lockdep_assert_held(&cfs_b->lock);
>>
>> if (cfs_b->period_active)
>> return;
>>
>> cfs_b->period_active = 1;
>> - hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>> + overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>> hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>> +
>> + /*
>> + * Refill runtime for periods of inactivity and current.
>> + * __refill_cfs_bandwidth_runtime() will cut excess.
>> + */
>> + cfs_b->runtime += cfs_b->quota * overrun;
>> + __refill_cfs_bandwidth_runtime(cfs_b);
>> }
>>
>> static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 0db2c1b3361e..eaecd7298f80 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -344,10 +344,14 @@ struct cfs_bandwidth {
>> struct hrtimer slack_timer;
>> struct list_head throttled_cfs_rq;
>>
>> + u64 burst;
>> + u64 burst_runtime;
>> +
>> /* Statistics: */
>> int nr_periods;
>> int nr_throttled;
>> u64 throttled_time;
>> + u64 burst_time;
>> #endif
>> };
>>
>>
Powered by blists - more mailing lists