Message-ID: <CAFpoUr2mNO87XFAyHF=HA3f6KC8EkuGrwQQe54q4kmF1WgfG7w@mail.gmail.com>
Date:   Thu, 20 May 2021 16:00:29 +0200
From:   Odin Ugedal <odin@...d.al>
To:     Huaixin Chang <changhuaixin@...ux.alibaba.com>
Cc:     Benjamin Segall <bsegall@...gle.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        dtcccc@...ux.alibaba.com, Juri Lelli <juri.lelli@...hat.com>,
        khlebnikov@...dex-team.ru,
        open list <linux-kernel@...r.kernel.org>,
        Mel Gorman <mgorman@...e.de>, Ingo Molnar <mingo@...hat.com>,
        Odin Ugedal <odin@...d.al>, pauld@...head.com,
        Peter Zijlstra <peterz@...radead.org>,
        Paul Turner <pjt@...gle.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        shanpeic@...ux.alibaba.com, Tejun Heo <tj@...nel.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        xiyou.wangcong@...il.com
Subject: Re: [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller

Hi,

Here are some more thoughts and questions:

> The benefit of burst is seen when testing with schbench:
>
>         echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
>         echo 600000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
>         echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
>         echo 400000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
>
>         # The average CPU usage is around 500%, which is 200ms CPU time
>         # every 40ms.
>         ./schbench -m 1 -t 30 -r 10 -c 10000 -R 500
>
>         Without burst:
>
>         Latency percentiles (usec)
>         50.0000th: 7
>         75.0000th: 8
>         90.0000th: 9
>         95.0000th: 10
>         *99.0000th: 933
>         99.5000th: 981
>         99.9000th: 3068
>         min=0, max=20054
>         rps: 498.31 p95 (usec) 10 p99 (usec) 933 p95/cputime 0.10% p99/cputime 9.33%

It should be noted that this was run on a 64-core machine (if that is still
the case; ref. your previous patch version).

I am curious how much you have tried tweaking both the period and the quota
for this workload. I assume a longer period would help such a bursty
application, and judging from the small slowdowns, a slightly higher quota
could also help. I am not saying this is a bad idea, but we need to understand
what it fixes, and how, in order to understand how and when to use it.
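
For example, a first thing I would try here (numbers purely illustrative, not
something I have tested) is to keep the same 600% average allowance but
stretch the period, or to bump the quota slightly:

    # illustrative only: same 600% average, but over a 200ms period
    echo 1200000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
    echo 200000  > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us

    # or: keep the 100ms period and give ~16% extra headroom
    echo 700000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us

That would at least tell us how much of the tail latency is simply the period
being short relative to the bursts.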

Also, what value of the sysctl kernel.sched_cfs_bandwidth_slice_us are you
using? Which CONFIG_HZ you are using is also interesting, due to how bandwidth
is accounted for; there is some more info about this in
Documentation/scheduler/sched-bwc.rst. I assume a smaller slice value may also
help, and it would be interesting to see what implications it has. A high
threads-to-(quota/period) ratio together with a high bandwidth slice will
probably cause some throttling, so one has to choose between precision and
overhead.
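
For reference, I mean the values as seen with something like this (the config
file location may differ on your setup):

    # bandwidth slice; the default is 5000us
    sysctl kernel.sched_cfs_bandwidth_slice_us
    # CONFIG_HZ of the running kernel
    grep 'CONFIG_HZ=' /boot/config-$(uname -r)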

Also, here you give a burst of 66% of the quota. Would that be a typical value
for a cgroup, or is it just a result of testing? As I understand this patchset,
your example would allow a constant 600% CPU load, then one period with 1000%
load, and then another "long set" of periods with 600% load. Have you discussed
a way of limiting how long a burst can be "saved" before expiring?
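
Just to make sure I am reading the patchset right, the arithmetic I have in
mind for your example is roughly:

    quota/period         = 600000us/100000us           -> 600% sustained load
    (quota+burst)/period = (600000+400000)us/100000us  -> up to 1000% in one period
    the 400000us of "saved" runtime then has to be re-accumulated by running
    below quota before another such burst is possible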

> @@ -9427,7 +9478,8 @@ static int cpu_max_show(struct seq_file *sf, void *v)
>  {
>         struct task_group *tg = css_tg(seq_css(sf));
>
> -       cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
> +       cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg),
> +                              tg_get_cfs_burst(tg));
>         return 0;
>  }

The current cgroup v2 docs say the following:

>   cpu.max
>     A read-write two value file which exists on non-root cgroups.
>     The default is "max 100000".

This will become a "three value file", and I know a few user space projects
that parse this file by splitting on the middle space. I am not sure if they
are "wrong", but I don't think we usually break such things. I am not sure
what Tejun thinks about this.
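
As a (hypothetical) illustration of the kind of parsing I mean:

    # a shell tool reading today's two-value cpu.max
    read quota period < /sys/fs/cgroup/test/cpu.max
    # with a third field appended, $period silently becomes "100000 400000",
    # since read assigns the rest of the line to the last variable

Tools doing the equivalent of awk '{print $2}' would keep working, but
stricter two-field parsers would not.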

Thanks
Odin
