linux-kernel - Re: [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAFpoUr3nUEWYZjAj+cJp_FL7csOMMS-LE73sb-jjfRNY2fEBDA@mail.gmail.com>
Date:   Fri, 21 May 2021 11:38:50 +0200
From:   Odin Ugedal <odin@...d.al>
To:     changhuaixin <changhuaixin@...ux.alibaba.com>
Cc:     Odin Ugedal <odin@...d.al>, Benjamin Segall <bsegall@...gle.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        dtcccc@...ux.alibaba.com, Juri Lelli <juri.lelli@...hat.com>,
        khlebnikov@...dex-team.ru,
        open list <linux-kernel@...r.kernel.org>,
        Mel Gorman <mgorman@...e.de>, Ingo Molnar <mingo@...hat.com>,
        pauld@...head.com, Peter Zijlstra <peterz@...radead.org>,
        Paul Turner <pjt@...gle.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Shanpei Chen <shanpeic@...ux.alibaba.com>,
        Tejun Heo <tj@...nel.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        xiyou.wangcong@...il.com
Subject: Re: [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller

Hi,

> Yeah, it is a well tuned workload and configuration. I did this because for benchmarks
> like schbench, workloads are generated in a fixed pattern without burst. So I set schbench
> params carefully to generate burst during each 100ms periods, to show burst works. Longer
> period or higher quota helps indeed, in which case more workloads can be used to generate
> tail latency then.

Yeah, that makes sense. When it comes to fairness (you are talking
about generating tail
latency), I think configuration of cpu shares/weight between cgroups
is more relevant.

How much more tail latency will a cgroup be able to "create" when
doubling the period?

> In my view, burst is like the cfsb way of token bucket. For the present cfsb, bucket capacity
> is strictly limited to quota. And that is changed into quota + burst now. And it shall be used when
> tasks get throttled and CPU is under utilized for the whole system.

Well, it is as strict as we can make it, depending on how one looks at it. We
cannot guarantee anything more strict than the length of a jiffy or
kernel.sched_cfs_bandwidth_slice_us (simplified ofc.), especially since we allow
runtime from one period to be used in another. I think there is a
"big" distinction between
runtime transferred from the cfs_bw to cfs_rq's in a period compared
to the actual runtime used.

> Default value of kernel.sched_cfs_bandwidth_slice_us(5ms) and CONFIG_HZ(1000) is used.

You should mention that in the msg then, since it is highly relevant
to the results. Can you try to tweak
kernel.sched_cfs_bandwidth_slice_us to something like 1ms, and see
what the result will be?

For such a workload and high cfs_bw_slice, a smaller CONFIG_HZ might
also be beneficial (although
there are many things to consider when talking about that, and a lot
of people know more about that than me).

> The following case might be used to prevent getting throttled from many threads and high bandwidth
> slice:
>
> mkdir /sys/fs/cgroup/cpu/test
> echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
> echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
> echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
>
> ./schbench -m 1 -t 3 -r 20 -c 80000 -R 20
>
> On my machine, two workers work for 80ms and sleep for 120ms in each round. The average utilization is
> around 80%. This will work on a two-core system. It is recommended to  try it multiple times as getting
> throttled doesn't necessarily cause tail latency for schbench.

When I run this, I get the following results without cfs bandwidth enabled.

$ time ./schbench -m 1 -t 3 -r 20 -c 80000 -R 20
Latency percentiles (usec) runtime 20 (s) (398 total samples)
        50.0th: 22 (201 samples)
        75.0th: 50 (158 samples)
        90.0th: 50 (0 samples)
        95.0th: 51 (38 samples)
        *99.0th: 51 (0 samples)
        99.5th: 51 (0 samples)
        99.9th: 52 (1 samples)
        min=5, max=52
rps: 19900000.00 p95 (usec) 51 p99 (usec) 51 p95/cputime 0.06% p99/cputime 0.06%
./schbench -m 1 -t 3 -r 20 -c 80000 -R 20  31.85s user 0.00s system
159% cpu 20.021 total

In this case, I see 80% load on two cores, ending at a total of 160%. If setting
period: 100ms and quota: 100ms (aka. 1 cpu), throttling is what
you would expect, or?. In this case, burst wouldn't matter?

Thanks
Odin