Date:   Wed, 9 Mar 2022 16:30:51 +0800
From:   Tianchen Ding <dtcccc@...ux.alibaba.com>
To:     Tejun Heo <tj@...nel.org>
Cc:     Zefan Li <lizefan.x@...edance.com>, Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Michael Wang <yun.wang@...ux.alibaba.com>,
        Cruz Zhao <cruzzhao@...ux.alibaba.com>,
        Masahiro Yamada <masahiroy@...nel.org>,
        Nathan Chancellor <nathan@...nel.org>,
        Kees Cook <keescook@...omium.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        "Gustavo A. R. Silva" <gustavoars@...nel.org>,
        Arnd Bergmann <arnd@...db.de>, Miguel Ojeda <ojeda@...nel.org>,
        Chris Down <chris@...isdown.name>,
        Vipin Sharma <vipinsh@...gle.com>,
        Daniel Borkmann <daniel@...earbox.net>,
        linux-kernel@...r.kernel.org, cgroups@...r.kernel.org
Subject: Re: [RFC PATCH v2 0/4] Introduce group balancer

On 2022/3/9 01:13, Tejun Heo wrote:
> Hello,
> 
> On Tue, Mar 08, 2022 at 05:26:25PM +0800, Tianchen Ding wrote:
>> Modern platforms are growing fast in CPU count. To make better
>> use of CPU resources, multiple apps are starting to share the CPUs.
>>
>> What we need is a way to ease contention in this shared mode,
>> making groups as exclusive as possible, to gain both performance
>> and resource efficiency.
>>
>> The main idea of the group balancer is to fulfill this requirement
>> by balancing groups of tasks among groups of CPUs; consider this
>> a dynamic demi-exclusive mode. A task triggers work to settle its
>> group into a proper partition (the one with minimum predicted
>> load), then tries to migrate itself into it, gradually settling
>> groups into the most exclusive partitions.
>>
>> GB can be seen as an optimization policy built on top of load
>> balancing; it obeys the main idea of load balancing and makes
>> adjustments based on that.
>>
>> Our test on an ARM64 platform with 128 CPUs shows that
>> throughput of sysbench memory is improved by about 25%,
>> and redis-benchmark is improved by up to about 10%.
> 
> The motivation makes sense to me but I'm not sure this is the right way to
> architect it. We already have the framework to do all of this - the sched
> domains and the load balancer. Architecturally, what the suggested patchset
> is doing is building a separate load balancer on top of cpuset after using
> cpuset to disable the existing load balancer, which is rather obviously
> convoluted.
> 

"the sched domains and the load balancer" you mentioned are the ways to 
"balance" tasks within each domain. However, this patchset aims to "group" 
them together to win hot caches and reduce competition, which is different 
from load balancing. See the commit log of patch 3/4 and this link:
https://lore.kernel.org/all/11d4c86a-40ef-6ce5-6d08-e9d0bc9b512a@linux.alibaba.com/
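To make the grouping idea concrete, the "settle" step from the cover letter can be sketched roughly as follows. This is a hypothetical, simplified Python model: the function names and the load-prediction rule are illustrative assumptions, not the patchset's actual kernel implementation.

```python
# Hypothetical sketch of the group-balancer "settle" step: each task
# group is steered toward the CPU partition with the minimum predicted
# load, so groups gradually become (demi-)exclusive to partitions.

def predicted_load(partition_load, group_load, group_partition, partition):
    """Load the partition would carry if the group settled there."""
    load = partition_load[partition]
    if group_partition != partition:
        load += group_load  # the group's load would migrate in
    return load

def settle(group_load, group_partition, partitions, partition_load):
    """Pick the partition with minimum predicted load for this group."""
    return min(partitions,
               key=lambda p: predicted_load(partition_load, group_load,
                                            group_partition, p))

# Example: two partitions (say, two NUMA nodes); the group currently
# sits on the busy partition p0, so it settles onto p1.
partitions = {"p0": 80, "p1": 20}
print(settle(30, "p0", list(partitions), partitions))  # -> p1
```

Tasks then migrate themselves toward the chosen partition over time, rather than being hard-bound to it, which is the "soft bind" behavior discussed below.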

> * AFAICS, none of what the suggested code does is all that complicated or
>    needs a lot of input from userspace. It should be possible to parametrize
>    the existing load balancer to behave better.
> 

The group balancer mainly needs two inputs from userspace: CPU partition 
info and cgroup info.
CPU partition info does need user input (and may be a bit complicated). 
In return, the division method is entirely __free__ to users (it can 
follow NUMA nodes, clusters, caches, etc.).
Cgroup info doesn't need extra input; it's configured naturally.
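As an illustration of the NUMA-based division a user might choose, here is a small sketch that renders one partition per NUMA node for a 128-CPU machine. The spec format and the helper are made up for illustration; they are not the patchset's actual interface.

```python
# Hypothetical illustration: build a CPU-partition string with one
# partition per NUMA node, each rendered as a "first-last" CPU range.
numa_nodes = {0: range(0, 64), 1: range(64, 128)}  # 128-CPU machine

def partition_spec(nodes):
    """Render each node's CPU range as 'first-last', ';'-separated."""
    parts = []
    for node in sorted(nodes):
        cpus = list(nodes[node])
        parts.append(f"{cpus[0]}-{cpus[-1]}")
    return ";".join(parts)

print(partition_spec(numa_nodes))  # -> 0-63;64-127
```

A user could equally well divide by cluster or last-level cache instead of by node; the point is only that the choice of boundaries rests with userspace.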

It does parametrize the existing load balancer to behave better.
The group balancer is an optimization policy; it obeys the basic
policy (load balancing) and improves on it.
The relationship between the load balancer and the group balancer is 
explained in detail at the link above.

> * If, for some reason, you need more customizable behavior in terms of cpu
>    allocation, which is what cpuset is for, maybe it'd be better to build the
>    load balancer in userspace. That'd fit way better with how cgroup is used
>    in general and, with threaded cgroups, it should fit nicely with everything
>    else.
> 

We put the group balancer in kernel space because this new policy does 
not depend on userspace apps; it's a "general" feature.
Doing "dynamic cpuset" in userspace may also introduce performance 
issues, since it may need to bind and unbind different cpusets several 
times, and it is too strict (compared with our "soft bind").
