linux-kernel - Re: [RFC PATCH v3] sched/fair: select idle cpu from idle cpumask for task wakeup


Message-ID: <ac73a9e2-8cc0-b1fe-fc2b-14b9cb21c8bf@linux.intel.com>
Date:   Mon, 9 Nov 2020 21:40:53 +0800
From:   "Li, Aubrey" <aubrey.li@...ux.intel.com>
To:     Valentin Schneider <valentin.schneider@....com>
Cc:     mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, dietmar.eggemann@....com,
        rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
        tim.c.chen@...ux.intel.com, linux-kernel@...r.kernel.org,
        Aubrey Li <aubrey.li@...el.com>,
        Qais Yousef <qais.yousef@....com>,
        Jiang Biao <benbjiang@...il.com>
Subject: Re: [RFC PATCH v3] sched/fair: select idle cpu from idle cpumask for
 task wakeup

On 2020/11/7 5:20, Valentin Schneider wrote:
> 
> On 21/10/20 16:03, Aubrey Li wrote:
>> From: Aubrey Li <aubrey.li@...el.com>
>>
>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>> enters idle, its corresponding bit in the idle cpumask will be set,
>> and when the CPU exits idle, its bit will be cleared.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has low cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
> 
> FWIW I gave this a spin on my arm64 desktop (Ampere eMAG, 32 core). I get
> some barely noticeable (AIUI not statistically significant for bench sched)
> changes for 100 iterations of:
> 
> | bench                              | metric   |   mean |     std |    q90 |    q99 |
> |------------------------------------+----------+--------+---------+--------+--------|
> | hackbench --loops 5000 --groups 1  | duration | -1.07% |  -2.23% | -0.88% | -0.25% |
> | hackbench --loops 5000 --groups 2  | duration | -0.79% | +30.60% | -0.49% | -0.74% |
> | hackbench --loops 5000 --groups 4  | duration | -0.54% |  +6.99% | -0.21% | -0.12% |
> | perf bench sched pipe -T -l 100000 | ops/sec  | +1.05% |  -2.80% | -0.17% | +0.39% |
> 
> q90 & q99 being the 90th and 99th percentile.
> 
> Base was tip/sched/core at:
> d8fcb81f1acf ("sched/fair: Check for idle core in wake_affine")

Thanks for the data, Valentin! So does the negative value mean improvement?

If so, the data looks as expected to me. We set a CPU's bit in the idle cpumask
every time it enters idle, but only clear it at tick frequency. If the workload
is not heavy enough, CPUs go idle often between two ticks, so the idle cpumask
is almost equal to sched_domain_span(sd) and scanning it makes no difference.

But if the system load is heavy enough, CPUs have little or no chance to enter
idle, so their bits get cleared at the tick. The number of bits set in
sds_idle_cpus(sd->shared) is then far smaller than the number set in
sched_domain_span(sd) if the LLC domain has a large number of CPUs.
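
To make the two cases above concrete, here is a toy model in Python. It is only
a sketch of the set-on-idle-entry / clear-at-tick behaviour described in this
mail, not the kernel code: the class and method names are made up for
illustration, and the "scan width" is simply the number of CPUs a wakeup scan
would have to look at.

```python
class IdleMaskModel:
    """Toy model (hypothetical names): set bit on idle entry, clear only at tick."""

    def __init__(self, ncpus):
        self.span = set(range(ncpus))   # stands in for sched_domain_span(sd)
        self.idle_mask = set()          # stands in for sds_idle_cpus(sd->shared)
        self.busy_now = set()           # CPUs currently running a task

    def enter_idle(self, cpu):
        # The bit is set immediately when the CPU enters idle.
        self.idle_mask.add(cpu)
        self.busy_now.discard(cpu)

    def exit_idle(self, cpu):
        # Assumption taken from this mail: the bit is NOT cleared here.
        self.busy_now.add(cpu)

    def tick(self, cpu):
        # The bit is cleared only if the CPU is busy when the tick fires.
        if cpu in self.busy_now:
            self.idle_mask.discard(cpu)

    def scan_width(self):
        # Number of CPUs a wakeup scan restricted to the idle mask would visit.
        return len(self.span & self.idle_mask)


NCPUS = 192

# Light load: CPUs wake briefly but are idle again before the next tick.
light = IdleMaskModel(NCPUS)
for cpu in range(NCPUS):
    light.enter_idle(cpu)
for cpu in range(190):
    light.exit_idle(cpu)
    light.enter_idle(cpu)
for cpu in range(NCPUS):
    light.tick(cpu)

# Heavy load: all but 4 CPUs stay busy across the tick.
heavy = IdleMaskModel(NCPUS)
for cpu in range(NCPUS):
    heavy.enter_idle(cpu)
for cpu in range(NCPUS - 4):
    heavy.exit_idle(cpu)
for cpu in range(NCPUS):
    heavy.tick(cpu)

light_width = light.scan_width()
heavy_width = heavy.scan_width()
print("light load scan width:", light_width)
print("heavy load scan width:", heavy_width)
```

In the light case the mask still covers the whole span, so nothing is gained;
in the heavy case only the few CPUs that were idle at the tick remain in the mask.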

For example, when I ran 4x-overcommitted uperf on a system with 192 CPUs,
I observed:
- default: the average of this_sd->avg_scan_cost is 223.12ns
- patch: the average of this_sd->avg_scan_cost is 63.4ns

select_idle_cpu is called 7670253 times per second, so the scan cost saved for
every CPU is (223.12 - 63.4)ns * 7670253 / 192 = 6.4ms per second. As a result,
I saw uperf throughput improve by 60+%.
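
The arithmetic can be checked with a few lines of Python. This is only a worked
version of the numbers quoted in this mail; the variable names are mine, and it
assumes the call count is system-wide (hence the division by 192) and that the
scan costs are in nanoseconds.

```python
# Figures quoted in this mail.
default_scan_ns = 223.12      # average this_sd->avg_scan_cost, default kernel
patched_scan_ns = 63.4        # average this_sd->avg_scan_cost, with the patch
calls_per_sec = 7_670_253     # select_idle_cpu calls per second (assumed system-wide)
ncpus = 192

# Scan time saved on each call, in nanoseconds.
saved_ns_per_call = default_scan_ns - patched_scan_ns

# Scan time saved per CPU for each second of wall time, in milliseconds.
saved_ms_per_cpu = saved_ns_per_call * calls_per_sec / ncpus / 1e6

print(f"saved per call: {saved_ns_per_call:.2f} ns")
print(f"saved per CPU per second: {saved_ms_per_cpu:.1f} ms")
```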

Thanks,
-Aubrey