Message-ID: <Y5ijqxqKPVxLebkh@chenyu5-mobl1>
Date:   Wed, 14 Dec 2022 00:09:15 +0800
From:   Chen Yu <yu.c.chen@...el.com>
To:     chenying <chenying.kernel@...edance.com>
CC:     Abel Wu <wuyun.abel@...edance.com>, <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Benjamin Segall <bsegall@...gle.com>,
        linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] sched: Reduce rq lock contention in load_balance()

On 2022-12-13 at 11:13:24 +0800, chenying wrote:
> From: chenying <chenying.kernel@...edance.com>
> 
> When doing newidle load balancing, multiple CPUs may pick the same busiest
> rq and then contend on its rq->lock. However, it is often the case that
> after the first load balance, that rq is no longer the busiest, so the
> other CPUs end up waiting for the lock pointlessly.
> 
> Add rq->balancing to quickly check whether the busiest rq has already been
> selected by load_balance() on another CPU. If it has, clear that rq's CPU
> from load_balance_mask and then goto refind.
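
A condensed sketch of the control flow this adds (taken from the diff at
the end of this mail, not a complete function):

	refind:
		busiest = find_busiest_queue(&env, group);
		if (!busiest)
			goto out_balanced;

		if (READ_ONCE(busiest->balancing)) {
			/* Another CPU is already balancing this rq: drop its
			 * CPU from the candidate mask and look for the next
			 * busiest rq in the group. */
			__cpumask_clear_cpu(cpu_of(busiest), cpus);
			if (cpumask_intersects(sched_group_span(group), cpus))
				goto refind;
			goto out_balanced;
		}
		/* lock busiest, set ->balancing, move tasks, clear ->balancing */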
> 
> The test results show that this patch reduces rq lock contention by ~30%
> with no scheduling latency degradation.
> 
> unpatched:
> lock_stat version 0.4
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>                 class name  con-bounces  contentions  waittime-min  waittime-max  waittime-total  waittime-avg  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total  holdtime-avg
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 
>                  &rq->lock:        25532        26471          0.09         22.86        42250.81          1.60      1232063       6586225          0.05         40.54     10280028.19          1.56
>                  ---------
>                  &rq->lock           1310  [<0000000081600630>] __schedule+0xa9/0x800
>                  &rq->lock           1430  [<00000000754f510d>] try_to_wake_up+0x206/0x710
>                  &rq->lock          15426  [<0000000020af4cb5>] update_blocked_averages+0x30/0x6f0
>                  &rq->lock           1449  [<00000000dc949053>] _nohz_idle_balance+0x116/0x250
>                  ---------
>                  &rq->lock           3329  [<00000000754f510d>] try_to_wake_up+0x206/0x710
>                  &rq->lock           1241  [<0000000081600630>] __schedule+0xa9/0x800
>                  &rq->lock          15480  [<0000000020af4cb5>] update_blocked_averages+0x30/0x6f0
>                  &rq->lock           5333  [<000000004969102f>] load_balance+0x3b7/0xe40
>
Does the scenario above indicate that one CPU is trying to grab the rq lock
in __schedule, try_to_wake_up, update_blocked_averages or _nohz_idle_balance,
while the lock is held by another CPU at load_balance+0x3b7/0xe40, and that
this patch is trying to avoid grabbing the rq lock in load_balance() as much
as possible?
And it seems that update_blocked_averages is quite contended too.
> patched:
> lock_stat version 0.4
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>                 class name  con-bounces  contentions  waittime-min  waittime-max  waittime-total  waittime-avg  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total  holdtime-avg
> .......................................................................................................................................................................
> 
>                  &rq->lock:        17497        18300          0.09         23.15        32152.22          1.76      1137409       6484176          0.05         40.19     10125220.60          1.56
>                  ---------
>                  &rq->lock          12298  [<000000004314e22f>] update_blocked_averages+0x30/0x6f0
>                  &rq->lock           1005  [<000000005b222b90>] __schedule+0xa9/0x800
>                  &rq->lock           1271  [<00000000c7a66a89>] try_to_wake_up+0x206/0x710
>                  &rq->lock           1380  [<00000000eac23b6b>] load_balance+0x560/0xe70
>                  ---------
>                  &rq->lock           2962  [<00000000c7a66a89>] try_to_wake_up+0x206/0x710
>                  &rq->lock          11844  [<000000004314e22f>] update_blocked_averages+0x30/0x6f0
>                  &rq->lock            592  [<0000000032421516>] scheduler_tick+0x4f/0xf0
>                  &rq->lock           1243  [<000000005b222b90>] __schedule+0xa9/0x800
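
(For reference: contentions drop from 26471 to 18300 here, i.e.
(26471 - 18300) / 26471 ≈ 31%, in line with the ~30% reduction claimed
above; waittime-total drops from 42250.81 to 32152.22, about 24%.)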
> 
> unpatched:
>  # ./runqlat 60 1
> 
>     usecs               : count     distribution
>          0 -> 1          : 1172     |                                        |
>          2 -> 3          : 210063   |************************                |
>          4 -> 7          : 337576   |****************************************|
>          8 -> 15         : 24555    |**                                      |
>         16 -> 31         : 13598    |*                                       |
>         32 -> 63         : 779      |                                        |
>         64 -> 127        : 230      |                                        |
>        128 -> 255        : 83       |                                        |
>        256 -> 511        : 54       |                                        |
>        512 -> 1023       : 62       |                                        |
>       1024 -> 2047       : 123      |                                        |
>       2048 -> 4095       : 283      |                                        |
>       4096 -> 8191       : 1362     |                                        |
>       8192 -> 16383      : 2775     |                                        |
>      16384 -> 32767      : 52352    |******                                  |
>      32768 -> 65535      : 14       |                                        |
>      65536 -> 131071     : 140      |                                        |
> 
>  patched:
>  # ./runqlat 60 1
> 
>     usecs               : count     distribution
>          0 -> 1          : 1091     |                                        |
>          2 -> 3          : 205259   |***********************                 |
>          4 -> 7          : 351620   |****************************************|
>          8 -> 15         : 27812    |***                                     |
>         16 -> 31         : 13971    |*                                       |
>         32 -> 63         : 727      |                                        |
>         64 -> 127        : 198      |                                        |
>        128 -> 255        : 103      |                                        |
>        256 -> 511        : 61       |                                        |
>        512 -> 1023       : 45       |                                        |
>       1024 -> 2047       : 108      |                                        |
>       2048 -> 4095       : 271      |                                        |
>       4096 -> 8191       : 1342     |                                        |
>       8192 -> 16383      : 2732     |                                        |
>      16384 -> 32767      : 49367    |*****                                   |
>      32768 -> 65535      : 8        |                                        |
>      65536 -> 131071     : 183      |                                        |
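
(The two distributions are essentially identical, e.g. the
16384 -> 32767 us bucket goes from 52352 to 49367 samples, which supports
the claim of no scheduling latency degradation.)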
> 
> Below is the script to run the sysbench workload:
> 
>         #!/bin/bash
> 
>         mkdir /sys/fs/cgroup/cpuset/test1
>         echo 12,14,16,18,20,22 > /sys/fs/cgroup/cpuset/test1/cpuset.cpus
>         echo 0 > /sys/fs/cgroup/cpuset/test1/cpuset.mems
> 
>         mkdir /sys/fs/cgroup/cpuset/test2
>         echo 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46 > /sys/fs/cgroup/cpuset/test2/cpuset.cpus
>         echo 0 > /sys/fs/cgroup/cpuset/test2/cpuset.mems
> 
>         cgexec -g cpuset:test1 sysbench --test=cpu --cpu-max-prime=200000 run --num-threads=24 --rate=100 --time=6000 &
>         cgexec -g cpuset:test2 sysbench --test=cpu --cpu-max-prime=200000 run --num-threads=96 --rate=100 --time=6000 &
>
May I know how many CPUs there are in the system, 46 * 2? So this test
saturates test1, and idle CPUs in test2 continuously try to pull tasks from
test1 but fail due to affinity, which introduces rq lock contention?
> Suggested-by: Abel Wu <wuyun.abel@...edance.com>
> Signed-off-by: chenying <chenying.kernel@...edance.com>
> ---
>  kernel/sched/core.c  |  1 +
>  kernel/sched/fair.c  | 11 +++++++++++
>  kernel/sched/sched.h |  1 +
>  3 files changed, 13 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index daff72f00385..ca4fa84c8751 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9737,6 +9737,7 @@ void __init sched_init(void)
>                 rq->rd = NULL;
>                 rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
>                 rq->balance_callback = &balance_push_callback;
> +               rq->balancing = false;
Maybe rq->balancing = 0, because balancing is not a bool.
>                 rq->active_balance = 0;
>                 rq->next_balance = jiffies;
>                 rq->push_cpu = 0;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e4a0b8bd941c..aeb4fa9ac93a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10295,6 +10295,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>                 goto out_balanced;
>         }
> 
> +refind:
>         busiest = find_busiest_queue(&env, group);
>         if (!busiest) {
>                 schedstat_inc(sd->lb_nobusyq[idle]);
> @@ -10303,6 +10304,14 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> 
>         WARN_ON_ONCE(busiest == env.dst_rq);
> 
> +       if (READ_ONCE(busiest->balancing)) {
rq->balancing is not protected by the lock, so there could be a race
condition, but I think that is OK as a trade-off; a small userspace sketch
of the benign-race pattern follows after this hunk.
> +               __cpumask_clear_cpu(cpu_of(busiest), cpus);
> +               if (cpumask_intersects(sched_group_span(group), cpus))
> +                       goto refind;
> +
> +               goto out_balanced;
> +       }
> +
>         schedstat_add(sd->lb_imbalance[idle], env.imbalance);
> 
>         env.src_cpu = busiest->cpu;
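
To illustrate why the unlocked flag looks acceptable, here is a userspace
analogy (a hypothetical sketch, not kernel code): a stale 0 only falls back
to the old contended path, and a stale 1 only skips one candidate, while
correctness still comes from the lock itself. Build with gcc -pthread.

	#include <pthread.h>
	#include <stdio.h>

	static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;
	static int balancing;			/* stands in for rq->balancing */

	static void *balance(void *arg)
	{
		long cpu = (long)arg;

		/* Advisory check without the lock, like READ_ONCE() above. */
		if (__atomic_load_n(&balancing, __ATOMIC_RELAXED)) {
			printf("cpu%ld: busiest already being balanced, refind\n", cpu);
			return NULL;		/* would try the next candidate */
		}

		pthread_mutex_lock(&rq_lock);	/* the real work is still locked */
		__atomic_store_n(&balancing, 1, __ATOMIC_RELAXED);
		printf("cpu%ld: pulling tasks\n", cpu);
		__atomic_store_n(&balancing, 0, __ATOMIC_RELAXED);
		pthread_mutex_unlock(&rq_lock);
		return NULL;
	}

	int main(void)
	{
		pthread_t a, b;

		pthread_create(&a, NULL, balance, (void *)1L);
		pthread_create(&b, NULL, balance, (void *)2L);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		return 0;
	}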
> @@ -10323,6 +10332,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  more_balance:
>                 rq_lock_irqsave(busiest, &rf);
>                 update_rq_clock(busiest);
> +               WRITE_ONCE(busiest->balancing, true);
		WRITE_ONCE(busiest->balancing, 1)
> 
>                 /*
>                  * cur_ld_moved - load moved in current iteration
> @@ -10338,6 +10348,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>                  * See task_rq_lock() family for the details.
>                  */
> 
> +               WRITE_ONCE(busiest->balancing, false);
		WRITE_ONCE(busiest->balancing, 0)

thanks,
Chenyu
