linux-kernel - Re: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on local idle core first

Message-ID: <19664c68f77f5b23a86e5636a17ad2cbfa073f78.camel@gmx.de>
Date:   Tue, 16 May 2023 08:23:35 +0200
From:   Mike Galbraith <efault@....de>
To:     Chen Yu <yu.c.chen@...el.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>
Cc:     Mel Gorman <mgorman@...hsingularity.net>,
        Tim Chen <tim.c.chen@...el.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        K Prateek Nayak <kprateek.nayak@....com>,
        Abel Wu <wuyun.abel@...edance.com>,
        Yicong Yang <yangyicong@...ilicon.com>,
        "Gautham R . Shenoy" <gautham.shenoy@....com>,
        Len Brown <len.brown@...el.com>,
        Chen Yu <yu.chen.surf@...il.com>,
        Arjan Van De Ven <arjan.van.de.ven@...el.com>,
        Aaron Lu <aaron.lu@...el.com>, Barry Song <baohua@...nel.org>,
        linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on
 local idle core first

On Tue, 2023-05-16 at 09:11 +0800, Chen Yu wrote:
> [Problem Statement]
>
...

> 20.26%    19.89%  [kernel.kallsyms]          [k] update_cfs_group
> 13.53%    12.15%  [kernel.kallsyms]          [k] update_load_avg

Yup, that's a serious problem, but...

> [Benchmark]
>
> The baseline is on sched/core branch on top of
> commit a6fcdd8d95f7 ("sched/debug: Correct printing for rq->nr_uninterruptible")
>
> Tested will-it-scale context_switch1 case, it shows good improvement
> both on a server and a desktop:
>
> Intel(R) Xeon(R) Platinum 8480+, Sapphire Rapids 2 x 56C/112T = 224 CPUs
> context_switch1_processes -s 100 -t 112 -n
> baseline                   SIS_PAIR
> 1.0                        +68.13%
>
> Intel Core(TM) i9-10980XE, Cascade Lake 18C/36T
> context_switch1_processes -s 100 -t 18 -n
> baseline                   SIS_PAIR
> 1.0                        +45.2%

git@...er: ./context_switch1_processes -s 100 -t 8 -n
(running in an autogroup)

   PerfTop:   30853 irqs/sec  kernel:96.8%  exact: 96.8% lost: 0/0 drop: 0/0 [4000Hz cycles],  (all, 8 CPUs)
------------------------------------------------------------------------------------------------------------

     5.72%  [kernel]       [k] switch_mm_irqs_off
     4.23%  [kernel]       [k] __update_load_avg_se
     3.76%  [kernel]       [k] __update_load_avg_cfs_rq
     3.70%  [kernel]       [k] __schedule
     3.65%  [kernel]       [k] entry_SYSCALL_64
     3.22%  [kernel]       [k] enqueue_task_fair
     2.91%  [kernel]       [k] update_curr
     2.67%  [kernel]       [k] select_task_rq_fair
     2.60%  [kernel]       [k] pipe_read
     2.55%  [kernel]       [k] __switch_to
     2.54%  [kernel]       [k] __calc_delta
     2.44%  [kernel]       [k] dequeue_task_fair
     2.38%  [kernel]       [k] reweight_entity
     2.13%  [kernel]       [k] pipe_write
     1.96%  [kernel]       [k] restore_fpregs_from_fpstate
     1.93%  [kernel]       [k] select_idle_smt
     1.77%  [kernel]       [k] update_load_avg <==
     1.73%  [kernel]       [k] native_sched_clock
     1.66%  [kernel]       [k] try_to_wake_up
     1.52%  [kernel]       [k] _raw_spin_lock_irqsave
     1.47%  [kernel]       [k] update_min_vruntime
     1.42%  [kernel]       [k] update_cfs_group <==
     1.36%  [kernel]       [k] vfs_write
     1.32%  [kernel]       [k] prepare_to_wait_event

...not one with global scope.  My little i7-4790 can play ping-pong all
day long, as can untold numbers of other boxen around the globe.
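
For readers unfamiliar with the workload: the pipe_read/pipe_write entries in the profile above come from two processes bouncing a byte over pipes. The following is a minimal sketch of that kind of ping-pong, not the will-it-scale source; the function name pingpong_round_trips() is hypothetical, and no CPU pinning is done, so placement is left entirely to the scheduler.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: bounce one byte between a parent and a forked
 * child over two pipes. Returns the number of completed round trips,
 * or -1 if the pipes or the child could not be created.
 */
long pingpong_round_trips(long iterations)
{
    int to_child[2], to_parent[2];
    char token = 'x';
    long done;
    pid_t pid;

    assert(iterations >= 0);

    if (pipe(to_child) < 0)
        return -1;
    if (pipe(to_parent) < 0) {
        close(to_child[0]);
        close(to_child[1]);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        close(to_child[0]);
        close(to_child[1]);
        close(to_parent[0]);
        close(to_parent[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: echo each byte back until the parent closes its end. */
        close(to_child[1]);
        close(to_parent[0]);
        while (read(to_child[0], &token, 1) == 1) {
            if (write(to_parent[1], &token, 1) != 1)
                break;
        }
        _exit(0);
    }

    /* Parent: each write wakes the child, each read sleeps until it replies. */
    close(to_child[0]);
    close(to_parent[1]);
    for (done = 0; done < iterations; done++) {
        if (write(to_child[1], &token, 1) != 1)
            break;
        if (read(to_parent[0], &token, 1) != 1)
            break;
    }

    close(to_child[1]);     /* child sees EOF and exits */
    close(to_parent[0]);
    waitpid(pid, NULL, 0);
    return done;
}
```

Every round trip is two wakeups and two context switches with essentially no work in between, which is why scheduler bookkeeping dominates the profile. Timing a call such as pingpong_round_trips(1000000) gives a rough round-trip rate; comparing placements would additionally require pinning each process, e.g. with sched_setaffinity(2).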

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0ca13ac..e65028dcd6a6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7125,6 +7125,21 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>             asym_fits_cpu(task_util, util_min, util_max, target))
>                 return target;
>  
> +       /*
> +        * If the waker and the wakee are good friends to each other,
> +        * putting them within the same SMT domain could reduce C2C
> +        * overhead. SMT idle sibling should be preferred to wakee's
> +        * previous CPU, because the latter could still have the risk of C2C
> +        * overhead.
> +        */
> +       if (sched_feat(SIS_PAIR) && sched_smt_active() &&
> +           current->last_wakee == p && p->last_wakee == current) {
> +               i = select_idle_smt(p, smp_processor_id());
> +
> +               if ((unsigned int)i < nr_cpumask_bits)
> +                       return i;
> +       }
> +
>         /*
>          * If the previous CPU is cache affine and idle, don't be stupid:
>          */
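
The condition in the quoted hunk only fires when waker and wakee each most recently woke the other. A userspace sketch of that mutual check follows, assuming a cut-down struct task with just the last_wakee pointer; struct task, record_wakeup() and is_wakeup_pair() are hypothetical names for illustration, and the kernel's own bookkeeping for last_wakee does more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the single task_struct field the check reads. */
struct task {
    struct task *last_wakee;    /* task this one most recently woke */
};

/* Simplified model of a wakeup: the waker remembers whom it woke. */
void record_wakeup(struct task *waker, struct task *wakee)
{
    assert(waker != NULL && wakee != NULL);
    waker->last_wakee = wakee;
}

/*
 * Mirrors the patch's test
 *     current->last_wakee == p && p->last_wakee == current
 * and is true only when each task's most recent wakee is the other one.
 */
bool is_wakeup_pair(const struct task *waker, const struct task *wakee)
{
    assert(waker != NULL && wakee != NULL);
    return waker->last_wakee == wakee && wakee->last_wakee == waker;
}
```

A strict two-task ping-pong satisfies this on every wakeup after the first exchange, while a task that wakes several different partners drops out of the pairing as soon as it wakes someone else.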

Global scope solutions for non-global issues tend to not work out.  

Below is a sample of potential scaling wreckage for boxen that are NOT
akin to the one you're watching turn caches into silicon based pudding.

Note the *_RR numbers.  Those poked me in the eye because they closely
resemble pipe ping-pong, all fun and games with about as close to zero
work other than scheduling as network-land can get, but for my box, SMT
was the third best option of three.

You just can't beat idle core selection when it comes to getting work
done, which is why SIS evolved to select cores first.

Your box and ilk need help that treats the disease and not the symptom,
or barring that, help that precisely targets boxen having the disease.

	-Mike

10 seconds of 1 netperf client/server instance, no knobs twiddled.

TCP_SENDFILE-1  stacked    Avg:  65387
TCP_SENDFILE-1  cross-smt  Avg:  65658
TCP_SENDFILE-1  cross-core Avg:  96318

TCP_STREAM-1    stacked    Avg:  44322
TCP_STREAM-1    cross-smt  Avg:  42390
TCP_STREAM-1    cross-core Avg:  77850

TCP_MAERTS-1    stacked    Avg:  36636
TCP_MAERTS-1    cross-smt  Avg:  42333
TCP_MAERTS-1    cross-core Avg:  74122

UDP_STREAM-1    stacked    Avg:  52618
UDP_STREAM-1    cross-smt  Avg:  55298
UDP_STREAM-1    cross-core Avg:  97415

TCP_RR-1        stacked    Avg: 242606
TCP_RR-1        cross-smt  Avg: 140863
TCP_RR-1        cross-core Avg: 219400

UDP_RR-1        stacked    Avg: 282253
UDP_RR-1        cross-smt  Avg: 202062
UDP_RR-1        cross-core Avg: 288620
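
To make the *_RR comparison easier to read, the averages above can be reduced to percent differences. This is a small sketch with hypothetical helper names (relative_pct(), slowest_index()); the arrays simply copy three rows of the table in the order stacked, cross-smt, cross-core.

```c
#include <assert.h>
#include <stddef.h>

/* Percent difference of value relative to baseline. */
double relative_pct(double value, double baseline)
{
    assert(baseline != 0.0);
    return (value / baseline - 1.0) * 100.0;
}

/* Index of the smallest average, i.e. the slowest placement. */
size_t slowest_index(const double *avg, size_t n)
{
    size_t worst = 0;

    assert(avg != NULL && n > 0);
    for (size_t i = 1; i < n; i++) {
        if (avg[i] < avg[worst])
            worst = i;
    }
    return worst;
}

/* Averages from the table; order: stacked, cross-smt, cross-core. */
const double tcp_rr[3]     = { 242606, 140863, 219400 };
const double udp_rr[3]     = { 282253, 202062, 288620 };
const double tcp_stream[3] = {  44322,  42390,  77850 };
```

With these numbers, cross-smt comes out slowest for both TCP_RR and UDP_RR: roughly 42% below stacked for TCP_RR and roughly 30% below cross-core for UDP_RR. For TCP_STREAM, cross-core is roughly 76% above stacked.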