Message-ID: <ZGNBt7vWJ3fDs5Sc@chenyu5-mobl1>
Date:   Tue, 16 May 2023 16:41:27 +0800
From:   Chen Yu <yu.c.chen@...el.com>
To:     Mike Galbraith <efault@....de>
CC:     Peter Zijlstra <peterz@...radead.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Tim Chen <tim.c.chen@...el.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        "Steven Rostedt" <rostedt@...dmis.org>,
        K Prateek Nayak <kprateek.nayak@....com>,
        "Abel Wu" <wuyun.abel@...edance.com>,
        Yicong Yang <yangyicong@...ilicon.com>,
        "Gautham R . Shenoy" <gautham.shenoy@....com>,
        Len Brown <len.brown@...el.com>,
        Chen Yu <yu.chen.surf@...il.com>,
        Arjan Van De Ven <arjan.van.de.ven@...el.com>,
        Aaron Lu <aaron.lu@...el.com>, Barry Song <baohua@...nel.org>,
        <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on
 local idle core first

On 2023-05-16 at 08:23:35 +0200, Mike Galbraith wrote:
> On Tue, 2023-05-16 at 09:11 +0800, Chen Yu wrote:
> > [Problem Statement]
> >
> ...
> 
> > 20.26%    19.89%  [kernel.kallsyms]          [k] update_cfs_group
> > 13.53%    12.15%  [kernel.kallsyms]          [k] update_load_avg
> 
> Yup, that's a serious problem, but...
> 
> > [Benchmark]
> >
> > The baseline is on sched/core branch on top of
> > commit a6fcdd8d95f7 ("sched/debug: Correct printing for rq->nr_uninterruptible")
> >
> > Tested will-it-scale context_switch1 case, it shows good improvement
> > both on a server and a desktop:
> >
> > Intel(R) Xeon(R) Platinum 8480+, Sapphire Rapids 2 x 56C/112T = 224 CPUs
> > context_switch1_processes -s 100 -t 112 -n
> > baseline                   SIS_PAIR
> > 1.0                        +68.13%
> >
> > Intel Core(TM) i9-10980XE, Cascade Lake 18C/36T
> > context_switch1_processes -s 100 -t 18 -n
> > baseline                   SIS_PAIR
> > 1.0                        +45.2%
> 
> git@...er: ./context_switch1_processes -s 100 -t 8 -n
> (running in an autogroup)
> 
>    PerfTop:   30853 irqs/sec  kernel:96.8%  exact: 96.8% lost: 0/0 drop: 0/0 [4000Hz cycles],  (all, 8 CPUs)
> ------------------------------------------------------------------------------------------------------------
> 
>      5.72%  [kernel]       [k] switch_mm_irqs_off
>      4.23%  [kernel]       [k] __update_load_avg_se
>      3.76%  [kernel]       [k] __update_load_avg_cfs_rq
>      3.70%  [kernel]       [k] __schedule
>      3.65%  [kernel]       [k] entry_SYSCALL_64
>      3.22%  [kernel]       [k] enqueue_task_fair
>      2.91%  [kernel]       [k] update_curr
>      2.67%  [kernel]       [k] select_task_rq_fair
>      2.60%  [kernel]       [k] pipe_read
>      2.55%  [kernel]       [k] __switch_to
>      2.54%  [kernel]       [k] __calc_delta
>      2.44%  [kernel]       [k] dequeue_task_fair
>      2.38%  [kernel]       [k] reweight_entity
>      2.13%  [kernel]       [k] pipe_write
>      1.96%  [kernel]       [k] restore_fpregs_from_fpstate
>      1.93%  [kernel]       [k] select_idle_smt
>      1.77%  [kernel]       [k] update_load_avg <==
>      1.73%  [kernel]       [k] native_sched_clock
>      1.66%  [kernel]       [k] try_to_wake_up
>      1.52%  [kernel]       [k] _raw_spin_lock_irqsave
>      1.47%  [kernel]       [k] update_min_vruntime
>      1.42%  [kernel]       [k] update_cfs_group <==
>      1.36%  [kernel]       [k] vfs_write
>      1.32%  [kernel]       [k] prepare_to_wait_event
> 
> ...not one with global scope.  My little i7-4790 can play ping-pong all
> day long, as can untold numbers of other boxen around the globe.
>
That is true; on smaller systems the C2C overhead is not that high.
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 48b6f0ca13ac..e65028dcd6a6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7125,6 +7125,21 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >             asym_fits_cpu(task_util, util_min, util_max, target))
> >                 return target;
> >  
> > +       /*
> > +        * If the waker and the wakee are good friends to each other,
> > +        * putting them within the same SMT domain could reduce C2C
> > +        * overhead. SMT idle sibling should be preferred to wakee's
> > +        * previous CPU, because the latter could still have the risk of C2C
> > +        * overhead.
> > +        */
> > +       if (sched_feat(SIS_PAIR) && sched_smt_active() &&
> > +           current->last_wakee == p && p->last_wakee == current) {
> > +               i = select_idle_smt(p, smp_processor_id());
> > +
> > +               if ((unsigned int)i < nr_cpumask_bits)
> > +                       return i;
> > +       }
> > +
> >         /*
> >          * If the previous CPU is cache affine and idle, don't be stupid:
> >          */
> 
> Global scope solutions for non-global issues tend to not work out.  
> 
> Below is a sample of potential scaling wreckage for boxen that are NOT
> akin to the one you're watching turn caches into silicon based pudding.
> 
> Note the *_RR numbers.  Those poked me in the eye because they closely
> resemble pipe ping-pong, all fun and games with about as close to zero
> work other than scheduling as network-land can get, but for my box, SMT
> was the third best option of three.
> 
> You just can't beat idle core selection when it comes to getting work
> done, which is why SIS evolved to select cores first.
> 
There could be some corner cases where choosing an idle CPU within the local
core is better than selecting a new idle core. The tricky part is that SMT is
quite special: the siblings share the L2 cache, but they also compete for other
critical core resources. For short tasks that have a close relationship with
each other, putting them together on the local core (on a system with a high
CPU count) could sometimes bring a benefit: the short duration means the task
pair has less chance to compete for the instruction execution units shared by
the SMT siblings. But the short-duration threshold depends on the number of
CPUs in the LLC.
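
Purely as an illustration of that last point (this is not part of the RFC, and
all names and values below are made up), such a gate could scale its notion of
"short" with the LLC width, roughly like:

/*
 * Hypothetical sketch only -- not from this patch.  The idea is that
 * the duration below which SMT stacking is considered shrinks as the
 * LLC gets wider, because on a small LLC an idle core is relatively
 * cheap to reach anyway.
 */
#include <stdbool.h>

#define SHORT_DUR_BASE_NS	(500 * 1000)	/* made-up base: 0.5 ms */

static inline bool pair_is_short_duration(unsigned long waker_avg_ns,
					  unsigned long wakee_avg_ns,
					  unsigned int llc_nr_cpus)
{
	unsigned long thresh = SHORT_DUR_BASE_NS / llc_nr_cpus;

	/* both tasks must run briefly for SMT stacking to be worth it */
	return waker_avg_ns < thresh && wakee_avg_ns < thresh;
}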
> Your box and ilk need help that treats the disease and not the symptom,
> or barring that, help that precisely targets boxen having the disease.
> 
IMO this issue could be generic: the cost of C2C grows roughly as O(sqrt(n))
with the number of CPUs, so in theory the issue is easy to detect on a system
with a large LLC and SMT enabled.

As an example, I did not choose a super big system but a desktop i9-10980XE,
and launched 2 pairs of ping-pong tasks.

Each pair of tasks is bound to 1 dedicated core:
./context_switch1_processes -s 30 -t 2
average:956883

No CPU affinity for the tasks:
./context_switch1_processes -s 30 -t 2 -n
average:849209

We can see that waking up the wakee on the local core brings a benefit on this platform.
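
(For reference, the context_switch1 case is essentially a pipe ping-pong
between a task pair; the perf profile above shows pipe_read/pipe_write near the
top. The self-contained sketch below shows the pattern being exercised -- it is
not the will-it-scale source itself.)

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
	int ping[2], pong[2];
	char c = 0;
	pid_t pid;
	long i;

	if (pipe(ping) || pipe(pong)) {
		perror("pipe");
		return 1;
	}

	pid = fork();
	if (pid < 0) {
		perror("fork");
		return 1;
	}

	if (pid == 0) {
		/* child: block on the parent, then wake it back up */
		close(ping[1]);
		close(pong[0]);
		while (read(ping[0], &c, 1) == 1) {
			if (write(pong[1], &c, 1) != 1)
				break;
		}
		_exit(0);
	}

	close(ping[0]);
	close(pong[1]);

	/* parent: each iteration forces two wakeups, i.e. two switches */
	for (i = 0; i < 1000000; i++) {
		if (write(ping[1], &c, 1) != 1 ||
		    read(pong[0], &c, 1) != 1)
			break;
	}

	close(ping[1]);		/* EOF so the child can exit */
	wait(NULL);
	return 0;
}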

To make a comparison, I also launched the same test on my laptop, an i5-8300H
with 4 cores/8 CPUs. I did not see any difference when running 2 pairs of
will-it-scale, but I did notice an improvement when the wakees are woken up on
the local core with 4 pairs (I guess this is because the C2C reduction accumulates):

Each pair of tasks is bound to 1 dedicated core:
./context_switch1_processes -s 30 -t 4
average:731965

No CPU affinity for the tasks:
./context_switch1_processes -s 30 -t 4 -n
average:644337


thanks,
Chenyu

> 	-Mike
> 
> 10 seconds of 1 netperf client/server instance, no knobs twiddled.
> 
> TCP_SENDFILE-1  stacked    Avg:  65387
> TCP_SENDFILE-1  cross-smt  Avg:  65658
> TCP_SENDFILE-1  cross-core Avg:  96318
> 
> TCP_STREAM-1    stacked    Avg:  44322
> TCP_STREAM-1    cross-smt  Avg:  42390
> TCP_STREAM-1    cross-core Avg:  77850
> 
> TCP_MAERTS-1    stacked    Avg:  36636
> TCP_MAERTS-1    cross-smt  Avg:  42333
> TCP_MAERTS-1    cross-core Avg:  74122
> 
> UDP_STREAM-1    stacked    Avg:  52618
> UDP_STREAM-1    cross-smt  Avg:  55298
> UDP_STREAM-1    cross-core Avg:  97415
> 
> TCP_RR-1        stacked    Avg: 242606
> TCP_RR-1        cross-smt  Avg: 140863
> TCP_RR-1        cross-core Avg: 219400
> 
> UDP_RR-1        stacked    Avg: 282253
> UDP_RR-1        cross-smt  Avg: 202062
> UDP_RR-1        cross-core Avg: 288620
