Date:   Mon, 1 May 2023 23:52:47 +0800
From:   Chen Yu <yu.c.chen@...el.com>
To:     Peter Zijlstra <peterz@...radead.org>
CC:     Vincent Guittot <vincent.guittot@...aro.org>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Tim Chen <tim.c.chen@...el.com>,
        "Dietmar Eggemann" <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>,
        K Prateek Nayak <kprateek.nayak@....com>,
        Abel Wu <wuyun.abel@...edance.com>,
        Yicong Yang <yangyicong@...ilicon.com>,
        "Gautham R . Shenoy" <gautham.shenoy@....com>,
        Honglei Wang <wanghonglei@...ichuxing.com>,
        Len Brown <len.brown@...el.com>,
        Chen Yu <yu.chen.surf@...il.com>,
        Tianchen Ding <dtcccc@...ux.alibaba.com>,
        "Joel Fernandes" <joel@...lfernandes.org>,
        Josh Don <joshdon@...gle.com>,
        "kernel test robot" <yujie.liu@...el.com>,
        Arjan Van De Ven <arjan.van.de.ven@...el.com>,
        Aaron Lu <aaron.lu@...el.com>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up
 short task on current CPU

Hi Peter,
On 2023-05-01 at 15:48:27 +0200, Peter Zijlstra wrote:
> On Sat, Apr 29, 2023 at 07:16:56AM +0800, Chen Yu wrote:
> > netperf
> > =======
> > case                    load            baseline(std%)  compare%( std%)
> > TCP_RR                  56-threads       1.00 (  1.96)  +15.23 (  4.67)
> > TCP_RR                  112-threads      1.00 (  1.84)  +88.83 (  4.37)
> > TCP_RR                  168-threads      1.00 (  0.41)  +475.45 (  4.45)
> > TCP_RR                  224-threads      1.00 (  0.62)  +806.85 (  3.67)
> > TCP_RR                  280-threads      1.00 ( 65.80)  +162.66 ( 10.26)
> > TCP_RR                  336-threads      1.00 ( 17.30)   -0.19 ( 19.07)
> > TCP_RR                  392-threads      1.00 ( 26.88)   +3.38 ( 28.91)
> > TCP_RR                  448-threads      1.00 ( 36.43)   -0.26 ( 33.72)
> > UDP_RR                  56-threads       1.00 (  7.91)   +3.77 ( 17.48)
> > UDP_RR                  112-threads      1.00 (  2.72)  -15.02 ( 10.78)
> > UDP_RR                  168-threads      1.00 (  8.86)  +131.77 ( 13.30)
> > UDP_RR                  224-threads      1.00 (  9.54)  +178.73 ( 16.75)
> > UDP_RR                  280-threads      1.00 ( 15.40)  +189.69 ( 19.36)
> > UDP_RR                  336-threads      1.00 ( 24.09)   +0.54 ( 22.28)
> > UDP_RR                  392-threads      1.00 ( 39.63)   -3.90 ( 33.77)
> > UDP_RR                  448-threads      1.00 ( 43.57)   +1.57 ( 40.43)
> > 
> > tbench
> > ======
> > case                    load            baseline(std%)  compare%( std%)
> > loopback                56-threads       1.00 (  0.50)  +10.78 (  0.52)
> > loopback                112-threads      1.00 (  0.19)   +2.73 (  0.08)
> > loopback                168-threads      1.00 (  0.09)  +173.72 (  0.47)
> > loopback                224-threads      1.00 (  0.20)   -2.13 (  0.42)
> > loopback                280-threads      1.00 (  0.06)   -0.77 (  0.15)
> > loopback                336-threads      1.00 (  0.14)   -0.08 (  0.08)
> > loopback                392-threads      1.00 (  0.17)   -0.27 (  0.86)
> > loopback                448-threads      1.00 (  0.37)   +0.32 (  0.02)
> 
> So,... I've been poking around with this a bit today and I'm not seeing
> it. On my ancient IVB-EP (2*10*2) with the code as in
> queue/sched/core I get:
> 
> netperf           NO_WA_WEIGHT               NO_SIS_CURRENT
>                                  NO_WA_BIAS             SIS_CURRENT
> -------------------------------------------------------------------
> TCP_SENDFILE-1  : Avg: 40495.7    41899.7    42001      40783.4
> TCP_SENDFILE-10 : Avg: 37218.6    37200.1    37065.1    36604.4
> TCP_SENDFILE-20 : Avg: 21495.1    21516.6    21004.4    21356.9
> TCP_SENDFILE-40 : Avg: 6947.24    7917.64    7079.93    7231.3
> TCP_SENDFILE-80 : Avg: 4081.91    3572.48    3582.98    3615.85
> TCP_STREAM-1    : Avg: 37078.1    34469.4    37134.5    35095.4
> TCP_STREAM-10   : Avg: 31532.1    31265.8    31260.7    31588.1
> TCP_STREAM-20   : Avg: 17848      17914.9    17996.6    17937.4
> TCP_STREAM-40   : Avg: 7844.3     7201.65    7710.4     7790.62
> TCP_STREAM-80   : Avg: 2518.38    2932.74    2601.51    2903.89
> TCP_RR-1        : Avg: 84347.1    81056.2    81167.8    83541.3
> TCP_RR-10       : Avg: 71539.1    72099.5    71123.2    69447.9
> TCP_RR-20       : Avg: 51053.3    50952.4    50905.4    52157.2
> TCP_RR-40       : Avg: 46370.9    46477.5    46289.2    46350.7
> TCP_RR-80       : Avg: 21515.2    22497.9    22024.4    22229.2
> UDP_RR-1        : Avg: 96933      100076     95997.2    96553.3
> UDP_RR-10       : Avg: 83937.3    83054.3    83878.5    78998.6
> UDP_RR-20       : Avg: 61974      61897.5    61838.8    62926
> UDP_RR-40       : Avg: 56708.6    57053.9    56456.1    57115.2
> UDP_RR-80       : Avg: 26950      27895.8    27635.2    27784.8
> UDP_STREAM-1    : Avg: 52808.3    55296.8    52808.2    51908.6
> UDP_STREAM-10   : Avg: 45810      42944.1    43115      43561.2
> UDP_STREAM-20   : Avg: 19212.7    17572.9    18798.7    20066
> UDP_STREAM-40   : Avg: 13105.1    13096.9    13070.5    13110.2
> UDP_STREAM-80   : Avg: 6372.57    6367.96    6248.86    6413.09
> 
> 
> tbench
> 
> NO_WA_WEIGHT, NO_WA_BIAS, NO_SIS_CURRENT
> 
> Throughput  626.57 MB/sec   2 clients   2 procs  max_latency=0.095 ms
> Throughput 1316.08 MB/sec   5 clients   5 procs  max_latency=0.106 ms
> Throughput 1905.19 MB/sec  10 clients  10 procs  max_latency=0.161 ms
> Throughput 2428.05 MB/sec  20 clients  20 procs  max_latency=0.284 ms
> Throughput 2323.16 MB/sec  40 clients  40 procs  max_latency=0.381 ms
> Throughput 2229.93 MB/sec  80 clients  80 procs  max_latency=0.873 ms
> 
> WA_WEIGHT, NO_WA_BIAS, NO_SIS_CURRENT
> 
> Throughput  575.04 MB/sec   2 clients   2 procs  max_latency=0.093 ms
> Throughput 1285.37 MB/sec   5 clients   5 procs  max_latency=0.122 ms
> Throughput 1916.10 MB/sec  10 clients  10 procs  max_latency=0.150 ms
> Throughput 2422.54 MB/sec  20 clients  20 procs  max_latency=0.292 ms
> Throughput 2361.57 MB/sec  40 clients  40 procs  max_latency=0.448 ms
> Throughput 2479.70 MB/sec  80 clients  80 procs  max_latency=1.249 ms
> 
> WA_WEIGHT, WA_BIAS, NO_SIS_CURRENT (aka, mainline)
> 
> Throughput  649.46 MB/sec   2 clients   2 procs  max_latency=0.092 ms
> Throughput 1370.93 MB/sec   5 clients   5 procs  max_latency=0.140 ms
> Throughput 1904.14 MB/sec  10 clients  10 procs  max_latency=0.470 ms
> Throughput 2406.15 MB/sec  20 clients  20 procs  max_latency=0.276 ms
> Throughput 2419.40 MB/sec  40 clients  40 procs  max_latency=0.414 ms
> Throughput 2426.00 MB/sec  80 clients  80 procs  max_latency=1.366 ms
> 
> WA_WEIGHT, WA_BIAS, SIS_CURRENT (aka, with patches on)
> 
> Throughput  646.55 MB/sec   2 clients   2 procs  max_latency=0.104 ms
> Throughput 1361.06 MB/sec   5 clients   5 procs  max_latency=0.100 ms
> Throughput 1889.82 MB/sec  10 clients  10 procs  max_latency=0.154 ms
> Throughput 2406.57 MB/sec  20 clients  20 procs  max_latency=3.667 ms
> Throughput 2318.00 MB/sec  40 clients  40 procs  max_latency=0.390 ms
> Throughput 2384.85 MB/sec  80 clients  80 procs  max_latency=1.371 ms
> 
> 
> So what's going on here? I don't see anything exciting happening at the
> 40 mark. At the same time, I can't seem to reproduce Mike's latency pile
> up either :/
> 
Thank you very much for trying this patch. This patch was found to mainly
benefit systems with a large number of CPUs in one LLC. Previously I tested
it on Sapphire Rapids (2x56C/224T) and Ice Lake Server (2x32C/128T)[1], and
it showed a benefit on both. The benefit seems to come from:
1. reducing waker stacking among the many CPUs within one LLC
2. reducing the C2C overhead within one LLC
As a comparison, Prateek tested this patch on a Zen3 platform, which has
16 threads per LLC, smaller than Sapphire Rapids and Ice Lake Server. He
did not observe much difference with this patch applied, only some limited
improvement on tbench and SPEC. So far I have not received any performance
difference reports from LKP on desktop test boxes. Let me queue the full
test on some desktop machines to confirm whether this change has any
impact on them.

[1] https://lore.kernel.org/lkml/202211021600.ceb04ba9-yujie.liu@intel.com/
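
To make the mechanism above a bit more concrete: the idea is to track an
exponentially weighted average of each task's run duration, and at wakeup
time, if both the waker and the wakee look "short" and the waking CPU has
no other runnable task, place the wakee on the current CPU rather than
scanning the LLC for an idle CPU. The snippet below is only a stand-alone
illustration of that idea; the names, the 1/8 weighting and the threshold
are made up for the example and are not the exact code in the patch:

#include <stdbool.h>
#include <stdio.h>

#define SHORT_TASK_THRESHOLD_NS 750000ULL	/* made-up threshold for this demo */

struct task_stats {
	unsigned long long dur_avg_ns;		/* EWMA of recent run durations */
};

/* Update the average when a task stops running (e.g. at dequeue time). */
static void update_dur_avg(struct task_stats *ts, unsigned long long ran_ns)
{
	/* 1/8 weight for the newest sample; the real patch may decay differently */
	ts->dur_avg_ns = ts->dur_avg_ns - (ts->dur_avg_ns >> 3) + (ran_ns >> 3);
}

static bool is_short_task(const struct task_stats *ts)
{
	return ts->dur_avg_ns < SHORT_TASK_THRESHOLD_NS;
}

/*
 * Wakeup-time decision: only when both waker and wakee look short and the
 * current CPU has nothing else runnable besides the waker do we place the
 * wakee on the current CPU instead of scanning the LLC for an idle one.
 */
static bool wake_on_current_cpu(const struct task_stats *waker,
				const struct task_stats *wakee,
				unsigned int cpu_nr_running)
{
	return cpu_nr_running <= 1 &&
	       is_short_task(waker) && is_short_task(wakee);
}

int main(void)
{
	struct task_stats waker = { .dur_avg_ns = 200000 };	/* ~0.2 ms */
	struct task_stats wakee = { .dur_avg_ns = 100000 };	/* ~0.1 ms */

	update_dur_avg(&waker, 150000);
	printf("wake locally: %s\n",
	       wake_on_current_cpu(&waker, &wakee, 1) ? "yes" : "no");
	return 0;
}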

thanks,
Chenyu


The original symptom I found was that there was quite a lot of idle time
(up to 30%) when running the will-it-scale context-switch test with the
same number of workers as online CPUs. Waking up the wakee locally reduces
that race condition and the C2C overhead within one LLC, which is more
severe on a system with a large number of CPUs in one LLC.
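
The workload pattern there is essentially pairs of tasks ping-ponging on a
pipe, where every wakeup has a single short-running wakee. A minimal
stand-alone reproducer of that pattern (not the will-it-scale harness
itself) would look roughly like this:

#include <stdio.h>
#include <unistd.h>

/*
 * Minimal ping-pong pair: parent and child block on each other through two
 * pipes, so every wakeup has exactly one short-running wakee.  This mimics
 * the pattern of will-it-scale's context-switch test; the real benchmark
 * runs one such pair per CPU and counts iterations per second.
 */
int main(void)
{
	int ping[2], pong[2];
	char c = 0;

	if (pipe(ping) || pipe(pong)) {
		perror("pipe");
		return 1;
	}

	if (fork() == 0) {				/* child: echo bytes back */
		for (;;) {
			if (read(ping[0], &c, 1) != 1 ||
			    write(pong[1], &c, 1) != 1)
				break;
		}
		_exit(0);
	}

	for (unsigned long i = 0; i < 1000000; i++) {	/* parent: drive the loop */
		if (write(ping[1], &c, 1) != 1 ||
		    read(pong[0], &c, 1) != 1)
			break;
	}
	return 0;
}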
