linux-kernel - Re: [PATCH v3] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util

Message-ID: <20220514105544.GA20541@chenyu5-mobl1>
Date:   Sat, 14 May 2022 18:55:44 +0800
From:   Chen Yu <yu.c.chen@...el.com>
To:     K Prateek Nayak <kprateek.nayak@....com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Mel Gorman <mgorman@...e.de>,
        Yicong Yang <yangyicong@...ilicon.com>,
        Tim Chen <tim.c.chen@...el.com>,
        Chen Yu <yu.chen.surf@...il.com>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Barry Song <21cnbao@...il.com>,
        Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
        Len Brown <len.brown@...el.com>,
        Ben Segall <bsegall@...gle.com>,
        Aubrey Li <aubrey.li@...el.com>,
        Abel Wu <wuyun.abel@...edance.com>,
        Zhang Rui <rui.zhang@...el.com>, linux-kernel@...r.kernel.org,
        Daniel Bristot de Oliveira <bristot@...hat.com>
Subject: Re: [PATCH v3] sched/fair: Introduce SIS_UTIL to search idle CPU
 based on sum of util_avg

Hi Prateek,
On Fri, May 13, 2022 at 12:07:00PM +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> Sorry for the delay with analysis.
>
Thanks very much for the test and analysis in detail.
> 
> Following are the results from dual socket Zen3 platform (2 x 64C/128T) running with
> various NPS configurations:
May I know whether, in all NPS modes, every LLC domain has 16 CPUs?
> 
> Following is the NUMA configuration for each NPS mode on the system:
> 
> NPS1: Each socket is a NUMA node.
>     Total 2 NUMA nodes in the dual socket machine.
> 
>     Node 0: 0-63,   128-191
>     Node 1: 64-127, 192-255
> 
> NPS2: Each socket is further logically divided into 2 NUMA regions.
>     Total 4 NUMA nodes exist over 2 sockets.
>    
>     Node 0: 0-31,   128-159
>     Node 1: 32-63,  160-191
>     Node 2: 64-95,  192-223
>     Node 3: 96-127, 224-255
> 
> NPS4: Each socket is logically divided into 4 NUMA regions.
>     Total 8 NUMA nodes exist over 2 sockets.
>    
>     Node 0: 0-15,    128-143
>     Node 1: 16-31,   144-159
>     Node 2: 32-47,   160-175
>     Node 3: 48-63,   176-191
>     Node 4: 64-79,   192-207
>     Node 5: 80-95,   208-223
>     Node 6: 96-111,  224-239
>     Node 7: 112-127, 240-255
> 
> Kernel versions:
> - tip:      5.18-rc1 tip sched/core
> - SIS_UTIL:    5.18-rc1 tip sched/core + this patch
> 
> When we began testing, tip was at:
> 
> commit: a658353167bf "sched/fair: Revise comment about lb decision matrix"
> 
> Following are the results from the benchmark:
> 
> * - Data points of concern
> 
> ~~~~~~~~~
> hackbench
> ~~~~~~~~~
> 
> NPS1
> 
> Test:                   tip                     SIS_UTIL
>  1-groups:         4.64 (0.00 pct)         4.70 (-1.29 pct)
>  2-groups:         5.38 (0.00 pct)         5.45 (-1.30 pct)
>  4-groups:         6.15 (0.00 pct)         6.10 (0.81 pct)
>  8-groups:         7.42 (0.00 pct)         7.42 (0.00 pct)
> 16-groups:        10.70 (0.00 pct)        11.69 (-9.25 pct)  *
> 
> NPS2
> 
> Test:                   tip                     SIS_UTIL
>  1-groups:         4.70 (0.00 pct)         4.70 (0.00 pct)
>  2-groups:         5.45 (0.00 pct)         5.46 (-0.18 pct)
>  4-groups:         6.13 (0.00 pct)         6.05 (1.30 pct)
>  8-groups:         7.30 (0.00 pct)         7.05 (3.42 pct)
> 16-groups:        10.30 (0.00 pct)        10.12 (1.74 pct)
> 
> NPS4
> 
> Test:                   tip                     SIS_UTIL
>  1-groups:         4.60 (0.00 pct)         4.75 (-3.26 pct)  *
>  2-groups:         5.41 (0.00 pct)         5.42 (-0.18 pct)
>  4-groups:         6.12 (0.00 pct)         6.00 (1.96 pct)
>  8-groups:         7.22 (0.00 pct)         7.10 (1.66 pct)
> 16-groups:        10.24 (0.00 pct)        10.11 (1.26 pct)
> 
> ~~~~~~~~
> schbench
> ~~~~~~~~
> 
> NPS 1
> 
> #workers:   tip                     SIS_UTIL
>   1:      29.00 (0.00 pct)        21.00 (27.58 pct)
>   2:      28.00 (0.00 pct)        28.00 (0.00 pct)
>   4:      31.50 (0.00 pct)        31.00 (1.58 pct)
>   8:      42.00 (0.00 pct)        39.00 (7.14 pct)
>  16:      56.50 (0.00 pct)        54.50 (3.53 pct)
>  32:      94.50 (0.00 pct)        94.00 (0.52 pct)
>  64:     176.00 (0.00 pct)       175.00 (0.56 pct)
> 128:     404.00 (0.00 pct)       394.00 (2.47 pct)
> 256:     869.00 (0.00 pct)       863.00 (0.69 pct)
> 512:     58432.00 (0.00 pct)     55424.00 (5.14 pct)
> 
> NPS2
> 
> #workers:      tip                     SIS_UTIL
>   1:      26.50 (0.00 pct)        25.00 (5.66 pct)
>   2:      26.50 (0.00 pct)        25.50 (3.77 pct)
>   4:      34.50 (0.00 pct)        34.00 (1.44 pct)
>   8:      45.00 (0.00 pct)        46.00 (-2.22 pct)
>  16:      56.50 (0.00 pct)        60.50 (-7.07 pct)        *
>  32:      95.50 (0.00 pct)        93.00 (2.61 pct)
>  64:     179.00 (0.00 pct)       179.00 (0.00 pct)
> 128:     369.00 (0.00 pct)       376.00 (-1.89 pct)
> 256:     898.00 (0.00 pct)       903.00 (-0.55 pct)
> 512:     56256.00 (0.00 pct)     57088.00 (-1.47 pct)
> 
> NPS4
> 
> #workers:    tip                     SIS_UTIL
>   1:      25.00 (0.00 pct)        21.00 (16.00 pct)
>   2:      28.00 (0.00 pct)        24.00 (14.28 pct)
>   4:      29.50 (0.00 pct)        29.50 (0.00 pct)
>   8:      41.00 (0.00 pct)        37.50 (8.53 pct)
>  16:      65.50 (0.00 pct)        64.00 (2.29 pct)
>  32:      93.00 (0.00 pct)        94.50 (-1.61 pct)
>  64:     170.50 (0.00 pct)       175.50 (-2.93 pct)
> 128:     377.00 (0.00 pct)       368.50 (2.25 pct)
> 256:     867.00 (0.00 pct)       902.00 (-4.03 pct)
> 512:     58048.00 (0.00 pct)     55488.00 (4.41 pct)
> 
> ~~~~~~
> tbench
> ~~~~~~
> 
> NPS 1
> 
> Clients:     tip                     SIS_UTIL
>     1    443.31 (0.00 pct)       456.19 (2.90 pct)
>     2    877.32 (0.00 pct)       875.24 (-0.23 pct)
>     4    1665.11 (0.00 pct)      1647.31 (-1.06 pct)
>     8    3016.68 (0.00 pct)      2993.23 (-0.77 pct)
>    16    5374.30 (0.00 pct)      5246.93 (-2.36 pct)
>    32    8763.86 (0.00 pct)      7878.18 (-10.10 pct)     *
>    64    15786.93 (0.00 pct)     12958.47 (-17.91 pct)    *
>   128    26826.08 (0.00 pct)     26741.14 (-0.31 pct)
>   256    24207.35 (0.00 pct)     52041.89 (114.98 pct)
>   512    51740.58 (0.00 pct)     52084.44 (0.66 pct)
>  1024    51177.82 (0.00 pct)     53126.29 (3.80 pct)
> 
> NPS 2
> 
> Clients:     tip                     SIS_UTIL
>     1    449.49 (0.00 pct)       447.96 (-0.34 pct)
>     2    867.28 (0.00 pct)       869.52 (0.25 pct)
>     4    1643.60 (0.00 pct)      1625.91 (-1.07 pct)
>     8    3047.35 (0.00 pct)      2952.82 (-3.10 pct)
>    16    5340.77 (0.00 pct)      5251.41 (-1.67 pct)
>    32    10536.85 (0.00 pct)     8843.49 (-16.07 pct)     *
>    64    16543.23 (0.00 pct)     14265.35 (-13.76 pct)    *
>   128    26400.40 (0.00 pct)     25595.42 (-3.04 pct)
>   256    23436.75 (0.00 pct)     47090.03 (100.92 pct)
>   512    50902.60 (0.00 pct)     50036.58 (-1.70 pct)
>  1024    50216.10 (0.00 pct)     50639.74 (0.84 pct)
> 
> NPS 4
> 
> Clients:     tip                     SIS_UTIL
>     1    443.82 (0.00 pct)       459.93 (3.62 pct)
>     2    849.14 (0.00 pct)       882.17 (3.88 pct)
>     4    1603.26 (0.00 pct)      1629.64 (1.64 pct)
>     8    2972.37 (0.00 pct)      3003.09 (1.03 pct)
>    16    5277.13 (0.00 pct)      5234.07 (-0.81 pct)
>    32    9744.73 (0.00 pct)      9347.90 (-4.07 pct)      *
>    64    15854.80 (0.00 pct)     14180.27 (-10.56 pct)    *
>   128    26116.97 (0.00 pct)     24597.45 (-5.81 pct)     *
>   256    22403.25 (0.00 pct)     47385.09 (111.50 pct)
>   512    48317.20 (0.00 pct)     49781.02 (3.02 pct)
>  1024    50445.41 (0.00 pct)     51607.53 (2.30 pct)
> 
> ~~~~~~
> Stream
> ~~~~~~
> 
> - 10 runs
> 
> NPS1
> 
>               tip                     SIS_UTIL
>  Copy:   189113.11 (0.00 pct)    188490.27 (-0.32 pct)
> Scale:   201190.61 (0.00 pct)    204526.15 (1.65 pct)
>   Add:   232654.21 (0.00 pct)    234948.01 (0.98 pct)
> Triad:   226583.57 (0.00 pct)    228844.43 (0.99 pct)
> 
> NPS2
> 
> Test:         tip                     SIS_UTIL
>  Copy:   155347.14 (0.00 pct)    169386.29 (9.03 pct)
> Scale:   191701.53 (0.00 pct)    196110.51 (2.29 pct)
>   Add:   210013.97 (0.00 pct)    221088.45 (5.27 pct)
> Triad:   207602.00 (0.00 pct)    218072.52 (5.04 pct)
> 
> NPS4
> 
> Test:         tip                     SIS_UTIL
>  Copy:   136421.15 (0.00 pct)    140894.11 (3.27 pct)
> Scale:   191217.59 (0.00 pct)    190554.17 (-0.34 pct)
>   Add:   189229.52 (0.00 pct)    190871.88 (0.86 pct)
> Triad:   188052.99 (0.00 pct)    188417.63 (0.19 pct)
> 
> - 100 runs
> 
> NPS1
> 
> Test:       tip                     SIS_UTIL
>  Copy:   244693.32 (0.00 pct)    232328.05 (-5.05 pct)
> Scale:   221874.99 (0.00 pct)    216858.39 (-2.26 pct)
>   Add:   268363.89 (0.00 pct)    265449.16 (-1.08 pct)
> Triad:   260945.24 (0.00 pct)    252240.56 (-3.33 pct)
> 
> NPS2
> 
> Test:       tip                     SIS_UTIL
>  Copy:   211262.00 (0.00 pct)    225240.03 (6.61 pct)
> Scale:   222493.34 (0.00 pct)    219094.65 (-1.52 pct)
>   Add:   280277.17 (0.00 pct)    275677.73 (-1.64 pct)
> Triad:   265860.49 (0.00 pct)    262584.22 (-1.23 pct)
> 
> NPS4
> 
> Test:       tip                     SIS_UTIL
>  Copy:   250171.40 (0.00 pct)    230983.60 (-7.66 pct)
> Scale:   222293.56 (0.00 pct)    215984.34 (-2.83 pct)
>   Add:   279222.16 (0.00 pct)    270402.64 (-3.15 pct)
> Triad:   262013.92 (0.00 pct)    254820.60 (-2.74 pct)
> 
> ~~~~~~~~~~~~
> ycsb-mongodb
> ~~~~~~~~~~~~
> 
> NPS1
> 
> sched-tip:      303718.33 (var: 1.31)
> SIS_UTIL:       303529.33 (var: 0.67)    (-0.06%)
> 
> NPS2
> 
> sched-tip:      304536.33 (var: 2.46)
> SIS_UTIL:       303730.33 (var: 1.57)    (-0.26%)
> 
> NPS4
> 
> sched-tip:      301192.33 (var: 1.81)
> SIS_UTIL:       300101.33 (var: 0.35)   (-0.36%)
> 
> ~~~~~~~~~~~~~~~~~~
> 
> Notes:
> 
> - There seems to be some noticeable regression for hackbench
>   with 16 groups in NPS1 mode.
Did hackbench use the default fd number (20) in every group? If so, there are
16 * 20 * 2 = 640 threads in the system, which I would expect to be overloaded.
In that case the search depth might be 4 with SIS_PROP and 0 with SIS_UTIL, and
SIS_PROP is very likely to fail to find an idle CPU after searching 4 CPUs. So
in theory there should not be much performance difference with or without the
patch applied. But if the fd number is set to a smaller value, the regression
could be explained as you mentioned: SIS_PROP searches more aggressively.
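As a rough sketch of that arithmetic: this assumes each hackbench fd contributes one sender and one receiver task, and that the test machine exposes 256 logical CPUs (2 x 64C/128T). The helper names are hypothetical and only model the estimate, not hackbench itself.

```c
#include <stdbool.h>

/*
 * Estimated number of tasks hackbench spawns.
 * Assumption: every group creates nr_fds senders and nr_fds receivers.
 */
static unsigned int hackbench_tasks(unsigned int groups, unsigned int nr_fds)
{
    return groups * nr_fds * 2;
}

/*
 * Crude overload check: more tasks than logical CPUs.
 * This ignores blocking and wakeup patterns, so it is only a first guess.
 */
static bool is_overloaded(unsigned int tasks, unsigned int nr_cpus)
{
    return tasks > nr_cpus;
}
```

With groups = 16 and nr_fds = 20 this gives 640 tasks against 256 CPUs, while a single group gives only 40 tasks, which is why the fd number matters for the interpretation.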
> - There seems to be regression in tbench for case with number
>   of workers in range 32-128 (12.5% loaded to 50% loaded)
> - tbench reaches saturation early when system is fully loaded
> 
> This probably shows that the strategy in the initial v1 RFC
> seems to work better with our LLC where number of CPUs per LLC
> is low compared to systems with unified LLC. Given this is
> showing great results for unified LLC, maybe SIS_PROP and SIS_UTIL
> can be enabled based on the size of LLC.
> 
Yes, SIS_PROP searches more aggressively, but we are attempting to replace
SIS_PROP with a more accurate policy.
> > [..snip..]
> >
> > [3]
> > Prateek mentioned that we should scan aggressively in an LLC domain
> > with 16 CPUs. Because the cost to search for an idle one among 16 CPUs is
> > negligible. The current patch aims to propose a generic solution and only
> > considers the util_avg. A follow-up change could enhance the scan policy
> > to adjust the scan_percent according to the CPU number in LLC.
> 
> Following are some additional numbers I would like to share comparing SIS_PROP and
> SIS_UTIL:
> 
Nice analysis.
> o Hackbench with 1 group
> 
> With 1 group, following are the chances of SIS_PROP
> and SIS_UTIL finding an idle CPU when an idle CPU
> exists in LLC:
> 
> +-----------------+---------------------------+---------------------------+--------+
> | Idle CPU in LLC | SIS_PROP able to find CPU | SIS_UTIL able to find CPU | Count  |
> +-----------------+---------------------------+---------------------------+--------+
> |        1        |             0             |             0             | 66444  |
> |        1        |             0             |             1             | 34153  |
> |        1        |             1             |             0             | 57204  |
> |        1        |             1             |             1             | 119263 |
> +-----------------+---------------------------+---------------------------+--------+
> 
So SIS_PROP searches more, and gets a higher chance of finding an idle CPU in an LLC with
16 CPUs.
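To put numbers on that, the four rows of the table can be folded into a hit rate per policy. The struct and helpers below are hypothetical names used only for this sketch. It assumes the four rows are disjoint outcomes of the same set of wakeups.

```c
/* Outcome counts from one run, split by which policy found an idle CPU. */
struct sis_outcomes {
    unsigned long neither;    /* SIS_PROP = 0, SIS_UTIL = 0 */
    unsigned long util_only;  /* SIS_PROP = 0, SIS_UTIL = 1 */
    unsigned long prop_only;  /* SIS_PROP = 1, SIS_UTIL = 0 */
    unsigned long both;       /* SIS_PROP = 1, SIS_UTIL = 1 */
};

/* Total number of wakeups where an idle CPU existed in the LLC. */
static unsigned long sis_total(const struct sis_outcomes *o)
{
    return o->neither + o->util_only + o->prop_only + o->both;
}

/* Percentage of those wakeups where SIS_PROP found the idle CPU. */
static double sis_prop_hit_pct(const struct sis_outcomes *o)
{
    return 100.0 * (double)(o->prop_only + o->both) / (double)sis_total(o);
}

/* Percentage of those wakeups where SIS_UTIL found the idle CPU. */
static double sis_util_hit_pct(const struct sis_outcomes *o)
{
    return 100.0 * (double)(o->util_only + o->both) / (double)sis_total(o);
}
```

Fed with the 1-group counts (66444, 34153, 57204, 119263), this works out to roughly 64% for SIS_PROP against roughly 55% for SIS_UTIL.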
> SIS_PROP vs no SIS_PROP CPU search stats:
> 
> Total time without SIS_PROP: 90653653
> Total time with SIS_PROP: 53558942 (-40.92 pct)
> Total time saved: 37094711
> 
What does no SIS_PROP mean? Is it with SIS_PROP disabled and
SIS_UTIL enabled? Or with both SIS_PROP and SIS_UTIL disabled?
If it is the latter, is there any performance difference between
the two?
> Following are number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size):
> 
> +--------------+-------+
> | CPU Searched | Count |
> +--------------+-------+
> |      0       | 10520 |
> |      1       |  7770 |
> |      2       | 11976 |
> |      3       | 17554 |
> |      4       | 13932 |
> |      5       | 15051 |
> |      6       |  8398 |
> |      7       |  4544 |
> |      8       |  3712 |
> |      9       |  2337 |
> |      10      |  4541 |
> |      11      |  1947 |
> |      12      |  3846 |
> |      13      |  3645 |
> |      14      |  2686 |
> |      15      |  8390 |
> |      16      | 26157 |
> +--------------+-------+
> 
> - SIS_UTIL might be bailing out too early in some of these cases.
>
Right. 
> o Hackbench with 16 groups
> 
> the success rate looks as follows:
> 
> +-----------------+---------------------------+---------------------------+---------+
> | Idle CPU in LLC | SIS_PROP able to find CPU | SIS_UTIL able to find CPU |  Count  |
> +-----------------+---------------------------+---------------------------+---------+
> |        1        |             0             |             0             | 1313745 |
> |        1        |             0             |             1             |  694132 |
> |        1        |             1             |             0             | 2888450 |
> |        1        |             1             |             1             | 5343065 |
> +-----------------+---------------------------+---------------------------+---------+
> 
> SIS_PROP vs no SIS_PROP CPU search stats:
> 
> Total time without SIS_PROP: 5227299388
> Total time with SIS_PROP: 3866575188 (-26.03 pct)
> Total time saved: 1360724200
> 
> Following are number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size):
> 
> +--------------+---------+
> | CPU Searched |  Count  |
> +--------------+---------+
> |      0       |  150351 |
> |      1       |  105116 |
> |      2       |  214291 |
> |      3       |  440053 |
> |      4       |  914116 |
> |      5       | 1757984 |
> |      6       | 2410484 |
> |      7       | 1867668 |
> |      8       |  379888 |
> |      9       |  84055  |
> |      10      |  55389  |
> |      11      |  26795  |
> |      12      |  43113  |
> |      13      |  24579  |
> |      14      |  32896  |
> |      15      |  70059  |
> |      16      |  150858 |
> +--------------+---------+
> 
> - SIS_UTIL might be bailing out too early in most of these cases
> 
It might be interesting to see what the current sum of util_avg is. This suggests that,
even if util_avg is a little high, it might still be worthwhile to search more CPUs.
> o tbench with 256 workers
> 
> For tbench with 256 threads, SIS_UTIL works great as we have drastically cut down the number
> of CPUs to search.
> 
> SIS_PROP vs no SIS_PROP CPU search stats:
> 
> Total time without SIS_PROP: 64004752959
> Total time with SIS_PROP: 34695004390 (-45.79 pct)
> Total time saved: 29309748569
> 
> Following are number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size):
> 
> +--------------+----------+
> | CPU Searched |  Count   |
> +--------------+----------+
> |      0       |  500077  |
> |      1       |  543865  |
> |      2       | 4257684  |
> |      3       | 27457498 |
> |      4       | 40208673 |
> |      5       | 3264358  |
> |      6       |  191631  |
> |      7       |  24658   |
> |      8       |   2469   |
> |      9       |   1374   |
> |      10      |   2008   |
> |      11      |   1300   |
> |      12      |   1226   |
> |      13      |   1179   |
> |      14      |   1631   |
> |      15      |  11678   |
> |      16      |   7793   |
> +--------------+----------+
> 
> - This is where SIS_UTIL shines for tbench case with 256 workers as it is effective
>   at restricting search space well.
> 
> o Observations
> 
> SIS_PROP seems to have a higher chance of finding an idle CPU compared to SIS_UTIL
> in case of hackbench with 16-group. The gap between SIS_PROP and SIS_UTIL is wider
> with 16 groups compared to 1 group.
> Also SIS_PROP is more aggressive at saving time for 1-group compared to the
> case with 16-groups.
> 
> The bailout from SIS_UTIL is fruitful for tbench with 256 workers leading to massive
> performance gain in a fully loaded system.
> 
> Note: There might be some inaccuracies for the numbers presented for metrics that
> directly compare SIS_PROP and SIS_UTIL as both SIS_PROP and SIS_UTIL were enabled
> when gathering these data points and the results from SIS_PROP were returned from
> search_idle_cpu().
Do you mean the 'CPU Searched' calculated by SIS_PROP was collected with both SIS_UTIL
and SIS_PROP enabled?
> All the numbers for the above analysis were gathered in NPS1 mode.
> 
I'm thinking of taking the number of CPUs in the LLC (nr_llc) into account when adjusting
the search depth, something like:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dd52fc5a034b..39b914599dce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9302,6 +9302,9 @@ static inline void update_idle_cpu_scan(struct lb_env *env,
        llc_util_pct = (sum_util * 100) / (nr_llc * SCHED_CAPACITY_SCALE);
        nr_scan = (100 - (llc_util_pct * llc_util_pct / 72)) * nr_llc / 100;
        nr_scan = max(nr_scan, 0);
+       if (nr_llc <= 16 && nr_scan)
+               nr_scan = nr_llc;
+
        WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
 }
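For experimenting outside the kernel, the arithmetic in this hunk can be modelled in user space. This is only a sketch: nr_idle_scan() and the small_llc_boost flag are hypothetical names, SCHED_CAPACITY_SCALE is assumed to be 1024, plain int arithmetic is assumed, and the nr_llc <= 0 guard is added here just to keep the model safe.

```c
#define SCHED_CAPACITY_SCALE 1024  /* assumed value for this model */

/*
 * User-space model of the nr_scan computation shown in the diff.
 * sum_util:        sum of util_avg over the CPUs in the LLC
 * nr_llc:          number of CPUs in the LLC
 * small_llc_boost: non-zero to apply the proposed "nr_llc <= 16" check
 */
static int nr_idle_scan(unsigned long sum_util, int nr_llc, int small_llc_boost)
{
    int llc_util_pct, nr_scan;

    if (nr_llc <= 0)  /* guard added for the model only */
        return 0;

    llc_util_pct = (int)((sum_util * 100) /
                         ((unsigned long)nr_llc * SCHED_CAPACITY_SCALE));
    nr_scan = (100 - (llc_util_pct * llc_util_pct / 72)) * nr_llc / 100;
    if (nr_scan < 0)
        nr_scan = 0;

    /* Proposed change: scan the whole LLC when it is small and not saturated. */
    if (small_llc_boost && nr_llc <= 16 && nr_scan)
        nr_scan = nr_llc;

    return nr_scan;
}
```

In this model a 16-CPU LLC at 50% utilization (sum_util = 8192) scans 10 CPUs without the extra check and all 16 with it, while a fully utilized LLC still gets a depth of 0 in both cases.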

I'll offline CPUs to get 16 CPUs per LLC, and check how hackbench behaves.

thanks,
Chenyu