linux-kernel - Re: [PATCH 00/10] steal tasks to improve CPU utilization

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <DM6PR18MB246096BE8C516A5A5D125932D98E0@DM6PR18MB2460.namprd18.prod.outlook.com>
Date:   Fri, 4 Jan 2019 13:37:17 +0000
From:   Shijith Thotton <sthotton@...vell.com>
To:     Steve Sistare <steven.sistare@...cle.com>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "peterz@...radead.org" <peterz@...radead.org>
CC:     "subhra.mazumdar@...cle.com" <subhra.mazumdar@...cle.com>,
        "dhaval.giani@...cle.com" <dhaval.giani@...cle.com>,
        "rohit.k.jain@...cle.com" <rohit.k.jain@...cle.com>,
        "daniel.m.jordan@...cle.com" <daniel.m.jordan@...cle.com>,
        "pavel.tatashin@...rosoft.com" <pavel.tatashin@...rosoft.com>,
        "matt@...eblueprint.co.uk" <matt@...eblueprint.co.uk>,
        "umgwanakikbuti@...il.com" <umgwanakikbuti@...il.com>,
        "riel@...hat.com" <riel@...hat.com>,
        "jbacik@...com" <jbacik@...com>,
        "juri.lelli@...hat.com" <juri.lelli@...hat.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Ganapatrao Kulkarni <gkulkarni@...vell.com>,
        Jayachandran Chandrasekharan Nair <jnair@...vell.com>
Subject: Re: [PATCH 00/10] steal tasks to improve CPU utilization

On 22-Oct-18 8:40 PM, Steve Sistare wrote:
> 
> When a CPU has no more CFS tasks to run, and idle_balance() fails to
> find a task, then attempt to steal a task from an overloaded CPU in the
> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
> identify candidates.  To minimize search time, steal the first migratable
> task that is found when the bitmap is traversed.  For fairness, search
> for migratable tasks on an overloaded CPU in order of next to run.
> 
> This simple stealing yields a higher CPU utilization than idle_balance()
> alone, because the search is cheap, so it may be called every time the CPU
> is about to go idle.  idle_balance() does more work because it searches
> widely for the busiest queue, so to limit its CPU consumption, it declines
> to search if the system is too busy.  Simple stealing does not offload the
> globally busiest queue, but it is much better than running nothing at all.
> 
> The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
> reduce cache contention vs the usual bitmap when many threads concurrently
> set, clear, and visit elements.
> 
> Patch 1 defines the sparsemask type and its operations.
> 
> Patches 2, 3, and 4 implement the bitmap of overloaded CPUs.
> 
> Patches 5 and 6 refactor existing code for a cleaner merge of later
>    patches.
> 
> Patches 7 and 8 implement task stealing using the overloaded CPUs bitmap.
> 
> Patch 9 disables stealing on systems with more than 2 NUMA nodes for the
> time being because of performance regressions that are not due to stealing
> per-se.  See the patch description for details.
> 
> Patch 10 adds schedstats for comparing the new behavior to the old, and
>    provided as a convenience for developers only, not for integration.
> 
> The patch series is based on kernel 4.19.0-rc7.  It compiles, boots, and
> runs with/without each of CONFIG_SCHED_SMT, CONFIG_SMP, CONFIG_SCHED_DEBUG,
> and CONFIG_PREEMPT.  It runs without error with CONFIG_DEBUG_PREEMPT +
> CONFIG_SLUB_DEBUG + CONFIG_DEBUG_PAGEALLOC + CONFIG_DEBUG_MUTEXES +
> CONFIG_DEBUG_SPINLOCK + CONFIG_DEBUG_ATOMIC_SLEEP.  CPU hot plug and CPU
> bandwidth control were tested.
> 
> Stealing imprroves utilization with only a modest CPU overhead in scheduler
> code.  In the following experiment, hackbench is run with varying numbers
> of groups (40 tasks per group), and the delta in /proc/schedstat is shown
> for each run, averaged per CPU, augmented with these non-standard stats:
> 
>    %find - percent of time spent in old and new functions that search for
>      idle CPUs and tasks to steal and set the overloaded CPUs bitmap.
> 
>    steal - number of times a task is stolen from another CPU.
> 
> X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> hackbench <grps> process 100000
> sched_wakeup_granularity_ns=15000000
> 
>    baseline
>    grps  time  %busy  slice   sched   idle     wake %find  steal
>    1    8.084  75.02   0.10  105476  46291    59183  0.31      0
>    2   13.892  85.33   0.10  190225  70958   119264  0.45      0
>    3   19.668  89.04   0.10  263896  87047   176850  0.49      0
>    4   25.279  91.28   0.10  322171  94691   227474  0.51      0
>    8   47.832  94.86   0.09  630636 144141   486322  0.56      0
> 
>    new
>    grps  time  %busy  slice   sched   idle     wake %find  steal  %speedup
>    1    5.938  96.80   0.24   31255   7190    24061  0.63   7433  36.1
>    2   11.491  99.23   0.16   74097   4578    69512  0.84  19463  20.9
>    3   16.987  99.66   0.15  115824   1985   113826  0.77  24707  15.8
>    4   22.504  99.80   0.14  167188   2385   164786  0.75  29353  12.3
>    8   44.441  99.86   0.11  389153   1616   387401  0.67  38190   7.6
> 
> Elapsed time improves by 8 to 36%, and CPU busy utilization is up
> by 5 to 22% hitting 99% for 2 or more groups (80 or more tasks).
> The cost is at most 0.4% more find time.
> 
> Additional performance results follow.  A negative "speedup" is a
> regression.  Note: for all hackbench runs, sched_wakeup_granularity_ns
> is set to 15 msec.  Otherwise, preemptions increase at higher loads and
> distort the comparison between baseline and new.
> 
> ------------------ 1 Socket Results ------------------
> 
> X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> Average of 10 runs of: hackbench <groups> process 100000
> 
>              --- base --    --- new ---
>    groups    time %stdev    time %stdev  %speedup
>         1   8.008    0.1   5.905    0.2      35.6
>         2  13.814    0.2  11.438    0.1      20.7
>         3  19.488    0.2  16.919    0.1      15.1
>         4  25.059    0.1  22.409    0.1      11.8
>         8  47.478    0.1  44.221    0.1       7.3
> 
> X6-2: 1 socket * 22 cores * 2 hyperthreads = 44 CPUs
> Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> Average of 10 runs of: hackbench <groups> process 100000
> 
>              --- base --    --- new ---
>    groups    time %stdev    time %stdev  %speedup
>         1   4.586    0.8   4.596    0.6      -0.3
>         2   7.693    0.2   5.775    1.3      33.2
>         3  10.442    0.3   8.288    0.3      25.9
>         4  13.087    0.2  11.057    0.1      18.3
>         8  24.145    0.2  22.076    0.3       9.3
>        16  43.779    0.1  41.741    0.2       4.8
> 
> KVM 4-cpu
> Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
> tbench, average of 11 runs.
> 
>    clients    %speedup
>          1        16.2
>          2        11.7
>          4         9.9
>          8        12.8
>         16        13.7
> 
> KVM 2-cpu
> Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
> 
>    Benchmark                     %speedup
>    specjbb2015_critical_jops          5.7
>    mysql_sysb1.0.14_mutex_2          40.6
>    mysql_sysb1.0.14_oltp_2            3.9
> 
> ------------------ 2 Socket Results ------------------
> 
> X6-2: 2 sockets * 10 cores * 2 hyperthreads = 40 CPUs
> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> Average of 10 runs of: hackbench <groups> process 100000
> 
>              --- base --    --- new ---
>    groups    time %stdev    time %stdev  %speedup
>         1   7.945    0.2   7.219    8.7      10.0
>         2   8.444    0.4   6.689    1.5      26.2
>         3  12.100    1.1   9.962    2.0      21.4
>         4  15.001    0.4  13.109    1.1      14.4
>         8  27.960    0.2  26.127    0.3       7.0
> 
> X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
> Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> Average of 10 runs of: hackbench <groups> process 100000
> 
>              --- base --    --- new ---
>    groups    time %stdev    time %stdev  %speedup
>         1   5.826    5.4   5.840    5.0      -0.3
>         2   5.041    5.3   6.171   23.4     -18.4
>         3   6.839    2.1   6.324    3.8       8.1
>         4   8.177    0.6   7.318    3.6      11.7
>         8  14.429    0.7  13.966    1.3       3.3
>        16  26.401    0.3  25.149    1.5       4.9
> 
> 
> X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
> Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> Oracle database OLTP, logging disabled, NVRAM storage
> 
>    Customers   Users   %speedup
>      1200000      40       -1.2
>      2400000      80        2.7
>      3600000     120        8.9
>      4800000     160        4.4
>      6000000     200        3.0
> 
> X6-2: 2 sockets * 14 cores * 2 hyperthreads = 56 CPUs
> Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
> Results from the Oracle "Performance PIT".
> 
>    Benchmark                                           %speedup
> 
>    mysql_sysb1.0.14_fileio_56_rndrd                        19.6
>    mysql_sysb1.0.14_fileio_56_seqrd                        12.1
>    mysql_sysb1.0.14_fileio_56_rndwr                         0.4
>    mysql_sysb1.0.14_fileio_56_seqrewr                      -0.3
> 
>    pgsql_sysb1.0.14_fileio_56_rndrd                        19.5
>    pgsql_sysb1.0.14_fileio_56_seqrd                         8.6
>    pgsql_sysb1.0.14_fileio_56_rndwr                         1.0
>    pgsql_sysb1.0.14_fileio_56_seqrewr                       0.5
> 
>    opatch_time_ASM_12.2.0.1.0_HP2M                          7.5
>    select-1_users-warm_asmm_ASM_12.2.0.1.0_HP2M             5.1
>    select-1_users_asmm_ASM_12.2.0.1.0_HP2M                  4.4
>    swingbenchv3_asmm_soebench_ASM_12.2.0.1.0_HP2M           5.8
> 
>    lm3_memlat_L2                                            4.8
>    lm3_memlat_L1                                            0.0
> 
>    ub_gcc_56CPUs-56copies_Pipe-based_Context_Switching     60.1
>    ub_gcc_56CPUs-56copies_Shell_Scripts_1_concurrent        5.2
>    ub_gcc_56CPUs-56copies_Shell_Scripts_8_concurrent       -3.0
>    ub_gcc_56CPUs-56copies_File_Copy_1024_bufsize_2000_maxblocks 2.4
> 
> X5-2: 2 sockets * 18 cores * 2 hyperthreads = 72 CPUs
> Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
> 
>    NAS_OMP
>    bench class   ncpu    %improved(Mops)
>    dc    B       72      1.3
>    is    C       72      0.9
>    is    D       72      0.7
> 
>    sysbench mysql, average of 24 runs
>            --- base ---     --- new ---
>    nthr   events  %stdev   events  %stdev %speedup
>       1    331.0    0.25    331.0    0.24     -0.1
>       2    661.3    0.22    661.8    0.22      0.0
>       4   1297.0    0.88   1300.5    0.82      0.2
>       8   2420.8    0.04   2420.5    0.04     -0.1
>      16   4826.3    0.07   4825.4    0.05     -0.1
>      32   8815.3    0.27   8830.2    0.18      0.1
>      64  12823.0    0.24  12823.6    0.26      0.0
> 
> --------------------------------------------------------------
> 
> Steve Sistare (10):
>    sched: Provide sparsemask, a reduced contention bitmap
>    sched/topology: Provide hooks to allocate data shared per LLC
>    sched/topology: Provide cfs_overload_cpus bitmap
>    sched/fair: Dynamically update cfs_overload_cpus
>    sched/fair: Hoist idle_stamp up from idle_balance
>    sched/fair: Generalize the detach_task interface
>    sched/fair: Provide can_migrate_task_llc
>    sched/fair: Steal work from an overloaded CPU when CPU goes idle
>    sched/fair: disable stealing if too many NUMA nodes
>    sched/fair: Provide idle search schedstats
> 
>   include/linux/sched/topology.h |   1 +
>   include/linux/sparsemask.h     | 260 +++++++++++++++++++++++++++++++
>   kernel/sched/core.c            |  30 +++-
>   kernel/sched/fair.c            | 338 +++++++++++++++++++++++++++++++++++++----
>   kernel/sched/features.h        |   6 +
>   kernel/sched/sched.h           |  13 +-
>   kernel/sched/stats.c           |  11 +-
>   kernel/sched/stats.h           |  13 ++
>   kernel/sched/topology.c        | 117 +++++++++++++-
>   lib/Makefile                   |   2 +-
>   lib/sparsemask.c               | 142 +++++++++++++++++
>   11 files changed, 898 insertions(+), 35 deletions(-)
>   create mode 100644 include/linux/sparsemask.h
>   create mode 100644 lib/sparsemask.c
> 
> --
> 1.8.3.1
> 
> 

Hi Steve,

Tried your patchset on ThunderX2 with 2 nodes. Please find my observations below.

Hackbench was run on single node due to variance on 2 nodes and it showed 
improvement under load.

Single node hackbench numbers:
group    old time	new time	steals		%change
1	 6.717		7.275		21		-8.31
2	 8.449		9.268		106		-9.69
3	 12.035		12.761		173071		-6.03
4	 14.648		9.787		595889		 33.19
8	 22.513		18.329		2397394		 18.58
16	 39.861		36.263		3949903		 9.06

column "new time" shows hackbench runtime in seconds with the patchset.

Tried below benchmarks with 2 nodes, but no performance benefit/degradation was 
observed on multiple runs.
  - MySQL (read/write/PS etc with sysbench)
  - HHVM running oss-performance benchmarks

Shijith