linux-kernel - Re: [PATCH v4 00/10] steal tasks to improve CPU utilization

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <9a9aeb65-7b68-ebc3-d6c7-e907efe6ed2a@oracle.com>
Date:   Thu, 31 Jan 2019 09:16:46 -0800
From:   Dhaval Giani <dhaval.giani@...cle.com>
To:     Steven Sistare <steven.sistare@...cle.com>, mingo@...hat.com,
        peterz@...radead.org
Cc:     subhra.mazumdar@...cle.com, daniel.m.jordan@...cle.com,
        pavel.tatashin@...rosoft.com, matt@...eblueprint.co.uk,
        umgwanakikbuti@...il.com, riel@...hat.com, jbacik@...com,
        juri.lelli@...hat.com, valentin.schneider@....com,
        vincent.guittot@...aro.org, quentin.perret@....com,
        linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [PATCH v4 00/10] steal tasks to improve CPU utilization


> 
> On 12/6/2018 4:28 PM, Steve Sistare wrote:
>> When a CPU has no more CFS tasks to run, and idle_balance() fails to
>> find a task, then attempt to steal a task from an overloaded CPU in the
>> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
>> identify candidates.  To minimize search time, steal the first migratable
>> task that is found when the bitmap is traversed.  For fairness, search
>> for migratable tasks on an overloaded CPU in order of next to run.
>>
>> This simple stealing yields a higher CPU utilization than idle_balance()
>> alone, because the search is cheap, so it may be called every time the CPU
>> is about to go idle.  idle_balance() does more work because it searches
>> widely for the busiest queue, so to limit its CPU consumption, it declines
>> to search if the system is too busy.  Simple stealing does not offload the
>> globally busiest queue, but it is much better than running nothing at all.
>>
>> The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
>> reduce cache contention vs the usual bitmap when many threads concurrently
>> set, clear, and visit elements.
>>
>> Patch 1 defines the sparsemask type and its operations.
>>
>> Patches 2, 3, and 4 implement the bitmap of overloaded CPUs.
>>
>> Patches 5 and 6 refactor existing code for a cleaner merge of later
>>   patches.
>>
>> Patches 7 and 8 implement task stealing using the overloaded CPUs bitmap.
>>
>> Patch 9 disables stealing on systems with more than 2 NUMA nodes for the
>> time being because of performance regressions that are not due to stealing
>> per-se.  See the patch description for details.
>>
>> Patch 10 adds schedstats for comparing the new behavior to the old, and
>>   provided as a convenience for developers only, not for integration.
>>
>> The patch series is based on kernel 4.20.0-rc1.  It compiles, boots, and
>> runs with/without each of CONFIG_SCHED_SMT, CONFIG_SMP, CONFIG_SCHED_DEBUG,
>> and CONFIG_PREEMPT.  It runs without error with CONFIG_DEBUG_PREEMPT +
>> CONFIG_SLUB_DEBUG + CONFIG_DEBUG_PAGEALLOC + CONFIG_DEBUG_MUTEXES +
>> CONFIG_DEBUG_SPINLOCK + CONFIG_DEBUG_ATOMIC_SLEEP.  CPU hot plug and CPU
>> bandwidth control were tested.
>>
>> Stealing improves utilization with only a modest CPU overhead in scheduler
>> code.  In the following experiment, hackbench is run with varying numbers
>> of groups (40 tasks per group), and the delta in /proc/schedstat is shown
>> for each run, averaged per CPU, augmented with these non-standard stats:
>>
>>   %find - percent of time spent in old and new functions that search for
>>     idle CPUs and tasks to steal and set the overloaded CPUs bitmap.
>>
>>   steal - number of times a task is stolen from another CPU.
>>
>> X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
>> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
>> hackbench <grps> process 100000
>> sched_wakeup_granularity_ns=15000000
>>
>>   baseline
>>   grps  time  %busy  slice   sched   idle     wake %find  steal
>>   1    8.084  75.02   0.10  105476  46291    59183  0.31      0
>>   2   13.892  85.33   0.10  190225  70958   119264  0.45      0
>>   3   19.668  89.04   0.10  263896  87047   176850  0.49      0
>>   4   25.279  91.28   0.10  322171  94691   227474  0.51      0
>>   8   47.832  94.86   0.09  630636 144141   486322  0.56      0
>>
>>   new
>>   grps  time  %busy  slice   sched   idle     wake %find  steal  %speedup
>>   1    5.938  96.80   0.24   31255   7190    24061  0.63   7433  36.1
>>   2   11.491  99.23   0.16   74097   4578    69512  0.84  19463  20.9
>>   3   16.987  99.66   0.15  115824   1985   113826  0.77  24707  15.8
>>   4   22.504  99.80   0.14  167188   2385   164786  0.75  29353  12.3
>>   8   44.441  99.86   0.11  389153   1616   387401  0.67  38190   7.6
>>
>> Elapsed time improves by 8 to 36%, and CPU busy utilization is up
>> by 5 to 22% hitting 99% for 2 or more groups (80 or more tasks).
>> The cost is at most 0.4% more find time.
>>
>> Additional performance results follow.  A negative "speedup" is a
>> regression.  Note: for all hackbench runs, sched_wakeup_granularity_ns
>> is set to 15 msec.  Otherwise, preemptions increase at higher loads and
>> distort the comparison between baseline and new.
>>
>> ------------------ 1 Socket Results ------------------
>>
>> X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
>> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
>> Average of 10 runs of: hackbench <groups> process 100000
>>
>>             --- base --    --- new ---
>>   groups    time %stdev    time %stdev  %speedup
>>        1   8.008    0.1   5.905    0.2      35.6
>>        2  13.814    0.2  11.438    0.1      20.7
>>        3  19.488    0.2  16.919    0.1      15.1
>>        4  25.059    0.1  22.409    0.1      11.8
>>        8  47.478    0.1  44.221    0.1       7.3
>>
>> X6-2: 1 socket * 22 cores * 2 hyperthreads = 44 CPUs
>> Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
>> Average of 10 runs of: hackbench <groups> process 100000
>>
>>             --- base --    --- new ---
>>   groups    time %stdev    time %stdev  %speedup
>>        1   4.586    0.8   4.596    0.6      -0.3
>>        2   7.693    0.2   5.775    1.3      33.2
>>        3  10.442    0.3   8.288    0.3      25.9
>>        4  13.087    0.2  11.057    0.1      18.3
>>        8  24.145    0.2  22.076    0.3       9.3
>>       16  43.779    0.1  41.741    0.2       4.8
>>
>> KVM 4-cpu
>> Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
>> tbench, average of 11 runs.
>>
>>   clients    %speedup
>>         1        16.2
>>         2        11.7
>>         4         9.9
>>         8        12.8
>>        16        13.7
>>
>> KVM 2-cpu
>> Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
>>
>>   Benchmark                     %speedup
>>   specjbb2015_critical_jops          5.7
>>   mysql_sysb1.0.14_mutex_2          40.6
>>   mysql_sysb1.0.14_oltp_2            3.9
>>
>> ------------------ 2 Socket Results ------------------
>>
>> X6-2: 2 sockets * 10 cores * 2 hyperthreads = 40 CPUs
>> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
>> Average of 10 runs of: hackbench <groups> process 100000
>>
>>             --- base --    --- new ---
>>   groups    time %stdev    time %stdev  %speedup
>>        1   7.945    0.2   7.219    8.7      10.0
>>        2   8.444    0.4   6.689    1.5      26.2
>>        3  12.100    1.1   9.962    2.0      21.4
>>        4  15.001    0.4  13.109    1.1      14.4
>>        8  27.960    0.2  26.127    0.3       7.0
>>
>> X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
>> Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
>> Average of 10 runs of: hackbench <groups> process 100000
>>
>>             --- base --    --- new ---
>>   groups    time %stdev    time %stdev  %speedup
>>        1   5.826    5.4   5.840    5.0      -0.3
>>        2   5.041    5.3   6.171   23.4     -18.4
>>        3   6.839    2.1   6.324    3.8       8.1
>>        4   8.177    0.6   7.318    3.6      11.7
>>        8  14.429    0.7  13.966    1.3       3.3
>>       16  26.401    0.3  25.149    1.5       4.9
>>
>>
>> X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
>> Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
>> Oracle database OLTP, logging disabled, NVRAM storage
>>
>>   Customers   Users   %speedup
>>     1200000      40       -1.2
>>     2400000      80        2.7
>>     3600000     120        8.9
>>     4800000     160        4.4
>>     6000000     200        3.0
>>
>> X6-2: 2 sockets * 14 cores * 2 hyperthreads = 56 CPUs
>> Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
>> Results from the Oracle "Performance PIT".
>>
>>   Benchmark                                           %speedup
>>
>>   mysql_sysb1.0.14_fileio_56_rndrd                        19.6
>>   mysql_sysb1.0.14_fileio_56_seqrd                        12.1
>>   mysql_sysb1.0.14_fileio_56_rndwr                         0.4
>>   mysql_sysb1.0.14_fileio_56_seqrewr                      -0.3
>>
>>   pgsql_sysb1.0.14_fileio_56_rndrd                        19.5
>>   pgsql_sysb1.0.14_fileio_56_seqrd                         8.6
>>   pgsql_sysb1.0.14_fileio_56_rndwr                         1.0
>>   pgsql_sysb1.0.14_fileio_56_seqrewr                       0.5
>>
>>   opatch_time_ASM_12.2.0.1.0_HP2M                          7.5
>>   select-1_users-warm_asmm_ASM_12.2.0.1.0_HP2M             5.1
>>   select-1_users_asmm_ASM_12.2.0.1.0_HP2M                  4.4
>>   swingbenchv3_asmm_soebench_ASM_12.2.0.1.0_HP2M           5.8
>>
>>   lm3_memlat_L2                                            4.8
>>   lm3_memlat_L1                                            0.0
>>
>>   ub_gcc_56CPUs-56copies_Pipe-based_Context_Switching     60.1
>>   ub_gcc_56CPUs-56copies_Shell_Scripts_1_concurrent        5.2
>>   ub_gcc_56CPUs-56copies_Shell_Scripts_8_concurrent       -3.0
>>   ub_gcc_56CPUs-56copies_File_Copy_1024_bufsize_2000_maxblocks 2.4
>>
>> X5-2: 2 sockets * 18 cores * 2 hyperthreads = 72 CPUs
>> Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
>>
>>   NAS_OMP
>>   bench class   ncpu    %improved(Mops)
>>   dc    B       72      1.3
>>   is    C       72      0.9
>>   is    D       72      0.7
>>
>>   sysbench mysql, average of 24 runs
>>           --- base ---     --- new ---
>>   nthr   events  %stdev   events  %stdev %speedup
>>      1    331.0    0.25    331.0    0.24     -0.1
>>      2    661.3    0.22    661.8    0.22      0.0
>>      4   1297.0    0.88   1300.5    0.82      0.2
>>      8   2420.8    0.04   2420.5    0.04     -0.1
>>     16   4826.3    0.07   4825.4    0.05     -0.1
>>     32   8815.3    0.27   8830.2    0.18      0.1
>>     64  12823.0    0.24  12823.6    0.26      0.0
>>
>> --------------------------------------------------------------
>>
>> Changes from v1 to v2:
>>   - Remove stray find_time hunk from patch 5
>>   - Fix "warning: label out defined but not used" for !CONFIG_SCHED_SMT
>>   - Set SCHED_STEAL_NODE_LIMIT_DEFAULT to 2
>>   - Steal iff avg_idle exceeds the cost of stealing
>>
>> Changes from v2 to v3:
>>   - Update series for kernel 4.20.  Context changes only.
>>
>> Changes from v3 to v4:
>>   - Avoid 64-bit division on 32-bit processors in compute_skid()
>>   - Replace IF_SMP with inline functions to set idle_stamp
>>   - Push ZALLOC_MASK body into calling function
>>   - Set rq->cfs_overload_cpus in update_top_cache_domain instead of
>>     cpu_attach_domain
>>   - Rewrite sparsemask iterator for complete inlining
>>   - Cull and clean up sparsemask functions and moved all into
>>     sched/sparsemask.h
>>
>> Steve Sistare (10):
>>   sched: Provide sparsemask, a reduced contention bitmap
>>   sched/topology: Provide hooks to allocate data shared per LLC
>>   sched/topology: Provide cfs_overload_cpus bitmap
>>   sched/fair: Dynamically update cfs_overload_cpus
>>   sched/fair: Hoist idle_stamp up from idle_balance
>>   sched/fair: Generalize the detach_task interface
>>   sched/fair: Provide can_migrate_task_llc
>>   sched/fair: Steal work from an overloaded CPU when CPU goes idle
>>   sched/fair: disable stealing if too many NUMA nodes
>>   sched/fair: Provide idle search schedstats
>>
>>  include/linux/sched/topology.h |   1 +
>>  kernel/sched/core.c            |  31 +++-
>>  kernel/sched/fair.c            | 354 +++++++++++++++++++++++++++++++++++++----
>>  kernel/sched/features.h        |   6 +
>>  kernel/sched/sched.h           |  13 +-
>>  kernel/sched/sparsemask.h      | 210 ++++++++++++++++++++++++
>>  kernel/sched/stats.c           |  11 +-
>>  kernel/sched/stats.h           |  13 ++
>>  kernel/sched/topology.c        | 121 +++++++++++++-
>>  9 files changed, 726 insertions(+), 34 deletions(-)
>>  create mode 100644 kernel/sched/sparsemask.h
>>

On 2019-01-14 8:55 a.m., Steven Sistare wrote:> Hi Peter and Ingo,
>   I am waiting for one of you to review, ack, or reject this series. I
> have addressed all known issues.  I have a reviewed-by from Valentin
> and a
> tested-by from Vincent which I will add to v5 if you approve the
> patch.
>

Hi Thomas, Peter, Ingo,

These patches have been around for a bit and have been reviewed by
others. Could one of you please take a look at them, and let us know if
we are heading in the right direction?

Dhaval