linux-kernel - Re: [PATCH 4/4] sched/fair: Track possibly overloaded domains and abort a scan if necessary

Message-ID: <CAKfTPtAUuO1Jp6P73gAiP+g5iLTx16UeBgBjm_5zjFxwiBD9=Q@mail.gmail.com>
Date:   Fri, 20 Mar 2020 16:48:39 +0100
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Mel Gorman <mgorman@...hsingularity.net>
Cc:     Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Valentin Schneider <valentin.schneider@....com>,
        Phil Auld <pauld@...hat.com>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 4/4] sched/fair: Track possibly overloaded domains and
 abort a scan if necessary

On Fri, 20 Mar 2020 at 16:13, Mel Gorman <mgorman@...hsingularity.net> wrote:
>
> Once a domain is overloaded, it is very unlikely that a free CPU will
> be found in the short term but there is still potentially a lot of
> scanning. This patch tracks if a domain may be overloaded due to an
> excessive number of running tasks relative to available CPUs.  In the
> event a domain is overloaded, a search is aborted.
>
> This has a variable impact on performance for hackbench which often
> is overloaded on the test machines used. There was a mix of performance gains
> and losses but there is a substantial impact on search efficiency.
>
> On a 2-socket broadwell machine with 80 cores in total, tbench showed
> small gains and some losses
>
> Hmean     1        431.51 (   0.00%)      426.53 *  -1.15%*
> Hmean     2        842.69 (   0.00%)      839.00 *  -0.44%*
> Hmean     4       1631.09 (   0.00%)     1634.81 *   0.23%*
> Hmean     8       3001.08 (   0.00%)     3020.85 *   0.66%*
> Hmean     16      5631.75 (   0.00%)     5655.04 *   0.41%*
> Hmean     32      9736.22 (   0.00%)     9645.68 *  -0.93%*
> Hmean     64     13978.54 (   0.00%)    15215.65 *   8.85%*
> Hmean     128    20093.06 (   0.00%)    19389.45 *  -3.50%*
> Hmean     256    17491.34 (   0.00%)    18616.32 *   6.43%*
> Hmean     320    17423.67 (   0.00%)    17793.38 *   2.12%*
>
> However, the "SIS Domain Search Efficiency" went from 6.03% to 19.61%
> indicating that far fewer CPUs were scanned. The impact of the patch
> is more noticeable when sockets have multiple L3 caches. While true for
> EPYC 2nd generation, it's particularly noticeable on EPYC 1st generation
>
> Hmean     1        325.30 (   0.00%)      324.92 *  -0.12%*
> Hmean     2        630.77 (   0.00%)      621.35 *  -1.49%*
> Hmean     4       1211.41 (   0.00%)     1148.51 *  -5.19%*
> Hmean     8       2017.29 (   0.00%)     1953.57 *  -3.16%*
> Hmean     16      4068.81 (   0.00%)     3514.06 * -13.63%*
> Hmean     32      5588.20 (   0.00%)     6583.58 *  17.81%*
> Hmean     64      8470.14 (   0.00%)    10117.26 *  19.45%*
> Hmean     128    11462.06 (   0.00%)    17207.68 *  50.13%*
> Hmean     256    11433.74 (   0.00%)    13446.93 *  17.61%*
> Hmean     512    12576.88 (   0.00%)    13630.08 *   8.37%*
>
> On this machine, search efficiency goes from 21.04% to 32.66%. There
> is a noticeable problem at 16 when there are enough clients for a LLC
> domain to spill over.
>
> With hackbench, the overload problem is a bit more obvious. On the
> 2-socket broadwell machine using processes and pipes we see
>
> Amean     1        0.3023 (   0.00%)      0.2893 (   4.30%)
> Amean     4        0.6823 (   0.00%)      0.6930 (  -1.56%)
> Amean     7        1.0293 (   0.00%)      1.0380 (  -0.84%)
> Amean     12       1.6913 (   0.00%)      1.7027 (  -0.67%)
> Amean     21       2.9307 (   0.00%)      2.9297 (   0.03%)
> Amean     30       4.0040 (   0.00%)      4.0270 (  -0.57%)
> Amean     48       6.0703 (   0.00%)      6.1067 (  -0.60%)
> Amean     79       9.0630 (   0.00%)      9.1223 *  -0.65%*
> Amean     110     12.1917 (   0.00%)     12.1693 (   0.18%)
> Amean     141     15.7150 (   0.00%)     15.4187 (   1.89%)
> Amean     172     19.5327 (   0.00%)     18.9937 (   2.76%)
> Amean     203     23.3093 (   0.00%)     22.2497 *   4.55%*
> Amean     234     27.8657 (   0.00%)     25.9627 *   6.83%*
> Amean     265     32.9783 (   0.00%)     29.5240 *  10.47%*
> Amean     296     35.6727 (   0.00%)     32.8260 *   7.98%*
>
> More of the SIS stats are worth looking at in this case
>
> Ops SIS Domain Search       10390526707.00  9822163508.00
> Ops SIS Scanned            223173467577.00 48330226094.00
> Ops SIS Domain Scanned     222820381314.00 47964114165.00
> Ops SIS Failures            10183794873.00  9639912418.00
> Ops SIS Recent Used Hit        22194515.00    22517194.00
> Ops SIS Recent Used Miss     5733847634.00  5500415074.00
> Ops SIS Recent Attempts      5756042149.00  5522932268.00
> Ops SIS Search Efficiency             4.81          21.08
>
> Domain search efficiency goes from 4.66% to 20.48% but the SIS Domain Scanned
> shows the sheer volume of searching SIS does when prev, target and recent
> CPUs are unavailable.
>
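
The 4.66% and 20.48% figures match the ratio of "SIS Domain Search" to
"SIS Domain Scanned" in the table above. A minimal sketch of that arithmetic,
assuming this is how the figure is derived (the patch description does not
spell out the formula, and sis_efficiency_pct is a hypothetical helper, not a
kernel function):

```c
#include <stdint.h>

/*
 * Assumption: "efficiency" means searches per CPU scanned, expressed as a
 * percentage. The formula is inferred from the quoted counters, not taken
 * from the patch itself.
 */
static double sis_efficiency_pct(uint64_t searches, uint64_t scanned)
{
	/* Avoid dividing by zero when nothing was scanned. */
	if (scanned == 0)
		return 0.0;

	return 100.0 * (double)searches / (double)scanned;
}
```

With the hackbench counters, sis_efficiency_pct(10390526707, 222820381314)
comes out at roughly 4.66 and sis_efficiency_pct(9822163508, 47964114165) at
roughly 20.48.
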
> This could be much more aggressive by also cutting off a search for idle
> cores. However, to make that work properly requires a much more intrusive
> series that is likely to be controversial. This seemed like a reasonable
> tradeoff to tackle the most obvious problem with select_idle_cpu.
>
> Signed-off-by: Mel Gorman <mgorman@...hsingularity.net>
> ---
>  include/linux/sched/topology.h |  1 +
>  kernel/sched/fair.c            | 65 +++++++++++++++++++++++++++++++++++++++---
>  kernel/sched/features.h        |  3 ++
>  3 files changed, 65 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index af9319e4cfb9..76ec7a54f57b 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -66,6 +66,7 @@ struct sched_domain_shared {
>         atomic_t        ref;
>         atomic_t        nr_busy_cpus;
>         int             has_idle_cores;
> +       int             is_overloaded;

Couldn't comparing nr_busy_cpus against sd->span_weight give you a similar
status?
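
A rough user-space sketch of that comparison, only to make the question
concrete. struct llc_state and llc_all_busy() are hypothetical names; in the
kernel the counter is an atomic_t and would presumably be read with
atomic_read(&sds->nr_busy_cpus):

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two fields being compared. */
struct llc_state {
	int nr_busy_cpus;         /* mirrors sds->nr_busy_cpus */
	unsigned int span_weight; /* mirrors sd->span_weight */
};

/* Treat the domain as saturated when every CPU in its span is busy. */
static bool llc_all_busy(const struct llc_state *s)
{
	return s->nr_busy_cpus >= 0 &&
	       (unsigned int)s->nr_busy_cpus >= s->span_weight;
}
```

Note that "every CPU busy" is a weaker condition than the patch's test of more
than two runnable tasks per scanned CPU, so the two checks would not
necessarily trigger at the same load.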

>  };
>
>  struct sched_domain {
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 41913fac68de..31e011e627db 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5924,6 +5924,38 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
>         return new_cpu;
>  }
>
> +static inline void
> +set_sd_overloaded(struct sched_domain_shared *sds, int val)
> +{
> +       if (!sds)
> +               return;
> +
> +       WRITE_ONCE(sds->is_overloaded, val);
> +}
> +
> +static inline bool test_sd_overloaded(struct sched_domain_shared *sds)
> +{
> +       return READ_ONCE(sds->is_overloaded);
> +}
> +
> +/* Returns true if a previously overloaded domain is likely still overloaded. */
> +static inline bool
> +abort_sd_overloaded(struct sched_domain_shared *sds, int prev, int target)
> +{
> +       if (!sds || !test_sd_overloaded(sds))
> +               return false;
> +
> +       /* Are either target or a suitable prev 1 or 0 tasks? */
> +       if (cpu_rq(target)->nr_running <= 1 ||
> +           (prev != target && cpus_share_cache(prev, target) &&
> +            cpu_rq(prev)->nr_running <= 1)) {
> +               set_sd_overloaded(sds, 0);
> +               return false;
> +       }
> +
> +       return true;
> +}
> +
>  #ifdef CONFIG_SCHED_SMT
>  DEFINE_STATIC_KEY_FALSE(sched_smt_present);
>  EXPORT_SYMBOL_GPL(sched_smt_present);
> @@ -6060,15 +6092,18 @@ static inline int select_idle_smt(struct task_struct *p, int target)
>   * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
>   * average idle time for this rq (as found in rq->avg_idle).
>   */
> -static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
> +static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd,
> +                          int prev, int target)
>  {
>         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>         struct sched_domain *this_sd;
> +       struct sched_domain_shared *sds;
>         u64 avg_cost, avg_idle;
>         u64 time, cost;
>         s64 delta;
>         int this = smp_processor_id();
>         int cpu, nr = INT_MAX;
> +       int nr_scanned = 0, nr_running = 0;
>
>         this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>         if (!this_sd)
> @@ -6092,18 +6127,40 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>                         nr = 4;
>         }
>
> +       sds = rcu_dereference(per_cpu(sd_llc_shared, target));
> +       if (sched_feat(SIS_OVERLOAD)) {
> +               if (abort_sd_overloaded(sds, prev, target))
> +                       return -1;
> +       }
> +
>         time = cpu_clock(this);
>
>         cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>
>         for_each_cpu_wrap(cpu, cpus, target) {
>                 schedstat_inc(this_rq()->sis_scanned);
> -               if (!--nr)
> -                       return -1;
> +               if (!--nr) {
> +                       cpu = -1;
> +                       break;
> +               }
>                 if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>                         break;
> +               if (sched_feat(SIS_OVERLOAD)) {
> +                       nr_scanned++;
> +                       nr_running += cpu_rq(cpu)->nr_running;
> +               }
>         }
>
> +       /* Check if domain should be marked overloaded if no cpu was found. */
> +       if (sched_feat(SIS_OVERLOAD) && (signed)cpu >= nr_cpumask_bits &&
> +           nr_scanned && nr_running > (nr_scanned << 1)) {
> +               set_sd_overloaded(sds, 1);
> +       }
> +
> +       /* Scan cost not accounted for if scan is throttled */
> +       if (!nr)
> +               return -1;
> +
>         time = cpu_clock(this) - time;
>         cost = this_sd->avg_scan_cost;
>         delta = (s64)(time - cost) / 8;
> @@ -6236,7 +6293,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>         if ((unsigned)i < nr_cpumask_bits)
>                 return i;
>
> -       i = select_idle_cpu(p, sd, target);
> +       i = select_idle_cpu(p, sd, prev, target);
>         if ((unsigned)i < nr_cpumask_bits)
>                 return i;
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 7481cd96f391..c36ae01910e2 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -57,6 +57,9 @@ SCHED_FEAT(TTWU_QUEUE, true)
>  SCHED_FEAT(SIS_AVG_CPU, false)
>  SCHED_FEAT(SIS_PROP, true)
>
> +/* Limit scans if the domain is likely overloaded */
> +SCHED_FEAT(SIS_OVERLOAD, true)
> +
>  /*
>   * Issue a WARN when we do multiple update_rq_clock() calls
>   * in a single rq->lock section. Default disabled because the
> --
> 2.16.4
>
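
For reference, the mark/clear behaviour of the flag in the patch above can be
modelled in user space roughly as follows. This is a simplified sketch, not
the kernel code: struct llc_model, model_scan() and model_abort() are
hypothetical, and the model ignores prev, the SIS_PROP scan-depth limit, the
wrap-around start at target and sched_idle_cpu().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one LLC domain: per-CPU runnable counts plus the flag. */
struct llc_model {
	const unsigned int *nr_running; /* runnable tasks per CPU */
	size_t nr_cpus;
	bool is_overloaded;
};

/*
 * Scan for an idle CPU (nr_running == 0). Returns its index, or -1.
 * After a failed full scan, mark the domain overloaded when the scanned
 * CPUs average more than two runnable tasks each.
 */
static int model_scan(struct llc_model *m)
{
	unsigned int scanned = 0, running = 0;

	for (size_t cpu = 0; cpu < m->nr_cpus; cpu++) {
		if (m->nr_running[cpu] == 0)
			return (int)cpu;
		scanned++;
		running += m->nr_running[cpu];
	}

	if (scanned && running > (scanned << 1))
		m->is_overloaded = true;

	return -1;
}

/*
 * Returns true if the scan should be skipped. A lightly loaded target
 * (at most one runnable task) clears the flag so that scanning resumes.
 */
static bool model_abort(struct llc_model *m, size_t target)
{
	if (!m->is_overloaded)
		return false;

	if (m->nr_running[target] <= 1) {
		m->is_overloaded = false;
		return false;
	}

	return true;
}
```

In this model the flag is only set by a scan that visits every CPU and fails,
and only cleared from the abort check, which mirrors the structure of
select_idle_cpu() and abort_sd_overloaded() in the patch.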