Message-ID: <20200107095655.GF3466@techsingularity.net>
Date: Tue, 7 Jan 2020 09:56:55 +0000
From: Mel Gorman <mgorman@...hsingularity.net>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Ingo Molnar <mingo@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Phil Auld <pauld@...hat.com>,
Valentin Schneider <valentin.schneider@....com>,
Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
Quentin Perret <quentin.perret@....com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Morten Rasmussen <Morten.Rasmussen@....com>,
Hillf Danton <hdanton@...a.com>,
Parth Shah <parth@...ux.ibm.com>,
Rik van Riel <riel@...riel.com>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] sched, fair: Allow a small degree of load imbalance
between SD_NUMA domains v2
On Tue, Jan 07, 2020 at 09:38:26AM +0100, Vincent Guittot wrote:
> > > This looks weird to me because you use imbalance_pct, which is
> > > meaningful only to compare a ratio, to define a number that will then
> > > be compared to a number of tasks without taking into account the weight
> > > of the node. So whatever the node size, 32 or 128 CPUs, the
> > > imbalance_adj will be the same: 3 with the default imbalance_pct of
> > > the NUMA level, which is 125 AFAICT
> > >
> >
> > The intent in this version was to only cover the low utilisation case
> > regardless of the NUMA node size. There were too many corner cases
> > where the tradeoff of local memory latency versus local memory bandwidth
> > cannot be quantified. See Srikar's report as an example but it's a general
> > problem. The use of imbalance_pct was simply to find the smallest number of
> > running tasks where (imbalance_pct - 100) would be 1 running task and limit
>
> But using imbalance_pct alone doesn't mean anything.
Other than figuring out "how many running tasks are required before
imbalance_pct is roughly equivalent to one fully active CPU?". Even then,
it's a bit weak as imbalance_pct makes hard-coded assumptions about the
tradeoff of cross-domain migration without considering the hardware.
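For illustration only, the arithmetic works out as in the following
standalone sketch (not kernel code, assuming the default SD_NUMA
imbalance_pct of 125 quoted above):

/*
 * Standalone illustration, not part of any patch: with imbalance_pct
 * of 125 the "excess" over a balanced state is 25%, so it takes
 * roughly 100 / 25 = 4 running tasks before that excess adds up to
 * one fully active CPU, which is the ballpark the ~3 task cutoff
 * quoted above falls into.
 */
#include <stdio.h>

int main(void)
{
	unsigned int imbalance_pct = 125;	/* default at SD_NUMA level */
	unsigned int tasks = 100 / (imbalance_pct - 100);

	printf("imbalance_pct %u: ~%u running tasks ~= 1 fully active CPU\n",
	       imbalance_pct, tasks);
	return 0;
}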
> Using something similar to the below
>
> busiest->group_weight * (env->sd->imbalance_pct - 100) / 100
>
> would be more meaningful
>
It's meaningful in some sense from a conceptual point of view but
setting the low utilisation cutoff depending on the number of CPUs in
the node does not account for local memory latency vs bandwidth.
On a small or mid-range machine the cutoff will make sense. On
larger machines, the cutoff could be at the point where memory bandwidth
is saturated, leading to a scenario whereby upgrading to a larger
machine performs worse than the smaller machine.
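As a rough illustration of that scaling (standalone sketch, the node
sizes are hypothetical and imbalance_pct is the SD_NUMA default of 125):

/*
 * Standalone illustration of the suggested cutoff
 * group_weight * (imbalance_pct - 100) / 100 for a few hypothetical
 * node sizes. The larger the node, the more tasks can be stacked on
 * one node before balancing kicks in.
 */
#include <stdio.h>

int main(void)
{
	unsigned int imbalance_pct = 125;
	unsigned int weights[] = { 32, 64, 128 };
	unsigned int i;

	for (i = 0; i < sizeof(weights) / sizeof(weights[0]); i++)
		printf("group_weight %3u -> cutoff %u tasks\n", weights[i],
		       weights[i] * (imbalance_pct - 100) / 100);
	return 0;
}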
Much more importantly, doing what you suggest allows an imbalance
of more CPUs than are backed by a single LLC. On high-end AMD EPYC 2
machines, busiest->group_weight scaled by imbalance_pct spans multiple L3
caches. That is going to have side-effects. While I also do not account
for the LLC group_weight, it's unlikely that the cut-off I used would be
larger than an LLC even on a large machine.
These two points are why I didn't take the group weight into account.
Now if you want, I can do what you suggest anyway as long as you are happy
that the child domain weight is also taken into account to bound the
largest possible allowed imbalance, to deal with the case of a node having
multiple small LLC caches. That means that some machines will use the
size of the node and some machines will use the size of an LLC. It's less
predictable overall as some machines will be "special" relative to others,
making it harder to reproduce certain problems locally, but it would take
imbalance_pct into account in a way that you're happy with.
Also bear in mind that whether the LLC is accounted for or not, the final
result should be halved, similar to the other imbalance calculations, to
avoid over or under load balancing.
> Or you could use the util_avg so you will take into account if the
> tasks are short running ones or long running ones
>
util_avg can be skewed if there are big outliers. Even then, it's not
a great metric for the low utilisation cutoff. Large numbers of mostly
idle but running tasks would be treated similarly to small numbers of
fully active tasks. It's less predictable and makes it harder to reason
about how load balancing behaves across a variety of workloads.
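To illustrate the aliasing (standalone sketch, the figures are made up
but use the kernel's 1024-based utilisation scale):

/*
 * Standalone illustration with made-up figures: a sum of util_avg
 * cannot tell many mostly idle tasks apart from one fully active
 * task, even though spreading them across nodes has very different
 * consequences.
 */
#include <stdio.h>

int main(void)
{
	unsigned int nr_idleish = 16, util_idleish = 50;	/* ~5% each */
	unsigned int nr_busy = 1, util_busy = 800;		/* ~80% busy */

	printf("16 mostly idle tasks: total util ~%u\n",
	       nr_idleish * util_idleish);
	printf(" 1 fully active task: total util ~%u\n",
	       nr_busy * util_busy);
	return 0;
}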
Based on what you suggest, the result looks like this (build-tested
only):
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ba749f579714..1b2c7bed2db5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8648,10 +8648,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	/*
 	 * Try to use spare capacity of local group without overloading it or
 	 * emptying busiest.
-	 * XXX Spreading tasks across NUMA nodes is not always the best policy
-	 * and special care should be taken for SD_NUMA domain level before
-	 * spreading the tasks. For now, load_balance() fully relies on
-	 * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
 	 */
 	if (local->group_type == group_has_spare) {
 		if (busiest->group_type > group_fully_busy) {
@@ -8691,16 +8687,41 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 			env->migration_type = migrate_task;
 			lsub_positive(&nr_diff, local->sum_nr_running);
 			env->imbalance = nr_diff >> 1;
-			return;
-		}
+		} else {
 
-		/*
-		 * If there is no overload, we just want to even the number of
-		 * idle cpus.
-		 */
-		env->migration_type = migrate_task;
-		env->imbalance = max_t(long, 0, (local->idle_cpus -
+			/*
+			 * If there is no overload, we just want to even the number of
+			 * idle cpus.
+			 */
+			env->migration_type = migrate_task;
+			env->imbalance = max_t(long, 0, (local->idle_cpus -
 						busiest->idle_cpus) >> 1);
+		}
+
+		/* Consider allowing a small imbalance between NUMA groups */
+		if (env->sd->flags & SD_NUMA) {
+			struct sched_domain *child = env->sd->child;
+			unsigned int imbalance_adj;
+
+			/*
+			 * Calculate an acceptable degree of imbalance based
+			 * on imbalance_adj. However, do not allow a greater
+			 * imbalance than the child domains weight to avoid
+			 * a case where the allowed imbalance spans multiple
+			 * LLCs.
+			 */
+			imbalance_adj = busiest->group_weight * (env->sd->imbalance_pct - 100) / 100;
+			imbalance_adj = min(imbalance_adj, child->span_weight);
+			imbalance_adj >>= 1;
+
+			/*
+			 * Ignore small imbalances when the busiest group has
+			 * low utilisation.
+			 */
+			if (busiest->sum_nr_running < imbalance_adj)
+				env->imbalance = 0;
+		}
+
 		return;
 	}
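For what it's worth, plugging hypothetical numbers into the hunk above
gives the following (illustration only, the topology figures are made up
rather than taken from a real machine):

/*
 * Standalone illustration of the cutoff computed above: a node with
 * 128 CPUs, the default SD_NUMA imbalance_pct of 125 and a child
 * domain spanning 16 CPUs. The raw value of 32 is capped by the child
 * weight and then halved, so the imbalance is ignored while the
 * busiest group has fewer than 8 running tasks.
 */
#include <stdio.h>

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

int main(void)
{
	unsigned int imbalance_pct = 125;	/* default at SD_NUMA level */
	unsigned int group_weight = 128;	/* hypothetical node size */
	unsigned int child_weight = 16;		/* hypothetical child span */
	unsigned int imbalance_adj;

	imbalance_adj = group_weight * (imbalance_pct - 100) / 100;
	imbalance_adj = min_uint(imbalance_adj, child_weight);
	imbalance_adj >>= 1;

	printf("imbalance ignored while sum_nr_running < %u\n", imbalance_adj);
	return 0;
}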