linux-kernel - Re: [PATCH 3/3] sched/numa: Limit the amount of imbalance that can exist at fork time

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20201117172847.GC3371@techsingularity.net>
Date:   Tue, 17 Nov 2020 17:28:47 +0000
From:   Mel Gorman <mgorman@...hsingularity.net>
To:     Vincent Guittot <vincent.guittot@...aro.org>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Ingo Molnar <mingo@...nel.org>,
        Valentin Schneider <valentin.schneider@....com>,
        Juri Lelli <juri.lelli@...hat.com>
Subject: Re: [PATCH 3/3] sched/numa: Limit the amount of imbalance that can
 exist at fork time

On Tue, Nov 17, 2020 at 04:53:10PM +0100, Vincent Guittot wrote:
> On Tue, 17 Nov 2020 at 16:17, Mel Gorman <mgorman@...hsingularity.net> wrote:
> >
> > On Tue, Nov 17, 2020 at 03:31:19PM +0100, Vincent Guittot wrote:
> > > On Tue, 17 Nov 2020 at 15:18, Peter Zijlstra <peterz@...radead.org> wrote:
> > > >
> > > > On Tue, Nov 17, 2020 at 01:42:22PM +0000, Mel Gorman wrote:
> > > > > -                     if (local_sgs.idle_cpus)
> > > > > +                     if (local_sgs.idle_cpus >= (sd->span_weight >> 2))
> > > > >                               return NULL;
> > > >
> > > > Is that the same 25% ?
> > >
> > > same question for me
> >
> > It's the same 25%. It's in the comment as -- utilisation is not too high
> 
> utilization is misleading, it usually refers to pelt utilization
> whereas whet you check is the number of busy cpus
> 

Will fix.

> > where "high" is related to adjust_numa_imbalance.
> >
> > > could we encapsulate this 25% allowed imbalance like for adjust_numa_imbalance
> >
> > Using adjust_numa_imbalance() directly in this context would be awkward
> 
> Would be good to use the same function to say if we allow or not the imbalance
> 
> something like numa_allow_imbalance(sg_lb_stats * group_stats)
> 

Also can be done.

> which would return how much margin remains available before not
> allowing the imbalance
> 

Easier to just make it a bool as the available margin is not relevant
(yet).

> 
> > but the threshold could be shared with something like the additional
> > diff below. Is that what you had in mind or something different?
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index adfab218a498..49ef3484c56c 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5878,6 +5878,8 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> >  static struct sched_group *
> >  find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu);
> >
> > +static inline int numa_imbalance_threshold(int weight);
> > +
> >  /*
> >   * find_idlest_group_cpu - find the idlest CPU among the CPUs in the group.
> >   */
> > @@ -8894,7 +8896,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >                          * If there is a real need of migration, periodic load
> >                          * balance will take care of it.
> >                          */
> > -                       if (local_sgs.idle_cpus >= (sd->span_weight >> 2))
> 
> also here you use idle_cpus and on the other part you use nr_running.
> Can't we use the same metrics on both sides?
> 

We can. Basing it on sum_nr_running is only an approximation of the number
of busy CPUs if tasks are bound to a subset of CPUs but it would mean that
the fork spreading decision is in line with adjust_numa_imbalance(). That
is a sensible starting point and we could pass in
(sd->span_weight - local_sgs.idle_cpus into numa_allow_imbalance()) if
there was strong justification for it.

Untested patch currently looks like

---
 kernel/sched/fair.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 33709dfac24d..4c8a3b570b0a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1550,6 +1550,7 @@ struct task_numa_env {
 static unsigned long cpu_load(struct rq *rq);
 static unsigned long cpu_runnable(struct rq *rq);
 static unsigned long cpu_util(int cpu);
+static inline bool allow_numa_imbalance(int dst_running, int dst_weight);
 static inline long adjust_numa_imbalance(int imbalance,
 					int dst_running, int dst_weight);
 
@@ -8779,9 +8780,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 			.group_type = group_overloaded,
 	};
 
-	imbalance = scale_load_down(NICE_0_LOAD) *
-				(sd->imbalance_pct-100) / 100;
-
 	do {
 		int local_group;
 
@@ -8835,6 +8833,11 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 	switch (local_sgs.group_type) {
 	case group_overloaded:
 	case group_fully_busy:
+
+		/* Calculate allowed imbalance based on load */
+		imbalance = scale_load_down(NICE_0_LOAD) *
+				(sd->imbalance_pct-100) / 100;
+
 		/*
 		 * When comparing groups across NUMA domains, it's possible for
 		 * the local domain to be very lightly loaded relative to the
@@ -8891,7 +8894,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 			 * a real need of migration, periodic load balance will
 			 * take care of it.
 			 */
-			if (local_sgs.idle_cpus)
+			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
 				return NULL;
 		}
 
@@ -8995,9 +8998,22 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 #define NUMA_IMBALANCE_MIN 2
 
+/*
+ * Allow a NUMA imbalance if busy CPUs is less than 25% of the domain.
+ * This is an approximation as the number of running tasks may not be
+ * related to the number of busy CPUs due to sched_setaffinity.
+ */
+static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
+{
+	return (dst_running < (dst_weight >> 2));
+}
+
 static inline long adjust_numa_imbalance(int imbalance,
 				int dst_running, int dst_weight)
 {
+	if (!allow_numa_imbalance(dst_running, dst_weight))
+		return imbalance;
+
 	/*
 	 * Allow a small imbalance based on a simple pair of communicating
 	 * when the destination is lightly loaded so that pairs of