linux-kernel - Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <20200521160408.GC7167@techsingularity.net>
Date:   Thu, 21 May 2020 17:04:08 +0100
From:   Mel Gorman <mgorman@...hsingularity.net>
To:     Hillf Danton <hdanton@...a.com>
Cc:     Jirka Hladky <jhladky@...hat.com>, Phil Auld <pauld@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>,
        Valentin Schneider <valentin.schneider@....com>,
        LKML <linux-kernel@...r.kernel.org>,
        Douglas Shakshober <dshaks@...hat.com>,
        Waiman Long <longman@...hat.com>,
        Joe Mario <jmario@...hat.com>, Bill Gray <bgray@...hat.com>,
        "aokuliar@...hat.com" <aokuliar@...hat.com>,
        "kkolakow@...hat.com" <kkolakow@...hat.com>
Subject: Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load
 balancer v6

On Thu, May 21, 2020 at 10:09:31PM +0800, Hillf Danton wrote:
> > I'm ignoring the coding style of c++ comments but minimally that should
> > be fixed. More importantly, node_type can be one of node_overloaded,
> > node_has_spare or node_fully busy and this is checking if there is a
> > mismatch. However, it's not taking into account whether the dst_node
> > is more or less loaded than the src and does not appear to be actually
> > checking spare capacity like the comment suggests.
> > 
>
> Type other than node_has_spare is not considered because node is not
> possible to be so idle that two communicating tasks would better
> "stay local."
> 

You hardcode an imbalance of 2 at the start without computing any
imbalance. Then if the source is fully_busy or overloaded while the
dst is idle, a task can move but that is based on an imaginary hard-coded
imbalance of 2. Finally, it appears that that the load balancer and
NUMA balancer are again using separate logic again when one part of the
objective of the series is that the load balancer and NUMA balancer would
not override each other. As the imbalance is never computed, the patch
can create one which the load balancer then overrides. What am I missing?

That said, it does make a certain amount of sense that if the dst has
spare capacity and the src is fully busy or overloaded then allow the
migration regardless of imbalance but that would be a different patch --
something like this but I didn't think too deeply or test it.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f323b85b6d..97c0e090e161 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1916,19 +1916,27 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 	 * imbalance that would be overruled by the load balancer.
 	 */
 	if (env->dst_stats.node_type == node_has_spare) {
-		unsigned int imbalance;
+		unsigned int imbalance = 0;
 		int src_running, dst_running;
 
 		/*
-		 * Would movement cause an imbalance? Note that if src has
-		 * more running tasks that the imbalance is ignored as the
-		 * move improves the imbalance from the perspective of the
-		 * CPU load balancer.
-		 * */
-		src_running = env->src_stats.nr_running - 1;
-		dst_running = env->dst_stats.nr_running + 1;
-		imbalance = max(0, dst_running - src_running);
-		imbalance = adjust_numa_imbalance(imbalance, src_running);
+		 * Check the imbalance if both src and dst have spare
+		 * capacity. If src is fully_busy or overloaded then
+		 * allow the task to move as it's both improving locality
+		 * and reducing an imbalance.
+		 */
+		if (env->src_stats.node_type == node_has_spare) {
+			/*
+			 * Would movement cause an imbalance? Note that if src
+			 * has more running tasks that the imbalance is
+			 * ignored as the move improves the imbalance from the
+			 * perspective of the CPU load balancer.
+			 */
+			src_running = env->src_stats.nr_running - 1;
+			dst_running = env->dst_stats.nr_running + 1;
+			imbalance = max(0, dst_running - src_running);
+			imbalance = adjust_numa_imbalance(imbalance, src_running);
+		}
 
 		/* Use idle CPU if there is no imbalance */
 		if (!imbalance) {

> > Then there is this part
> > 
> > +               imbalance = adjust_numa_imbalance(imbalance,
> > +                                               env->src_stats.nr_running);
> > +
> > +               //Do nothing without imbalance
> > +               if (!imbalance) {
> > +                       imbalance = 2;
> > +                       goto check_imb;
> > +               }
> > 
> > So... if there is no imbalance, assume there is an imbalance of 2, jump to
> > a branch that will always be false and fall through to code that ignores
> > the value of imbalance ...... it's hard to see exactly why that code flow
> > is ideal.
> > 
> With spare capacity be certain, then lets see how idle the node is.

But you no longer check how idle it is or what it's relative idleness of
the destination is relative to the source. adjust_numa_imbalance does
not calculate an imbalance, it only decides whether it should be
ignored.

> And I
> can not do that without the tool you created, He bless you. Although not
> sure they're two tasks talking to each other. This is another topic in the
> coming days.
> 

There is insufficient context in this path to determine if two tasks are
communicating. In some cases it may be inferred from last_wakee but that
only works for two tasks, it doesn't work for 1:n waker:wakees such as
might happen with a producer/consumer pattern. I guess you could record
the time that tasks migrated cross-node due to a wakeup and avoid migrating
those tasks for a period of time by either NUMA or load balancer but that
could cause transient overloads of a domain if there are a lot of wakeups
(e.g. hackbench). It's a change that would need to be done on both the
NUMA and load balancer to force the migration even if there are wakeup
relationships if the domain is fully busy or overloaded.

-- 
Mel Gorman
SUSE Labs