[<prev] [next>] [day] [month] [year] [list]
Message-ID: <20200522110500.GD7167@techsingularity.net>
Date: Fri, 22 May 2020 12:05:00 +0100
From: Mel Gorman <mgorman@...hsingularity.net>
To: Hillf Danton <hdanton@...a.com>
Cc: Jirka Hladky <jhladky@...hat.com>, Phil Auld <pauld@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
Valentin Schneider <valentin.schneider@....com>,
LKML <linux-kernel@...r.kernel.org>,
Douglas Shakshober <dshaks@...hat.com>,
Waiman Long <longman@...hat.com>,
Joe Mario <jmario@...hat.com>, Bill Gray <bgray@...hat.com>,
"aokuliar@...hat.com" <aokuliar@...hat.com>,
"kkolakow@...hat.com" <kkolakow@...hat.com>
Subject: Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load
balancer v6
On Fri, May 22, 2020 at 09:09:50AM +0800, Hillf Danton wrote:
>
> On Thu, 21 May 2020 17:04:08 +0100 Mel Gorman wrote:
> > On Thu, May 21, 2020 at 10:09:31PM +0800, Hillf Danton wrote:
> > > > I'm ignoring the coding style of c++ comments but minimally that should
> > > > be fixed. More importantly, node_type can be one of node_overloaded,
> > > > node_has_spare or node_fully busy and this is checking if there is a
> > > > mismatch. However, it's not taking into account whether the dst_node
> > > > is more or less loaded than the src and does not appear to be actually
> > > > checking spare capacity like the comment suggests.
> > > >
> > >
> > > Type other than node_has_spare is not considered because node is not
> > > possible to be so idle that two communicating tasks would better
> > > "stay local."
> > >
> >
> > You hardcode an imbalance of 2 at the start without computing any
> > imbalance.
>
> Same result comes up if it's a bool.
>
Then the magic number is simply confusing. The patch needs to be
a lot clearer about what the intent is if my patch that adds a "if
(env->src_stats.node_type == node_has_spare)" check is not what you were
aiming for.
> > Then if the source is fully_busy or overloaded while the
> > dst is idle, a task can move but that is based on an imaginary hard-coded
> > imbalance of 2.
>
> This is the potpit I walked around by checking spare capacity first. As
> for overloaded, I see it as a signal that a glitch is not idle somewhere
> else, and I prefer to push it in to ICU before it's too late.
>
The domain could be overloaded simply due to CPU bindings. I cannot
parse the ICU comment (Intensive Care Unit?!?) :(
> > Finally, it appears that that the load balancer and
> > NUMA balancer are again using separate logic again when one part of the
> > objective of the series is that the load balancer and NUMA balancer would
> > not override each other.
>
> This explains 80% of why it is a choppy road ahead.
>
And I don't think we should go back to the load balancer and NUMA balancer
taking different actions. It ends up doing useless CPU migrations and
can lead to higher NUMA scanning activity. It's partially why I changed
task_numa_compare to prefer swapping with tasks that move to their
preferred node when using an idle CPU would cause an imbalance.
> > As the imbalance is never computed, the patch
> > can create one which the load balancer then overrides. What am I missing?
>
> LB would give a green light if things move in the direction in favor of
> cutting imb.
>
Load balancing primarily cares about balancing the number of idle CPUs
between domains when there is spare capacity. While it tries to avoid
balancing by moving tasks from their preferred node, it will do so if
there are no other candidates.
> > > > <SNIP>
> > > >
> > > > Then there is this part
> > > >
> > > > + imbalance = adjust_numa_imbalance(imbalance,
> > > > + env->src_stats.nr_running);
> > > > +
> > > > + //Do nothing without imbalance
> > > > + if (!imbalance) {
> > > > + imbalance = 2;
> > > > + goto check_imb;
> > > > + }
> > > >
> > > > So... if there is no imbalance, assume there is an imbalance of 2, jump to
> > > > a branch that will always be false and fall through to code that ignores
> > > > the value of imbalance ...... it's hard to see exactly why that code flow
> > > > is ideal.
> > > >
> > > With spare capacity be certain, then lets see how idle the node is.
> >
> > But you no longer check how idle it is or what it's relative idleness of
> > the destination is relative to the source. adjust_numa_imbalance does
> > not calculate an imbalance, it only decides whether it should be
> > ignored.
>
> Then the idle CPU is no longer so tempting.
>
True. While it's prefectly possible to ignore imbalance and use an idle
CPU if it exists, the load balancer will simply override it later and we
go back to the NUMA balancer and load balancer fighting each other with
the NUMA balancer retrying migrations based on p->numa_migrate_retry.
Whatever logic is used to allow imbalances (be they based on communicating
tasks or preferred locality), it needs to be the same in both balancers.
--
Mel Gorman
SUSE Labs
Powered by blists - more mailing lists