Message-ID: <20200907124216.GE3179@techsingularity.net>
Date: Mon, 7 Sep 2020 13:42:16 +0100
From: Mel Gorman <mgorman@...hsingularity.net>
To: "Song Bao Hua (Barry Song)" <song.bao.hua@...ilicon.com>
Cc: Mel Gorman <mgorman@...e.de>,
"mingo@...hat.com" <mingo@...hat.com>,
"peterz@...radead.org" <peterz@...radead.org>,
"juri.lelli@...hat.com" <juri.lelli@...hat.com>,
"vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
"dietmar.eggemann@....com" <dietmar.eggemann@....com>,
"bsegall@...gle.com" <bsegall@...gle.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
Valentin Schneider <valentin.schneider@....com>,
Phil Auld <pauld@...hat.com>, Hillf Danton <hdanton@...a.com>,
Ingo Molnar <mingo@...nel.org>, Linuxarm <linuxarm@...wei.com>,
"Liguozhu (Kenneth)" <liguozhu@...ilicon.com>
Subject: Re: [RFC] sched/numa: don't move tasks to idle numa nodes while src
node has very light load?
On Mon, Sep 07, 2020 at 12:00:10PM +0000, Song Bao Hua (Barry Song) wrote:
> Hi All,
> Consider a NUMA system with 4 nodes and 24 CPUs in each node, where all 96 CPUs are idle.
> We then start a process with 4 threads on this totally idle system.
> Any one of the four NUMA nodes has enough capacity to run the 4 threads and still have 20 idle CPUs left over.
> But right now the existing CFS load balance code will spread the 4 threads across multiple nodes.
> This has two negative side effects:
> 1. more NUMA nodes are woken up when they could otherwise save power by staying at the lowest frequency or halted
> 2. cache coherency overhead between NUMA nodes
>
> A proof-of-concept patch I made to "fix" this issue to some extent is like:
>
This is similar in concept to an earlier patch, except that one made the
change in adjust_numa_imbalance(). It ended up being great for light loads
like simple communicating pairs but fell apart for some HPC workloads when
memory bandwidth requirements increased. Ultimately it was dropped until
the NUMA/CPU load balancing was reconciled, so it may be worth a revisit.
At the time, it was really problematic once one node was roughly 25% CPU
utilised on a 2-socket machine with hyper-threading enabled. The patch may
still work out but it would need wider testing. Within mmtests, the NAS
D-class workloads on a 2-socket machine, run with a varying number of
parallel tasks/processes, should be enough to determine if the patch is
free from side-effects for one machine. It gets problematic for different
machine sizes as the point where memory bandwidth is saturated varies.
group_weight/4 might be fine as a cut-off on one machine and a problem
on a larger machine with more cores -- I hit that particular problem
when one 2-socket machine with 48 logical CPUs was fine but a different
machine with 80 logical CPUs regressed.
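
To make the group_weight/4 arithmetic concrete, here is a minimal
standalone sketch. It is not the kernel code and not the patch under
discussion: both function names are hypothetical, and it assumes
group_weight is the number of logical CPUs in one NUMA node, with one
node per socket.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not kernel code: the task count below which a
 * node would be left alone, using a group_weight/4 cut-off.
 * Assumption: group_weight is the number of logical CPUs in the node.
 */
unsigned int numa_imbalance_cutoff(unsigned int group_weight)
{
	return group_weight / 4;
}

/*
 * Hypothetical policy check: keep tasks on the source node while it
 * runs fewer tasks than the cut-off.
 */
bool keep_tasks_on_src_node(unsigned int src_nr_running,
			    unsigned int group_weight)
{
	return src_nr_running < numa_imbalance_cutoff(group_weight);
}
```

Under those assumptions, the 48-CPU 2-socket machine has 24 CPUs per node
and a cut-off of 6 tasks, while the 80-CPU machine has 40 CPUs per node and
a cut-off of 10. The fraction stays the same but the absolute number of
tasks packed onto one node grows, and whether memory bandwidth saturates
before that point depends on the machine.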
I'm not saying the patch is wrong, just that patches in general for this
area (everyone, not just you) need fairly broad testing.
--
Mel Gorman
SUSE Labs