Message-Id: <20200120080935.GD20112@linux.vnet.ibm.com>
Date: Mon, 20 Jan 2020 13:39:35 +0530
From: Srikar Dronamraju <srikar@...ux.vnet.ibm.com>
To: Mel Gorman <mgorman@...hsingularity.net>
Cc: Vincent Guittot <vincent.guittot@...aro.org>,
Phil Auld <pauld@...hat.com>, Ingo Molnar <mingo@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Valentin Schneider <valentin.schneider@....com>,
Quentin Perret <quentin.perret@....com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Morten Rasmussen <Morten.Rasmussen@....com>,
Hillf Danton <hdanton@...a.com>,
Parth Shah <parth@...ux.ibm.com>,
Rik van Riel <riel@...riel.com>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] sched, fair: Allow a small load imbalance between low
utilisation SD_NUMA domains v4
* Mel Gorman <mgorman@...hsingularity.net> [2020-01-17 21:58:53]:
> On Fri, Jan 17, 2020 at 11:26:31PM +0530, Srikar Dronamraju wrote:
> > * Mel Gorman <mgorman@...hsingularity.net> [2020-01-14 10:13:20]:
> >
> > We certainly are seeing better results than v1.
> > However numa02, numa03, numa05, numa09 and numa10 still seem to be
> > regressing, while the others are improving.
> >
> > While numa04 improves by 14%, numa02 regresses by around 12%.
> >
> Ok, so it's both a win and a loss. This is a curiosity that this patch
> may be the primary factor given that the logic only triggers when the
> local group has spare capacity and the busiest group is nearly idle. The
> test cases you describe should have fairly busy local groups.
>
Right, your code only takes effect when the local group has spare capacity
and busiest->sum_nr_running <= 2, which is roughly the check sketched below.
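
For reference, the check I am referring to is roughly the hunk below in the
imbalance calculation (paraphrased from my reading of v4, so please treat it
as a sketch and not the exact hunk):

	if (env->sd->flags & SD_NUMA) {
		unsigned int imbalance_min;

		/*
		 * Allow a small imbalance when the busiest group is
		 * almost idle, e.g. a pair of communicating tasks
		 * that should stay local.
		 */
		imbalance_min = 2;
		if (busiest->sum_nr_running <= imbalance_min)
			env->imbalance = 0;
	}

So unless the busiest group really is down to one or two runnable tasks,
the numa* runs should not even hit this path.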
> >
> > numa01 is a set of 2 process each running 128 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
>
> Are the shared operations shared between the 2 processes? 256 threads
> in total would more than exceed the capacity of a local group, even 128
> threads per process would exceed the capacity of the local group. In such
> a situation, much would depend on the locality of the accesses as well
> as any shared accesses.
Except for numa02 and numa07 (both handle local memory operations), all
shared operations are within the process, i.e. per-process sharing. A rough
sketch of the two access patterns follows below.
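
To make the distinction concrete, the per-thread work in the two kinds of
tests looks roughly like the sketch below (a simplified pseudo-benchmark,
not the actual numa* scripts; the sizes and loop counts are the ones quoted
above, and each function is the body of one worker thread):

	#include <stdlib.h>
	#include <string.h>

	#define GB (1UL << 30)
	#define MB (1UL << 20)

	/* numa01/numa03/numa04/numa05 style: every thread walks the same
	 * process-shared 3GB region passed in by the parent. */
	static void *shared_worker(void *shared_3gb)
	{
		for (int loop = 0; loop < 50; loop++)
			memset(shared_3gb, loop, 3 * GB);
		return NULL;
	}

	/* numa02 style: each thread only ever touches its own 32MB. */
	static void *local_worker(void *arg)
	{
		char *buf = malloc(32 * MB);

		for (int loop = 0; loop < 800; loop++)
			memset(buf, loop, 32 * MB);
		free(buf);
		return NULL;
	}

The shared variant is what keeps NUMA balancing busy; the local variant
(numa02/numa07) has no cross-thread sharing at all.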
>
> > numa02 is a single process with 256 threads;
> > each thread doing 800 loops on 32MB thread local memory operations.
> >
>
> This one is more interesting. False sharing shouldn't be an issue so the
> threads should be independent.
>
> > numa03 is a single process with 256 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
> >
>
> Similar.
This is similar to numa01, except that now all the threads belong to a
single process.
>
> > numa04 is a set of 8 process (as many nodes) each running 32 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
> >
>
> Less clear as you don't say what is sharing the memory operations.
All sharing is within the process. In numa04/numa09, I try to spawn as many
processes as there are nodes; other than that it is the same as numa02.
>
> > numa05 is a set of 16 process (twice as many nodes) each running 16 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
> >
>
> > Details below:
>
> How many iterations for each test?
I ran 5 iterations. Do you want me to run with more iterations?
>
>
> > ./numa02.sh Real: 78.87 82.31 80.59 1.72 -12.7187%
> > ./numa02.sh Sys: 81.18 85.07 83.12 1.94 -35.0337%
> > ./numa02.sh User: 16303.70 17122.14 16712.92 409.22 -12.5182%
>
> Before range: 58 to 72
> After range: 78 to 82
>
> This one is more interesting in general. Can you add trace_printks to
> the check for SD_NUMA the patch introduces and dump the sum_nr_running
> for both local and busiest when the imbalance is ignored please? That
> might give some hint as to the improper conditions where imbalance is
> ignored.
Can be done; I will add tracing along the lines of the sketch below and get
back with the results. But do let me know if you want me to run with more
iterations or rerun the tests.
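
Concretely, I plan to add something like the below inside the path where the
imbalance gets ignored (local/busiest being the sg_lb_stats in that
function; this is a sketch only, I will adjust it to the final form of the
hunk):

	if (busiest->sum_nr_running <= imbalance_min) {
		trace_printk("SD_NUMA imbalance ignored: local=%u busiest=%u\n",
			     local->sum_nr_running,
			     busiest->sum_nr_running);
		env->imbalance = 0;
	}

Counting those events over a numa02 run should also tell us how often the
condition is met at all.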
>
> However, knowing the number of iterations would be helpful. Can you also
> tell me if this is consistent between boots or is it always roughly 12%
> regression regardless of the number of iterations?
>
I have only measured 5 iterations, and I have not yet repeated the runs to
see whether the numbers are consistent.
> > ./numa03.sh Real: 477.20 528.12 502.66 25.46 -4.85219%
> > ./numa03.sh Sys: 88.93 115.36 102.15 13.21 -25.629%
> > ./numa03.sh User: 119120.73 129829.89 124475.31 5354.58 -3.8219%
>
> Range before: 471 to 485
> Range after: 477 to 528
>
> > ./numa04.sh Real: 374.70 414.76 394.73 20.03 14.6708%
> > ./numa04.sh Sys: 357.14 379.20 368.17 11.03 3.27294%
> > ./numa04.sh User: 87830.73 88547.21 88188.97 358.24 5.7113%
>
> Range before: 450 -> 454
> Range after: 374 -> 414
>
> Big gain there but the fact the range changed so much is a concern and
> makes me wonder if this case is stable from boot to boot.
>
> > ./numa05.sh Real: 369.50 401.56 385.53 16.03 -5.64937%
> > ./numa05.sh Sys: 718.99 741.02 730.00 11.01 -3.76438%
> > ./numa05.sh User: 84989.07 85271.75 85130.41 141.34 -1.48142%
> >
>
> Big range changes again but the shared memory operations complicate
> matters. I think it's best to focus on numa02 and identify if there
> is an improper condition where the patch has an impact, the local group
> has high utilisation but spare capacity while the busiest group is
> almost completely idle.
>
> > vmstat for numa01
>
> I'm not going to comment in detail on these other than noting that NUMA
> balancing is heavily active in all cases which may be masking any effect
> of the patch and may have unstable results in general.
>
> > <SNIP vmstat>
> > <SNIP description of loads that showed gains>
> >
> > numa09 is a set of 8 process (as many nodes) each running 4 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
> >
>
> No description of shared operations but NUMA balancing is very active so
> sharing is probably between processes.
>
> > numa10 is a set of 16 process (twice as many nodes) each running 2 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
> >
>
> Again, shared accesses without description and heavy NUMA balancing
> activity.
>
> So bottom line, a lot of these cases have shared operations where NUMA
> balancing decisions should dominate and make it hard to detect any impact
> from the patch. The exception is numa02 so please add tracing and dump
> out local and busiest sum_nr_running when the imbalance is ignored. I
> want to see if it's as simple as the local group is very busy but has
> capacity where the busiest group is almost idle. I also want to see how
> many times over the course of the numa02 workload that the conditions
> for the patch are even met.
>
--
Thanks and Regards
Srikar Dronamraju