Message-ID: <5941ce856343c314f829852d3a831cdd19f06573.camel@intel.com>
Date: Thu, 19 May 2022 15:54:39 +0800
From: "ying.huang@...el.com" <ying.huang@...el.com>
To: Mel Gorman <mgorman@...hsingularity.net>,
kernel test robot <oliver.sang@...el.com>
Cc: 0day robot <lkp@...el.com>, LKML <linux-kernel@...r.kernel.org>,
lkp@...ts.01.org, feng.tang@...el.com,
zhengjun.xing@...ux.intel.com, fengwei.yin@...el.com,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Valentin Schneider <valentin.schneider@....com>,
Aubrey Li <aubrey.li@...ux.intel.com>, yu.c.chen@...el.com
Subject: Re: [sched/numa] bb2dee337b: unixbench.score -11.2% regression
Hi, Mel,
On Wed, 2022-05-18 at 16:22 +0100, Mel Gorman wrote:
> On Wed, May 18, 2022 at 05:24:14PM +0800, kernel test robot wrote:
> >
> >
> > Greeting,
> >
> > FYI, we noticed a -11.2% regression of unixbench.score due to commit:
> >
> >
> > commit: bb2dee337bd7d314eb7c7627e1afd754f86566bc ("[PATCH 3/4] sched/numa: Apply imbalance limitations consistently")
> > url: https://github.com/intel-lab-lkp/linux/commits/Mel-Gorman/Mitigate-inconsistent-NUMA-imbalance-behaviour/20220511-223233
> > base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git d70522fc541224b8351ac26f4765f2c6268f8d72
> > patch link: https://lore.kernel.org/lkml/20220511143038.4620-4-mgorman@techsingularity.net
> >
> > in testcase: unixbench
> > on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz with 256G memory
> > with following parameters:
> >
> > runtime: 300s
> > nr_task: 1
> > test: shell8
> > cpufreq_governor: performance
> > ucode: 0xd000331
> >
> > test-description: UnixBench is the original BYTE UNIX benchmark suite, which aims to test the performance of Unix-like systems.
> > test-url: https://github.com/kdlucas/byte-unixbench
>
> I think what is happening for unixbench is that it prefers to run all
> instances on a local node if possible. shell8 is creating 8 scripts,
> each of which spawn more processes. The total number of tasks may exceed
> the allowed imbalance at fork time of 16 tasks. Some spill over to a
> remote node and as they are using files, some accesses are remote and it
> suffers. It's not memory bandwidth bound but is sensitive to locality.
> The stats somewhat support this idea
>
> > 83590 ± 13% -73.7% 21988 ± 32% numa-meminfo.node0.AnonHugePages
> > 225657 ± 18% -58.0% 94847 ± 18% numa-meminfo.node0.AnonPages
> > 231652 ± 17% -55.3% 103657 ± 16% numa-meminfo.node0.AnonPages.max
> > 234525 ± 17% -55.5% 104341 ± 18% numa-meminfo.node0.Inactive
> > 234397 ± 17% -55.5% 104267 ± 18% numa-meminfo.node0.Inactive(anon)
> > 11724 ± 7% +17.5% 13781 ± 5% numa-meminfo.node0.KernelStack
> > 4472 ± 34% +117.1% 9708 ± 31% numa-meminfo.node0.PageTables
> > 15239 ± 75% +401.2% 76387 ± 10% numa-meminfo.node1.AnonHugePages
> > 67256 ± 63% +206.3% 205994 ± 6% numa-meminfo.node1.AnonPages
> > 73568 ± 58% +193.1% 215644 ± 6% numa-meminfo.node1.AnonPages.max
> > 75737 ± 53% +183.9% 215053 ± 6% numa-meminfo.node1.Inactive
> > 75709 ± 53% +183.9% 214971 ± 6% numa-meminfo.node1.Inactive(anon)
> > 3559 ± 42% +187.1% 10216 ± 8% numa-meminfo.node1.PageTables
>
> There is less memory used on one node and more on the other so it's
> getting split.
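
To make sure I follow the fork-time spill-over you describe, below is a
rough sketch of my understanding (purely illustrative; the function, the
workload shape and the numbers are my own reading, not the code from
your patch):

	/*
	 * Illustrative sketch only -- not the code from the patch.
	 * shell8 starts 8 scripts, each of which spawns more children.
	 * While the local node runs fewer tasks than the allowed fork-time
	 * imbalance (16 in this discussion), children stay local; the rest
	 * spill over to a remote node and then pay for remote file accesses.
	 */
	#include <stdio.h>

	#define ALLOWED_FORK_IMBALANCE	16

	/* return 0 to place the child locally, 1 to spill to a remote node */
	static int place_child(int local_running)
	{
		return local_running < ALLOWED_FORK_IMBALANCE ? 0 : 1;
	}

	int main(void)
	{
		int task, local_running = 0, remote = 0;

		/* assumed workload shape: 8 scripts x 3 children each = 24 forks */
		for (task = 0; task < 24; task++) {
			if (place_child(local_running) == 0)
				local_running++;
			else
				remote++;
		}

		/* prints "local: 16 tasks, remote: 8 tasks"; the remote 8
		 * are the ones that suffer from remote accesses. */
		printf("local: %d tasks, remote: %d tasks\n", local_running, remote);
		return 0;
	}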
This makes sense. I will also check the per-node CPU utilization to
verify this directly (a rough sketch of how I plan to measure it is below).
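
For reference, the rough user-space sampler I have in mind is below (my
own sketch, not part of the kernel or the 0day tooling); it maps each
CPU to its node via the cpu* entries under /sys/devices/system/node/
and diffs two /proc/stat samples:

	/*
	 * Rough per-node CPU utilization sampler (user space, illustrative).
	 * It maps each cpuN line in /proc/stat to a node through the cpuN
	 * entries under /sys/devices/system/node/nodeM/, takes two samples
	 * and prints the busy fraction per node over the interval.
	 */
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define MAX_NODES	8	/* enough for this 2-socket machine */

	static int cpu_to_node(int cpu)
	{
		char path[128];
		int node;

		for (node = 0; node < MAX_NODES; node++) {
			snprintf(path, sizeof(path),
				 "/sys/devices/system/node/node%d/cpu%d", node, cpu);
			if (access(path, F_OK) == 0)
				return node;
		}
		return -1;	/* also skips the aggregate "cpu" line */
	}

	static void sample(unsigned long long *busy, unsigned long long *total)
	{
		char line[512];
		FILE *fp = fopen("/proc/stat", "r");

		if (!fp)
			exit(1);
		while (fgets(line, sizeof(line), fp)) {
			unsigned long long v[10] = { 0 };
			int cpu, node, i;

			if (sscanf(line, "cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
				   &cpu, &v[0], &v[1], &v[2], &v[3], &v[4],
				   &v[5], &v[6], &v[7], &v[8], &v[9]) < 5)
				continue;
			node = cpu_to_node(cpu);
			if (node < 0)
				continue;
			/* busy = user + nice + system + irq + softirq + steal */
			busy[node] += v[0] + v[1] + v[2] + v[5] + v[6] + v[7];
			for (i = 0; i < 8; i++)	/* exclude guest fields */
				total[node] += v[i];
		}
		fclose(fp);
	}

	int main(void)
	{
		unsigned long long busy1[MAX_NODES] = { 0 }, total1[MAX_NODES] = { 0 };
		unsigned long long busy2[MAX_NODES] = { 0 }, total2[MAX_NODES] = { 0 };
		int node;

		sample(busy1, total1);
		sleep(5);		/* sampling interval, seconds */
		sample(busy2, total2);

		for (node = 0; node < MAX_NODES; node++) {
			unsigned long long db = busy2[node] - busy1[node];
			unsigned long long dt = total2[node] - total1[node];

			if (dt)
				printf("node%d: %.1f%% busy\n", node, 100.0 * db / dt);
		}
		return 0;
	}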
>
> > In addition to that, the commit also has significant impact on the following tests:
> >
> > +------------------+-------------------------------------------------------------------------------------+
> > > testcase: change | fsmark: fsmark.files_per_sec -21.5% regression |
> > > test machine | 192 threads 4 sockets Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory |
> > > test parameters | cpufreq_governor=performance |
> > > | disk=1SSD |
> > > | filesize=8K |
> > > | fs=f2fs |
> > > | iterations=8 |
> > > | nr_directories=16d |
> > > | nr_files_per_directory=256fpd |
> > > | nr_threads=4 |
> > > | sync_method=fsyncBeforeClose |
> > > | test_size=72G |
> > > | ucode=0x500320a |
> > +------------------+-------------------------------------------------------------------------------------+
> >
>
> It's less clear-cut from the stats for this one, but it is likely also
> being split despite preferring locality. It's curious that f2fs is
> affected, but maybe other filesystems were too.
>
> In both cases, the workloads are not memory bandwidth limited so prefer to
> stack on one node and previously, because they were cache hot, the load
> balancer would avoid splitting them apart if there were other candidates
> available.
>
> This is a tradeoff between loads that want to stick on one node for
> locality because they are not bandwidth limited and workloads that are
> memory bandwidth limited and want to spread wide. We can't tell what
> type of workload it is at fork time.
>
> Given there is no crystal ball and it's a tradeoff, I think it's better
> to be consistent and use similar logic at both fork time and runtime even
> if it doesn't have universal benefit.
>
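
And just to confirm I read the intent correctly: both the fork-time
placement path and the load balancer should consult the same imbalance
allowance, roughly along the lines of the sketch below (the names and
the /4 policy are invented for illustration, not taken from the patch):

	/*
	 * Illustrative sketch of "apply the imbalance limitation consistently":
	 * a single helper decides how much NUMA imbalance is tolerated, and
	 * both the fork-time placement path and the load balancer consult it.
	 * All names and the /4 policy are invented for illustration.
	 */
	#include <stdbool.h>
	#include <stdio.h>

	static int allowed_numa_imbalance(int nr_local_cpus)
	{
		/* tolerate a small imbalance relative to the node size so
		 * cache-hot, non-bandwidth-bound tasks can stay local */
		return nr_local_cpus / 4;
	}

	/* fork/exec time: keep the child local while within the allowance */
	static bool keep_child_local(int local_running, int nr_local_cpus)
	{
		return local_running < allowed_numa_imbalance(nr_local_cpus);
	}

	/* load-balance time: only treat the nodes as imbalanced once the
	 * difference exceeds the same allowance, so the balancer does not
	 * immediately undo the fork-time decision */
	static bool nodes_imbalanced(int busiest_running, int local_running,
				     int nr_local_cpus)
	{
		return busiest_running - local_running >
		       allowed_numa_imbalance(nr_local_cpus);
	}

	int main(void)
	{
		int cpus_per_node = 64;		/* assumed: 128 CPUs, 2 nodes */

		/* fork time: the 17th task on an otherwise idle node spills */
		printf("keep 16th child local: %d\n",
		       keep_child_local(15, cpus_per_node));
		printf("keep 17th child local: %d\n",
		       keep_child_local(16, cpus_per_node));
		/* load balance: 16 vs 0 is still within the same allowance */
		printf("rebalance 16 vs 0: %d\n",
		       nodes_imbalanced(16, 0, cpus_per_node));
		return 0;
	}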
Thanks for the detailed explanation. So some other workloads may benefit
from this patch. Can you give me some candidates so that I can test them too?
Best Regards,
Huang, Ying