Message-ID: <87bk3w2he5.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Thu, 20 Jun 2024 15:38:26 +0800
From: "Huang, Ying" <ying.huang@...el.com>
To: Baolin Wang <baolin.wang@...ux.alibaba.com>
Cc: kernel test robot <oliver.sang@...el.com>, <oe-lkp@...ts.linux.dev>,
<lkp@...el.com>, <linux-kernel@...r.kernel.org>, Andrew Morton
<akpm@...ux-foundation.org>, David Hildenbrand <david@...hat.com>, John
Hubbard <jhubbard@...dia.com>, Kefeng Wang <wangkefeng.wang@...wei.com>,
Mel Gorman <mgorman@...hsingularity.net>, Ryan Roberts
<ryan.roberts@....com>, <linux-mm@...ck.org>, <feng.tang@...el.com>,
<fengwei.yin@...el.com>
Subject: Re: [linus:master] [mm] d2136d749d: vm-scalability.throughput -7.1%
regression
Baolin Wang <baolin.wang@...ux.alibaba.com> writes:
> On 2024/6/20 10:39, kernel test robot wrote:
>> Hello,
>> kernel test robot noticed a -7.1% regression of
>> vm-scalability.throughput on:
>> commit: d2136d749d76af980b3accd72704eea4eab625bd ("mm: support
>> multi-size THP numa balancing")
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>> [still regression on linus/master
>> 92e5605a199efbaee59fb19e15d6cc2103a04ec2]
>> testcase: vm-scalability
>> test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
>> parameters:
>> runtime: 300s
>> size: 512G
>> test: anon-cow-rand-hugetlb
>> cpufreq_governor: performance
>
> Thanks for reporting. IIUC, NUMA balancing will not scan hugetlb VMAs,
> so I'm not sure how this patch affects the performance of hugetlb COW,
> but let me try to reproduce it.
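(For reference, one quick way to check that on a live system is to keep a
hugetlb mapping busy and watch numa_pte_updates in /proc/vmstat.  Below is
only a minimal sketch, not a proper test: it assumes 2M hugepages are
reserved via vm.nr_hugepages, numa_balancing is enabled, and the machine is
otherwise idle, since the counter is system-wide.

/*
 * hugetlb-scan-check.c: touch a hugetlb mapping for a while and report
 * how much numa_pte_updates grew.  If NUMA balancing really skips
 * hugetlb VMAs, the delta should stay near zero on an idle machine.
 * Build: gcc -O2 hugetlb-scan-check.c -o hugetlb-scan-check
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

static long vmstat_read(const char *key)
{
	char line[256];
	long val = -1;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, key, strlen(key))) {
			val = atol(line + strlen(key));
			break;
		}
	}
	fclose(f);
	return val;
}

int main(void)
{
	size_t len = 64UL << 20;	/* 64M of 2M hugetlb pages */
	long before, after;
	char *p;
	int i;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");
		return 1;
	}

	before = vmstat_read("numa_pte_updates ");
	/* keep the pages hot across several NUMA balancing scan windows */
	for (i = 0; i < 30; i++) {
		memset(p, i, len);
		sleep(1);
	}
	after = vmstat_read("numa_pte_updates ");

	printf("numa_pte_updates delta: %ld\n", after - before);
	return 0;
}
)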
>
>
>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>> the same patch/commit), kindly add the following tags
>> | Reported-by: kernel test robot <oliver.sang@...el.com>
>> | Closes: https://lore.kernel.org/oe-lkp/202406201010.a1344783-oliver.sang@intel.com
>> Details are as below:
>> -------------------------------------------------------------------------------------------------->
>> The kernel config and materials to reproduce are available at:
>> https://download.01.org/0day-ci/archive/20240620/202406201010.a1344783-oliver.sang@intel.com
>> =========================================================================================
>> compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
>> gcc-13/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/300s/512G/lkp-icl-2sp2/anon-cow-rand-hugetlb/vm-scalability
>> commit:
>> 6b0ed7b3c7 ("mm: factor out the numa mapping rebuilding into a new helper")
>> d2136d749d ("mm: support multi-size THP numa balancing")
>> 6b0ed7b3c77547d2 d2136d749d76af980b3accd7270
>> ---------------- ---------------------------
>> %stddev %change %stddev
>> \ | \
>> 12.02 -1.3 10.72 ± 4% mpstat.cpu.all.sys%
>> 1228757 +3.0% 1265679 proc-vmstat.pgfault
Also from other proc-vmstat stats,
  21770 ±  36%      +6.1%      23098 ±  28%  proc-vmstat.numa_hint_faults
   6168 ± 107%     +48.8%       9180 ±  18%  proc-vmstat.numa_hint_faults_local
 154537 ±  15%     +23.5%     190883 ±  17%  proc-vmstat.numa_pte_updates
After your patch, more hint page faults occur; I think this is expected.
Then, tasks may be moved between sockets because of that, so that some
hugetlb page accesses become remote?
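One crude way to check the task-movement part of this theory is to run an
unpinned busy thread beside vm-scalability and count how often the
scheduler lands it on a different NUMA node.  A sketch using libnuma
(numa_node_of_cpu(); link with -lnuma) -- only an illustration, the file
name and the spin/duration values are made up:

/*
 * node-bounce.c: let an unpinned busy thread run for a minute and count
 * how often the scheduler moves it to a different NUMA node.
 * Build: gcc -O2 node-bounce.c -o node-bounce -lnuma
 */
#define _GNU_SOURCE		/* for sched_getcpu() */
#include <stdio.h>
#include <sched.h>
#include <time.h>
#include <numa.h>

int main(void)
{
	time_t end = time(NULL) + 60;
	long switches = 0;
	int last = -1;

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support\n");
		return 1;
	}
	while (time(NULL) < end) {
		volatile unsigned long spin;
		int node;

		/* burn CPU so the load balancer has a reason to act */
		for (spin = 0; spin < (1UL << 24); spin++)
			;
		node = numa_node_of_cpu(sched_getcpu());
		if (last >= 0 && node != last)
			switches++;
		last = node;
	}
	printf("cross-node moves observed: %ld\n", switches);
	return 0;
}

If the count is clearly higher with d2136d749d than with 6b0ed7b3c7, that
would support the remote hugetlb access theory.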
>> 7392513 -7.1% 6865649 vm-scalability.throughput
>> 17356 +9.4% 18986 vm-scalability.time.user_time
>> 0.32 ± 22% -36.9% 0.20 ± 17% sched_debug.cfs_rq:/.h_nr_running.stddev
>> 28657 ± 86% -90.8% 2640 ± 19% sched_debug.cfs_rq:/.load.stddev
>> 0.28 ± 35% -52.1% 0.13 ± 29% sched_debug.cfs_rq:/.nr_running.stddev
>> 299.88 ± 27% -39.6% 181.04 ± 23% sched_debug.cfs_rq:/.runnable_avg.stddev
>> 284.88 ± 32% -44.0% 159.65 ± 27% sched_debug.cfs_rq:/.util_avg.stddev
>> 0.32 ± 22% -37.2% 0.20 ± 17% sched_debug.cpu.nr_running.stddev
>> 1.584e+10 ± 2% -6.9% 1.476e+10 ± 3% perf-stat.i.branch-instructions
>> 11673151 ± 3% -6.3% 10935072 ± 4% perf-stat.i.branch-misses
>> 4.90 +3.5% 5.07 perf-stat.i.cpi
>> 333.40 +7.5% 358.32 perf-stat.i.cycles-between-cache-misses
>> 6.787e+10 ± 2% -6.8% 6.324e+10 ± 3% perf-stat.i.instructions
>> 0.25 -6.2% 0.24 perf-stat.i.ipc
>> 4.19 +7.5% 4.51 perf-stat.overall.cpi
>> 323.02 +7.4% 346.94 perf-stat.overall.cycles-between-cache-misses
>> 0.24 -7.0% 0.22 perf-stat.overall.ipc
>> 1.549e+10 ± 2% -6.8% 1.444e+10 ± 3% perf-stat.ps.branch-instructions
>> 6.634e+10 ± 2% -6.7% 6.186e+10 ± 3% perf-stat.ps.instructions
>> 17.33 ± 77% -10.6 6.72 ±169% perf-profile.calltrace.cycles-pp.asm_exc_page_fault.do_access
>> 17.30 ± 77% -10.6 6.71 ±169% perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.do_access
>> 17.30 ± 77% -10.6 6.71 ±169% perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
>> 17.28 ± 77% -10.6 6.70 ±169% perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
>> 17.27 ± 77% -10.6 6.70 ±169% perf-profile.calltrace.cycles-pp.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
>> 13.65 ± 76% -8.4 5.29 ±168% perf-profile.calltrace.cycles-pp.hugetlb_wp.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
>> 13.37 ± 76% -8.2 5.18 ±168% perf-profile.calltrace.cycles-pp.copy_user_large_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault.do_user_addr_fault
>> 13.35 ± 76% -8.2 5.18 ±168% perf-profile.calltrace.cycles-pp.copy_subpage.copy_user_large_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault
>> 13.23 ± 76% -8.1 5.13 ±168% perf-profile.calltrace.cycles-pp.copy_mc_enhanced_fast_string.copy_subpage.copy_user_large_folio.hugetlb_wp.hugetlb_fault
>> 3.59 ± 78% -2.2 1.39 ±169% perf-profile.calltrace.cycles-pp.__mutex_lock.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
>> 17.35 ± 77% -10.6 6.73 ±169% perf-profile.children.cycles-pp.asm_exc_page_fault
>> 17.32 ± 77% -10.6 6.72 ±168% perf-profile.children.cycles-pp.do_user_addr_fault
>> 17.32 ± 77% -10.6 6.72 ±168% perf-profile.children.cycles-pp.exc_page_fault
>> 17.30 ± 77% -10.6 6.71 ±168% perf-profile.children.cycles-pp.handle_mm_fault
>> 17.28 ± 77% -10.6 6.70 ±169% perf-profile.children.cycles-pp.hugetlb_fault
>> 13.65 ± 76% -8.4 5.29 ±168% perf-profile.children.cycles-pp.hugetlb_wp
>> 13.37 ± 76% -8.2 5.18 ±168% perf-profile.children.cycles-pp.copy_user_large_folio
>> 13.35 ± 76% -8.2 5.18 ±168% perf-profile.children.cycles-pp.copy_subpage
>> 13.34 ± 76% -8.2 5.17 ±168% perf-profile.children.cycles-pp.copy_mc_enhanced_fast_string
>> 3.59 ± 78% -2.2 1.39 ±169% perf-profile.children.cycles-pp.__mutex_lock
>> 13.24 ± 76% -8.1 5.13 ±168% perf-profile.self.cycles-pp.copy_mc_enhanced_fast_string
>> Disclaimer:
>> Results have been estimated based on internal Intel analysis and are provided
>> for informational purposes only. Any difference in system hardware or software
>> design or configuration may affect actual performance.
>>
--
Best Regards,
Huang, Ying