linux-kernel - Re: [linus:master] [mm] d2136d749d: vm-scalability.throughput -7.1% regression

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <67930dc6-e9e4-44f2-8f10-74325a21b1d5@linux.alibaba.com>
Date: Thu, 20 Jun 2024 16:44:26 +0800
From: Baolin Wang <baolin.wang@...ux.alibaba.com>
To: "Huang, Ying" <ying.huang@...el.com>
Cc: kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev,
 lkp@...el.com, linux-kernel@...r.kernel.org,
 Andrew Morton <akpm@...ux-foundation.org>,
 David Hildenbrand <david@...hat.com>, John Hubbard <jhubbard@...dia.com>,
 Kefeng Wang <wangkefeng.wang@...wei.com>,
 Mel Gorman <mgorman@...hsingularity.net>, Ryan Roberts
 <ryan.roberts@....com>, linux-mm@...ck.org, feng.tang@...el.com,
 fengwei.yin@...el.com
Subject: Re: [linus:master] [mm] d2136d749d: vm-scalability.throughput -7.1%
 regression



On 2024/6/20 15:38, Huang, Ying wrote:
> Baolin Wang <baolin.wang@...ux.alibaba.com> writes:
> 
>> On 2024/6/20 10:39, kernel test robot wrote:
>>> Hello,
>>> kernel test robot noticed a -7.1% regression of
>>> vm-scalability.throughput on:
>>> commit: d2136d749d76af980b3accd72704eea4eab625bd ("mm: support
>>> multi-size THP numa balancing")
>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>> [still regression on linus/master
>>> 92e5605a199efbaee59fb19e15d6cc2103a04ec2]
>>> testcase: vm-scalability
>>> test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
>>> parameters:
>>> 	runtime: 300s
>>> 	size: 512G
>>> 	test: anon-cow-rand-hugetlb
>>> 	cpufreq_governor: performance
>>
>> Thanks for reporting. IIUC numa balancing will not scan hugetlb VMA,
>> I'm not sure how this patch affects the performance of hugetlb cow,
>> but let me try to reproduce it.
>>
>>
>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>>> the same patch/commit), kindly add following tags
>>> | Reported-by: kernel test robot <oliver.sang@...el.com>
>>> | Closes: https://lore.kernel.org/oe-lkp/202406201010.a1344783-oliver.sang@intel.com
>>> Details are as below:
>>> -------------------------------------------------------------------------------------------------->
>>> The kernel config and materials to reproduce are available at:
>>> https://download.01.org/0day-ci/archive/20240620/202406201010.a1344783-oliver.sang@intel.com
>>> =========================================================================================
>>> compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
>>>     gcc-13/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/300s/512G/lkp-icl-2sp2/anon-cow-rand-hugetlb/vm-scalability
>>> commit:
>>>     6b0ed7b3c7 ("mm: factor out the numa mapping rebuilding into a new helper")
>>>     d2136d749d ("mm: support multi-size THP numa balancing")
>>> 6b0ed7b3c77547d2 d2136d749d76af980b3accd7270
>>> ---------------- ---------------------------
>>>            %stddev     %change         %stddev
>>>                \          |                \
>>>        12.02            -1.3       10.72 ±  4%  mpstat.cpu.all.sys%
>>>      1228757            +3.0%    1265679        proc-vmstat.pgfault
> 
> Also from other proc-vmstat stats,
> 
>       21770  36%      +6.1%      23098  28%  proc-vmstat.numa_hint_faults
>        6168 107%     +48.8%       9180  18%  proc-vmstat.numa_hint_faults_local
>      154537  15%     +23.5%     190883  17%  proc-vmstat.numa_pte_updates
> 
> After your patch, more hint page faults occurs, I think this is expected.

This is exactly my confusion, why are there more numa hint faults? The 
hugetlb VMAs will be skipped from scanning, so other VMAs of the 
application will use mTHP or large folio?

> Then, tasks may be moved between sockets because of that, so that some
> hugetlb page access becomes remote?

Yes, that is possible if the application uses some large folio.

>>>      7392513            -7.1%    6865649        vm-scalability.throughput
>>>        17356            +9.4%      18986        vm-scalability.time.user_time
>>>         0.32 ± 22%     -36.9%       0.20 ± 17%  sched_debug.cfs_rq:/.h_nr_running.stddev
>>>        28657 ± 86%     -90.8%       2640 ± 19%  sched_debug.cfs_rq:/.load.stddev
>>>         0.28 ± 35%     -52.1%       0.13 ± 29%  sched_debug.cfs_rq:/.nr_running.stddev
>>>       299.88 ± 27%     -39.6%     181.04 ± 23%  sched_debug.cfs_rq:/.runnable_avg.stddev
>>>       284.88 ± 32%     -44.0%     159.65 ± 27%  sched_debug.cfs_rq:/.util_avg.stddev
>>>         0.32 ± 22%     -37.2%       0.20 ± 17%  sched_debug.cpu.nr_running.stddev
>>>    1.584e+10 ±  2%      -6.9%  1.476e+10 ±  3%  perf-stat.i.branch-instructions
>>>     11673151 ±  3%      -6.3%   10935072 ±  4%  perf-stat.i.branch-misses
>>>         4.90            +3.5%       5.07        perf-stat.i.cpi
>>>       333.40            +7.5%     358.32        perf-stat.i.cycles-between-cache-misses
>>>    6.787e+10 ±  2%      -6.8%  6.324e+10 ±  3%  perf-stat.i.instructions
>>>         0.25            -6.2%       0.24        perf-stat.i.ipc
>>>         4.19            +7.5%       4.51        perf-stat.overall.cpi
>>>       323.02            +7.4%     346.94        perf-stat.overall.cycles-between-cache-misses
>>>         0.24            -7.0%       0.22        perf-stat.overall.ipc
>>>    1.549e+10 ±  2%      -6.8%  1.444e+10 ±  3%  perf-stat.ps.branch-instructions
>>>    6.634e+10 ±  2%      -6.7%  6.186e+10 ±  3%  perf-stat.ps.instructions
>>>        17.33 ± 77%     -10.6        6.72 ±169%  perf-profile.calltrace.cycles-pp.asm_exc_page_fault.do_access
>>>        17.30 ± 77%     -10.6        6.71 ±169%  perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.do_access
>>>        17.30 ± 77%     -10.6        6.71 ±169%  perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
>>>        17.28 ± 77%     -10.6        6.70 ±169%  perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
>>>        17.27 ± 77%     -10.6        6.70 ±169%  perf-profile.calltrace.cycles-pp.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
>>>        13.65 ± 76%      -8.4        5.29 ±168%  perf-profile.calltrace.cycles-pp.hugetlb_wp.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
>>>        13.37 ± 76%      -8.2        5.18 ±168%  perf-profile.calltrace.cycles-pp.copy_user_large_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault.do_user_addr_fault
>>>        13.35 ± 76%      -8.2        5.18 ±168%  perf-profile.calltrace.cycles-pp.copy_subpage.copy_user_large_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault
>>>        13.23 ± 76%      -8.1        5.13 ±168%  perf-profile.calltrace.cycles-pp.copy_mc_enhanced_fast_string.copy_subpage.copy_user_large_folio.hugetlb_wp.hugetlb_fault
>>>         3.59 ± 78%      -2.2        1.39 ±169%  perf-profile.calltrace.cycles-pp.__mutex_lock.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
>>>        17.35 ± 77%     -10.6        6.73 ±169%  perf-profile.children.cycles-pp.asm_exc_page_fault
>>>        17.32 ± 77%     -10.6        6.72 ±168%  perf-profile.children.cycles-pp.do_user_addr_fault
>>>        17.32 ± 77%     -10.6        6.72 ±168%  perf-profile.children.cycles-pp.exc_page_fault
>>>        17.30 ± 77%     -10.6        6.71 ±168%  perf-profile.children.cycles-pp.handle_mm_fault
>>>        17.28 ± 77%     -10.6        6.70 ±169%  perf-profile.children.cycles-pp.hugetlb_fault
>>>        13.65 ± 76%      -8.4        5.29 ±168%  perf-profile.children.cycles-pp.hugetlb_wp
>>>        13.37 ± 76%      -8.2        5.18 ±168%  perf-profile.children.cycles-pp.copy_user_large_folio
>>>        13.35 ± 76%      -8.2        5.18 ±168%  perf-profile.children.cycles-pp.copy_subpage
>>>        13.34 ± 76%      -8.2        5.17 ±168%  perf-profile.children.cycles-pp.copy_mc_enhanced_fast_string
>>>         3.59 ± 78%      -2.2        1.39 ±169%  perf-profile.children.cycles-pp.__mutex_lock
>>>        13.24 ± 76%      -8.1        5.13 ±168%  perf-profile.self.cycles-pp.copy_mc_enhanced_fast_string
>>> Disclaimer:
>>> Results have been estimated based on internal Intel analysis and are provided
>>> for informational purposes only. Any difference in system hardware or software
>>> design or configuration may affect actual performance.
>>>
> 
> --
> Best Regards,
> Huang, Ying