Message-ID: <82068e6a-937b-43db-8496-76fdf3158080@arm.com>
Date: Wed, 6 Dec 2023 10:08:00 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: Kefeng Wang <wangkefeng.wang@...wei.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Matthew Wilcox <willy@...radead.org>,
Yin Fengwei <fengwei.yin@...el.com>,
David Hildenbrand <david@...hat.com>,
Yu Zhao <yuzhao@...gle.com>,
Catalin Marinas <catalin.marinas@....com>,
Anshuman Khandual <anshuman.khandual@....com>,
Yang Shi <shy828301@...il.com>,
"Huang, Ying" <ying.huang@...el.com>, Zi Yan <ziy@...dia.com>,
Luis Chamberlain <mcgrof@...nel.org>,
Itaru Kitayama <itaru.kitayama@...il.com>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
John Hubbard <jhubbard@...dia.com>,
David Rientjes <rientjes@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>,
Hugh Dickins <hughd@...gle.com>,
Barry Song <21cnbao@...il.com>,
Alistair Popple <apopple@...dia.com>
Cc: linux-mm@...ck.org, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v8 00/10] Multi-size THP for anonymous memory
On 05/12/2023 14:19, Kefeng Wang wrote:
>
>
> On 2023/12/4 18:20, Ryan Roberts wrote:
>> Hi All,
>>
>> A new week, a new version, a new name... This is v8 of a series to implement
>> multi-size THP (mTHP) for anonymous memory (previously called "small-sized THP"
>> and "large anonymous folios"). Matthew objected to "small huge" so hopefully
>> this fares better.
>>
>> The objective of this is to improve performance by allocating larger chunks of
>> memory during anonymous page faults:
>>
>> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
>> pages, there are efficiency savings to be had: fewer page faults, batched PTE
>> and RMAP manipulation, shorter lru lists, etc. In short, we reduce kernel
>> overhead. This should benefit all architectures.
>> 2) Since we are now mapping physically contiguous chunks of memory, we can take
>> advantage of HW TLB compression techniques. A reduction in TLB pressure
>> speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>> TLB entries: "the contiguous bit" (architectural) and HPA (uarch).
>>
>> This version changes the name and tidies up some of the kernel code and test
>> code, based on feedback against v7 (see change log for details).
>>
>> By default, the existing behaviour (and performance) is maintained. The user
>> must explicitly enable multi-size THP to see the performance benefit. This is
>> done via a new sysfs interface (as recommended by David Hildenbrand - thanks to
>> David for the suggestion)! This interface is inspired by the existing
>> per-hugepage-size sysfs interface used by hugetlb, provides full backwards
>> compatibility with the existing PMD-size THP interface, and provides a base for
>> future extensibility. See [8] for detailed discussion of the interface.
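>>
>> As an illustration, enabling 64K mTHP (while leaving the other sizes at
>> their defaults) would look something like the below; knob names are per
>> the interface discussion at [8], so treat this as a sketch rather than
>> authoritative documentation:
>>
>>   # PMD-size THP keeps its existing top-level control:
>>   echo always > /sys/kernel/mm/transparent_hugepage/enabled
>>
>>   # New per-size control for 64K multi-size THP:
>>   echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled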
>>
>> This series is based on mm-unstable (715b67adf4c8).
>>
>>
>> Prerequisites
>> =============
>>
>> Some work items identified as being prerequisites are listed on page 3 at [9].
>> The summary is:
>>
>> | item | status |
>> |:------------------------------|:------------------------|
>> | mlock | In mainline (v6.7) |
>> | madvise | In mainline (v6.6) |
>> | compaction | v1 posted [10] |
>> | numa balancing | Investigated: see below |
>> | user-triggered page migration | In mainline (v6.7) |
>> | khugepaged collapse | In mainline (NOP) |
>>
>> On NUMA balancing, which currently ignores any PTE-mapped THPs it encounters:
>> John Hubbard has investigated this and concluded that A) it is not clear at
>> the moment what a better policy might be for PTE-mapped THP, and B) he
>> questions whether this should really be considered a prerequisite, given that
>> no regression is caused for the default "multi-size THP disabled" case, and
>> there is no correctness issue when it is enabled - it's just a potential for
>> non-optimal performance.
>>
>> If there are no disagreements about removing numa balancing from the list (none
>> were raised when I first posted this comment against v7), then that just leaves
>> compaction which is in review on list at the moment.
>>
>> I really would like to get this series (and its remaining compaction
>> prerequisite) in for v6.8. I accept that it may be a bit optimistic at this
>> point, but let's see where we get to with review.
>>
>>
>> Testing
>> =======
>>
>> The series includes patches for mm selftests to enlighten the cow and khugepaged
>> tests to explicitly test with multi-size THP, in the same way that PMD-sized
>> THP is tested. The new tests all pass, and no regressions are observed in the mm
>> selftest suite. I've also run my usual kernel compilation and JavaScript
>> benchmarks without any issues.
>>
>> Refer to my performance numbers posted with v6 [6]. (These are for multi-size
>> THP only - they do not include the arm64 contpte follow-on series).
>>
>> John Hubbard at Nvidia has indicated dramatic 10x performance improvements for
>> some workloads at [11]. (Observed using v6 of this series as well as the arm64
>> contpte series).
>>
>> Kefeng Wang at Huawei has also indicated he sees improvements at [12] although
>> there are some latency regressions also.
>
> Hi Ryan,
>
> Here is some test results based on v6.7-rc1 +
> [PATCH v7 00/10] Small-sized THP for anonymous memory +
> [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings
>
> case1: basepage 64K
> case2: basepage 4K + thp=64k + PAGE_ALLOC_COSTLY_ORDER = 3
> case3: basepage 4K + thp=64k + PAGE_ALLOC_COSTLY_ORDER = 4
Thanks for sharing these results. With the exception of a few outliers, it looks
like the rough conclusion is that bandwidth improves, but not by as much as 64K
base pages, and latency regresses, but also not by as much as 64K base pages?

I expect that over time, as we add more optimizations, we will get bandwidth
closer to 64K base pages; one crucial piece is getting executable file-backed
memory into contpte mappings, for example.

It's probably not time to switch PAGE_ALLOC_COSTLY_ORDER quite yet, but it's
something to keep an eye on and consider down the road?
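
For reference, my understanding of what case2 vs case3 is varying is the
below (a sketch only, not taken from Kefeng's actual change):

  /*
   * include/linux/mmzone.h: orders above PAGE_ALLOC_COSTLY_ORDER are
   * deemed "costly" by the page allocator, which is then less willing
   * to reclaim/compact on their behalf.
   */
  #define PAGE_ALLOC_COSTLY_ORDER 3

  /*
   * case3 presumably bumps this to 4 so that order-4 allocations
   * (64K folios with 4K base pages) are still treated as non-costly.
   */
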
Thanks,
Ryan
>
> The results are compared against 4K base pages on Kunpeng920.
>
> Note,
> - The tests are based on an ext4 filesystem, and 2M THP is disabled.
> - The results have not been analyzed in depth and are for reference only,
>   as some test-item values are not consistent.
>
> 1) Unixbench 1core
> Index_Values_1core case1 case2 case3
> Dhrystone_2_using_register_variables 0.28% 0.39% 0.17%
> Double-Precision_Whetstone -0.01% 0.00% 0.00%
> Execl_Throughput *21.13%* 2.16% 3.01%
> File_Copy_1024_bufsize_2000_maxblocks -0.51% *8.33%* *8.76%*
> File_Copy_256_bufsize_500_maxblocks 0.78% *11.89%* *10.85%*
> File_Copy_4096_bufsize_8000_maxblocks 7.42% 7.27% *10.66%*
> Pipe_Throughput -0.24% *6.82%* *5.08%*
> Pipe-based_Context_Switching 1.38% *13.49%* *9.91%*
> Process_Creation *32.46%* 4.30% *8.54%*
> Shell_Scripts_(1_concurrent) *31.67%* 1.92% 2.60%
> Shell_Scripts_(8_concurrent) *40.59%* 1.30% *5.29%*
> System_Call_Overhead 3.92% *8.13%* 2.96%
>
> System_Benchmarks_Index_Score 10.66% 5.39% 5.58%
>
> For 1core,
> - case1 wins by a lot on Execl_Throughput/Process_Creation/Shell_Scripts,
>   and its overall score is 10.66% higher than 4K base pages.
> - case2/3 win on File_Copy/Pipe and score 5%+ higher than 4K base pages;
>   case3 also looks better than case2 on Shell_Scripts_(8_concurrent).
>
> 2) Unixbench 128core
> Index_Values_128core case1 case2 case3
> Dhrystone_2_using_register_variables 2.07% -0.03% -0.11%
> Double-Precision_Whetstone -0.03% 0.00% 0.00%
> Execl_Throughput *39.28%* -4.23% 1.93%
> File_Copy_1024_bufsize_2000_maxblocks 5.46% 1.30% 4.20%
> File_Copy_256_bufsize_500_maxblocks -8.89% *6.56%* *5.02%*
> File_Copy_4096_bufsize_8000_maxblocks 3.43% *-5.46%* 0.56%
> Pipe_Throughput 3.80% *7.69%* *7.80%*
> Pipe-based_Context_Switching *7.62%* 0.95% 4.69%
> Process_Creation *28.11%* -2.79% 2.40%
> Shell_Scripts_(1_concurrent) *39.68%* 1.86% *5.30%*
> Shell_Scripts_(8_concurrent) *41.35%* 2.49% *7.16%*
> System_Call_Overhead -1.55% -0.04% *8.23%*
>
> System_Benchmarks_Index_Score 12.08% 0.63% 3.88%
>
> For 128core,
> - case1 wins by a lot on Execl_Throughput/Process_Creation/Shell_Scripts,
>   is also good at Pipe-based_Context_Switching, and its overall score is
>   12.08% higher than 4K base pages.
> - case2/case3 win on File_Copy_256/Pipe_Throughput, but case2's overall
>   score is no better than 4K base pages, while case3 wins by 3.88%.
>
> 3) Lmbench Processor_processes
> Processor_Processes case1 case2 case3
> null_call 1.76% 0.40% 0.65%
> null_io -0.76% -0.38% -0.23%
> stat *-16.09%* *-12.49%* 4.22%
> open_close -2.69% 4.51% 3.21%
> slct_TCP -0.56% 0.00% -0.44%
> sig_inst -1.54% 0.73% 0.70%
> sig_hndl -2.85% 0.01% 1.85%
> fork_proc *23.31%* 8.77% -5.42%
> exec_proc *13.22%* -0.30% 1.09%
> sh_proc *14.04%* -0.10% 1.09%
>
> - case1 is much better than 4K base pages, as in the Unixbench test;
>   case2 is better on fork_proc, but case3 is worse.
> - note: the variance of fork/exec/sh is bigger than the other items.
>
> 4) Lmbench Context_switching_ctxsw
> Context_switching_ctxsw case1 case2 case3
> 2p/0K -12.16% -5.29% -1.86%
> 2p/16K -11.26% -3.71% -4.53%
> 2p/64K -2.60% 3.84% -1.98%
> 8p/16K -7.56% -1.21% -0.88%
> 8p/64K 5.10% 4.88% 1.19%
> 16p/16K -5.81% -2.44% -3.84%
> 16p/64K 4.29% -1.94% -2.50%
> - case1/2/3 are worse than 4K base pages, and case1 is the worst.
>
> 5) Lmbench Local_latencies
> Local_latencies case1 case2 case3
> Pipe -9.23% 0.58% -4.34%
> AF_UNIX -5.34% -1.76% 3.03%
> UDP -6.70% -5.96% -9.81%
> TCP -7.95% -7.58% -5.63%
> TCP_conn -213.99% -227.78% -659.67%
> - TCP_conn is very unreliable, ignore it.
> - case1/2/3 are slower than 4K base pages.
>
> 6) Lmbench File_&_VM_latencies
> File_&_VM_latencies case1 case2 case3
> 10K_File_Create 2.60% -0.52% 2.66%
> 10K_File_Delete -2.91% -5.20% -2.11%
> 10K_File_Create 10.23% 1.18% 0.12%
> 10K_File_Delete -17.76% -2.97% -1.49%
> Mmap_Latency *63.05%* 2.57% -0.96%
> Prot_Fault 10.41% -3.21% *-19.11%*
> Page_Fault *-132.01%* 2.35% -0.79%
> 100fd_selct -1.20% 0.10% 0.31%
> - case1 is very good at Mmap_Latency but not good at Page_Fault.
> - case2/3 are slower on Prot_Fault/10K_File_Delete vs 4K base pages;
>   the rest doesn't look much different.
>
> 7) Lmbench Local_bandwidths
> Local_bandwidths case1 case2 case3
> Pipe 265.22% 15.44% 11.33%
> AF_UNIX 13.41% -2.66% 2.63%
> TCP -1.30% 25.90% 2.48%
> File_reread 14.79% 31.52% -14.16%
> Mmap_reread 27.47% 49.00% -0.11%
> Bcopy(libc) 2.58% 2.45% 2.46%
> Bcopy(hand) 25.78% 22.56% 22.68%
> Mem_read 38.26% 36.80% 36.49%
> Mem_write 10.93% 3.44% 3.12%
>
> - case1 is very good at bandwidth; case2 is better than 4K base pages
>   but lower than case1; case3 is bad at File_reread.
>
> 8) Lmbench Memory_latencies
> Memory_latencies case1 case2 case3
> L1_$ 0.02% 0.00% -0.03%
> L2_$ -1.56% -2.65% -1.25%
> Main_mem 50.82% 32.51% 33.47%
> Rand_mem 15.29% -8.79% -8.80%
>
> - case1 is also good at Main/Rand mem access latencies.
> - case2/case3 are better at Main_mem, but worse at Rand_mem.
>
> Tested-by: Kefeng Wang <wangkefeng.wang@...wei.com>
>