Message-ID: <c48c5b04-b20d-4af7-b88a-5aae9386aaec@amd.com>
Date: Tue, 28 Jan 2025 12:24:32 +0530
From: Shivank Garg <shivankg@....com>
To: Zi Yan <ziy@...dia.com>, David Rientjes <rientjes@...gle.com>
Cc: akpm@...ux-foundation.org, lsf-pc@...ts.linux-foundation.org,
linux-mm@...ck.org, AneeshKumar.KizhakeVeetil@....com,
baolin.wang@...ux.alibaba.com, bharata@....com, david@...hat.com,
gregory.price@...verge.com, honggyu.kim@...com, jane.chu@...cle.com,
jhubbard@...dia.com, jon.grimm@....com, k.shutemov@...il.com,
leesuyeon0506@...il.com, leillc@...gle.com, liam.howlett@...cle.com,
linux-kernel@...r.kernel.org, mel.gorman@...il.com, Michael.Day@....com,
Raghavendra.KodsaraThimmappa@....com, riel@...riel.com,
santosh.shukla@....com, shy828301@...il.com, sj@...nel.org,
wangkefeng.wang@...wei.com, weixugc@...gle.com, willy@...radead.org,
ying.huang@...ux.alibaba.com, Jonathan.Cameron@...wei.com
Subject: Re: [LSF/MM/BPF TOPIC] Enhancements to Page Migration with
Multi-threading and Batch Offloading to DMA
Hi David, Zi,
On 1/27/2025 6:07 PM, Zi Yan wrote:
> On 27 Jan 2025, at 1:55, David Rientjes wrote:
>
>> On Thu, 23 Jan 2025, Shivank Garg wrote:
>>
>>> Hi all,
>>>
>>> Zi Yan and I would like to propose the topic: Enhancements to Page
>>> Migration with Multi-threading and Batch Offloading to DMA.
>>>
>>
>> I think this would be a very useful topic to discuss, thanks for proposing
>> it.
Thanks for your interest in our proposal.
>>
>>> Page migration is a critical operation in NUMA systems that can incur
>>> significant overheads, affecting memory management performance across
>>> various workloads. For example, copying folios between DRAM NUMA nodes
>>> can take ~25% of the total migration cost for migrating 256MB of data.
>>>
>>> Modern systems are equipped with powerful DMA engines for bulk data
>>> copying, GPUs, and high CPU core counts. Leveraging these hardware
>>> capabilities becomes essential for systems where frequent page promotion
>>> and demotion occur - from large-scale tiered-memory systems with CXL nodes
>>> to CPU-GPU coherent systems with GPU memory exposed as NUMA nodes.
>>>
>>
>> Indeed, there are multiple use cases for optimizations in this area. With
>> the ramp of memory tiered systems, I think there will be an even greater
>> reliance on memory migration going forward.
>>
>> Do you have numbers to share on how offloading, even as a proof of
>> concept, moves the needle compared to traditional and sequential memory
>> migration?
>
> For multithreaded page migration, you can see my RFC patchset[1]:
>
> on NVIDIA Grace:
>
> The 32-thread copy throughput can be up to 10x that of a single-threaded
> serial folio copy. Batching folio copies benefits not only huge pages but
> also base pages.
>
> 64KB (GB/s):
>
> nr_pages  vanilla   mt_1   mt_2   mt_4   mt_8  mt_16  mt_32
>       32     5.43   4.90   5.65   7.31   7.60   8.61   6.43
>      256     6.95   6.89   9.28  14.67  22.41  23.39  23.93
>      512     7.88   7.26  10.15  17.53  27.82  27.88  33.93
>      768     7.65   7.42  10.46  18.59  28.65  29.67  30.76
>     1024     7.46   8.01  10.90  17.77  27.04  32.18  38.80
>
> 2MB mTHP (GB/s):
>
> nr_pages  vanilla   mt_1   mt_2   mt_4   mt_8  mt_16  mt_32
>        1     5.94   2.90   6.90   8.56  11.16   8.76   6.41
>        2     7.67   5.57   7.11  12.48  17.37  15.68  14.10
>        4     8.01   6.04  10.25  20.14  22.52  27.79  25.28
>        8     8.42   7.00  11.41  24.73  33.96  32.62  39.55
>       16     9.41   6.91  12.23  27.51  43.95  49.15  51.38
>       32    10.23   7.15  13.03  29.52  49.49  69.98  71.51
>       64     9.40   7.37  13.88  30.38  52.00  76.89  79.41
>      128     8.59   7.23  14.20  28.39  49.98  78.27  90.18
>      256     8.43   7.16  14.59  28.14  48.78  76.88  92.28
>      512     8.31   7.78  14.40  26.20  43.31  63.91  75.21
>      768     8.30   7.86  14.83  27.41  46.25  69.85  81.31
>     1024     8.31   7.90  14.96  27.62  46.75  71.76  83.84
>
>
> I also ran it on a two-socket Xeon E5-2650 v4:
>
>
> 4KB (GB/s)
>
> | -------- | ------- | ---- | ---- | ---- | ---- | ----- |
> | nr_pages | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
> | -------- | ------- | ---- | ---- | ---- | ---- | ----- |
> | 512      | 1.12    | 1.19 | 1.20 | 1.26 | 1.27 | 1.35  |
> | 768      | 1.29    | 1.14 | 1.28 | 1.40 | 1.39 | 1.46  |
> | 1024     | 1.19    | 1.25 | 1.34 | 1.51 | 1.52 | 1.53  |
> | 2048     | 1.14    | 1.12 | 1.44 | 1.61 | 1.73 | 1.71  |
> | 4096     | 1.09    | 1.14 | 1.46 | 1.64 | 1.81 | 1.78  |
>
>
>
> 2MB (GB/s)
> | -------- | ------- | ---- | ---- | ----- | ----- | ----- |
> | nr_pages | vanilla | mt_1 | mt_2 | mt_4  | mt_8  | mt_16 |
> | -------- | ------- | ---- | ---- | ----- | ----- | ----- |
> | 1        | 2.03    | 2.21 | 2.69 | 2.93  | 3.17  | 3.14  |
> | 2        | 2.28    | 2.13 | 3.54 | 4.50  | 4.72  | 4.72  |
> | 4        | 2.92    | 2.93 | 4.44 | 6.50  | 7.24  | 7.06  |
> | 8        | 2.29    | 2.37 | 3.21 | 6.86  | 8.83  | 8.44  |
> | 16       | 2.10    | 2.09 | 4.57 | 8.06  | 8.32  | 9.70  |
> | 32       | 2.22    | 2.21 | 4.43 | 8.96  | 9.37  | 11.54 |
> | 64       | 2.35    | 2.35 | 3.15 | 7.77  | 10.77 | 13.61 |
> | 128      | 2.48    | 2.53 | 5.12 | 8.18  | 11.01 | 15.62 |
> | 256      | 2.55    | 2.53 | 5.44 | 8.25  | 12.73 | 16.49 |
> | 512      | 2.61    | 2.52 | 5.73 | 11.26 | 17.18 | 16.97 |
> | 768      | 2.55    | 2.53 | 5.90 | 11.41 | 14.86 | 17.15 |
> | 1024     | 2.56    | 2.52 | 5.99 | 11.46 | 16.77 | 17.25 |
>
>
>
> Shivank ran it on an AMD EPYC Zen 5, after some tuning (spreading threads across different CCDs):
>
> 2MB pages (GB/s):
> nr_pages  vanilla   mt:0   mt:1   mt:2   mt:4   mt:8  mt:16  mt:32
>        1    10.74  11.04   4.68   8.17   6.47   6.09   3.97   6.20
>        2    12.44   4.90  11.19  14.10  15.33   8.45  10.09   9.97
>        4    14.82   9.80  11.93  18.35  21.82  17.09  10.53   7.51
>        8    16.13   9.91  15.26  11.85  26.53  13.09  12.71  13.75
>       16    15.99   8.81  13.84  22.43  33.89  11.91  12.30  13.26
>       32    14.03  11.37  17.54  23.96  57.07  18.78  19.51  21.29
>       64    15.79   9.55  22.19  33.17  57.18  65.51  55.39  62.53
>      128    18.22  16.65  21.49  30.73  52.99  61.05  58.44  60.38
>      256    19.78  20.56  24.72  34.94  56.73  71.11  61.83  62.77
>      512    20.27  21.40  27.47  39.23  65.72  67.97  70.48  71.39
>     1024    20.48  21.48  27.48  38.30  68.62  77.94  78.00  78.95
>
>
>
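To put the numbers above in context: the core idea of the multi-threaded folio
copy RFC is to split the actual data copy of a folio (or a batch of folios)
across several workers instead of a single CPU. A rough, untested sketch of
that shape is below; the struct and function names are made up for
illustration, and the real patches differ in details (error handling, flags,
batching across folios, etc.):

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/workqueue.h>

struct copy_chunk {
	struct work_struct work;
	struct folio *dst, *src;
	unsigned int first, nr;		/* subpage range within the folio */
};

static void copy_chunk_fn(struct work_struct *work)
{
	struct copy_chunk *c = container_of(work, struct copy_chunk, work);
	unsigned int i;

	for (i = 0; i < c->nr; i++)
		copy_highpage(folio_page(c->dst, c->first + i),
			      folio_page(c->src, c->first + i));
}

/* Fan the copy of one (possibly large) folio out over the unbound workqueue. */
static void folio_copy_mt(struct folio *dst, struct folio *src,
			  unsigned int nr_workers)
{
	unsigned int nr = folio_nr_pages(src);
	unsigned int chunk = DIV_ROUND_UP(nr, nr_workers);
	struct copy_chunk c[8];		/* sketch assumes nr_workers <= 8 */
	unsigned int i, off;

	for (i = 0, off = 0; i < nr_workers && off < nr; i++, off += chunk) {
		c[i].dst = dst;
		c[i].src = src;
		c[i].first = off;
		c[i].nr = min(chunk, nr - off);
		INIT_WORK_ONSTACK(&c[i].work, copy_chunk_fn);
		queue_work(system_unbound_wq, &c[i].work);
	}
	while (i--)
		flush_work(&c[i].work);
}

The same fan-out applies when a batch of base pages is migrated at once: the
batch, rather than the subpages of one large folio, is what gets split across
the workers.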
>>
>>> Existing page migration performs sequential page copying, underutilizing
>>> modern CPU architectures and high-bandwidth memory subsystems.
>>>
>>> We have proposed and posted RFCs to enhance page migration through three
>>> key techniques:
>>> 1. Batching migration operations for bulk copying data [1]
>>> 2. Multi-threaded folio copying [2]
>>> 3. DMA offloading to hardware accelerators [1]
>>>
>>
>> Curious: does memory migration of pages that are actively undergoing DMA
>> with hardware assist fit into any of these?
>
> It should be similar to 3, but in this case DMA is used to copy pages
> between NUMA nodes, whereas traditional DMA page migration copies pages
> between the host and devices.
>
I'm planning to test with SDXi as the DMA engine for offload; AFAIU it
doesn't support migrating pages that are actively undergoing DMA.
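
For the offload path itself, the shape I have in mind is the standard
dmaengine memcpy flow. An untested, illustrative sketch (it takes an
already-requested DMA_MEMCPY-capable channel, omits error/unmap cleanup on
failure, ignores engine limits such as the maximum segment size, and waits
synchronously where real code would batch many folios per completion
interrupt):

#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>
#include <linux/mm.h>

static int folio_copy_dma(struct dma_chan *chan, struct folio *dst,
			  struct folio *src)
{
	struct device *dev = chan->device->dev;
	struct dma_async_tx_descriptor *tx;
	size_t len = folio_size(src);
	dma_addr_t daddr, saddr;
	dma_cookie_t cookie;

	saddr = dma_map_page(dev, folio_page(src, 0), 0, len, DMA_TO_DEVICE);
	daddr = dma_map_page(dev, folio_page(dst, 0), 0, len, DMA_FROM_DEVICE);

	tx = dmaengine_prep_dma_memcpy(chan, daddr, saddr, len,
				       DMA_PREP_INTERRUPT | DMA_CTRL_ACK);
	if (!tx)
		return -EIO;	/* fall back to the CPU copy path */

	cookie = dmaengine_submit(tx);
	dma_async_issue_pending(chan);
	dma_sync_wait(chan, cookie);	/* real code: async completion per batch */

	dma_unmap_page(dev, saddr, len, DMA_TO_DEVICE);
	dma_unmap_page(dev, daddr, len, DMA_FROM_DEVICE);
	return 0;
}

The channel would come from dma_request_chan_by_mask() with DMA_MEMCPY
capability; how cleanly SDXi (or other engines) can be exposed through that
interface is exactly the kind of question under discussion point 2 below.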
>>
>>> By employing batching and multi-threaded folio copying, we are able to
>>> achieve significant improvements in page migration throughput for large
>>> pages.
>>>
>>> Discussion points:
>>> 1. Performance:
>>> a. Policy decision for DMA and CPU selection
>>> b. Platform-specific scheduling of folio-copy worker threads for better
>>> bandwidth utilization
>>
>> Why platform specific? I *assume* this means a generic framework that can
>> optimize for scheduling based on the underlying hardware and not specific
>> implementations that can only be used on AMD, for example. Is that the
>> case?
>
> I think the framework will be generic but the CPU scheduling (which core
> to choose for page copying) will be different from vendor to vendor.
>
> Due to differences in CPU topology, such as chiplet designs, a single CPU
> scheduling algorithm does not fit CPUs from different vendors. For example, on
> NVIDIA Grace you can use any CPUs to copy pages and always achieve high
> page copy throughput, but on AMD CPUs with multiple CCDs, spreading copy
> threads across different CCDs achieves much higher page copy throughput
> than putting all threads in a single CCD. I assume Intel CPUs with a chiplet
> design would see the same result.
Thank you, Zi, for helping with the results and queries.
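
For reference, the EPYC tuning mentioned above boils down to two policy
questions: where to place the copy workers and how many to use. An untested
sketch of what I mean is below; the function names are made up, the thresholds
are placeholders, and a real policy would also have to respect
cpusets/isolated CPUs and the NUMA locality of the source and destination
nodes:

#include <linux/cpumask.h>
#include <linux/sched/topology.h>
#include <linux/sizes.h>

/* Pick at most one worker CPU per last-level-cache domain (one per CCD on
 * AMD), so copy threads are spread across CCDs instead of packed into one. */
static void pick_copy_cpus(struct cpumask *out, unsigned int max_workers)
{
	unsigned int cpu, chosen;
	bool shares_llc;

	cpumask_clear(out);
	for_each_online_cpu(cpu) {
		if (cpumask_weight(out) >= max_workers)
			break;
		shares_llc = false;
		for_each_cpu(chosen, out) {
			if (cpus_share_cache(cpu, chosen)) {
				shares_llc = true;
				break;
			}
		}
		if (!shares_llc)
			cpumask_set_cpu(cpu, out);
	}
}

/* Scale the worker count with the amount of data to migrate, so small
 * migrations stay single-threaded (placeholder thresholds). */
static unsigned int nr_copy_workers(unsigned long bytes)
{
	if (bytes < SZ_2M)
		return 1;
	return min(8UL, bytes / SZ_2M);
}

The workers would then be run on the chosen CPUs via queue_work_on() or bound
kthreads. On parts like Grace, where any CPU copies at high throughput per
Zi's data, the placement step matters much less and could simply prefer CPUs
local to the destination node.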
>
>>
>>> c. Using Non-temporal instructions for CPU-based memcpy
>>> d. Upscaling/downscaling worker threads based on migration size, CPU
>>> availability (system load), bandwidth saturation, etc.
>>> 2. Interface requirements with DMA hardware:
>>> a. Standardizing APIs for DMA drivers and support for different DMA
>>> drivers
>>> b. Enhancing DMA drivers for bulk copying (e.g., SDXi Engine)
>>> 3. Resources Accounting:
>>> a. CPU cgroups accounting and fairness [3]
>>> b. Who bears migration cost? - (Migration cost attribution)
>>>
>>> References:
>>> [1] https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
>>> [2] https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
>>> [3] https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com
>>>
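On discussion point 1c above (non-temporal instructions): the idea would be a
cache-bypassing copy so that large migrations do not evict the hot working
set. A minimal, untested sketch using memcpy_flushcache(), which on x86_64 is
implemented with non-temporal stores (the generic fallback is a plain memcpy,
so this only changes behaviour on architectures that implement it):

#include <linux/highmem.h>
#include <linux/string.h>

/* Copy a folio page by page with cache-bypassing stores on the destination. */
static void folio_copy_nocache(struct folio *dst, struct folio *src)
{
	long i, nr = folio_nr_pages(src);

	for (i = 0; i < nr; i++) {
		void *from = kmap_local_folio(src, i * PAGE_SIZE);
		void *to = kmap_local_folio(dst, i * PAGE_SIZE);

		memcpy_flushcache(to, from, PAGE_SIZE);
		kunmap_local(to);
		kunmap_local(from);
	}
}

Whether this is a net win depends on how soon the destination pages are
accessed after migration, so it needs to be measured rather than assumed.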
>
> [1] https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com/
> --
> Best Regards,
> Yan, Zi
>