Message-ID: <520F7E0B-E0B7-4A84-9046-B8B5FC6EA9F7@nvidia.com>
Date: Mon, 27 Jan 2025 07:37:19 -0500
From: Zi Yan <ziy@...dia.com>
To: David Rientjes <rientjes@...gle.com>, Shivank Garg <shivankg@....com>
Cc: akpm@...ux-foundation.org, lsf-pc@...ts.linux-foundation.org,
 linux-mm@...ck.org, AneeshKumar.KizhakeVeetil@....com,
 baolin.wang@...ux.alibaba.com, bharata@....com, david@...hat.com,
 gregory.price@...verge.com, honggyu.kim@...com, jane.chu@...cle.com,
 jhubbard@...dia.com, jon.grimm@....com, k.shutemov@...il.com,
 leesuyeon0506@...il.com, leillc@...gle.com, liam.howlett@...cle.com,
 linux-kernel@...r.kernel.org, mel.gorman@...il.com, Michael.Day@....com,
 Raghavendra.KodsaraThimmappa@....com, riel@...riel.com,
 santosh.shukla@....com, shy828301@...il.com, sj@...nel.org,
 wangkefeng.wang@...wei.com, weixugc@...gle.com, willy@...radead.org,
 ying.huang@...ux.alibaba.com
Subject: Re: [LSF/MM/BPF TOPIC] Enhancements to Page Migration with
 Multi-threading and Batch Offloading to DMA

On 27 Jan 2025, at 1:55, David Rientjes wrote:

> On Thu, 23 Jan 2025, Shivank Garg wrote:
>
>> Hi all,
>>
>> Zi Yan and I would like to propose the topic: Enhancements to Page
>> Migration with Multi-threading and Batch Offloading to DMA.
>>
>
> I think this would be a very useful topic to discuss, thanks for proposing
> it.
>
>> Page migration is a critical operation in NUMA systems that can incur
>> significant overheads, affecting memory management performance across
>> various workloads. For example, copying folios between DRAM NUMA nodes
>> can take ~25% of the total migration cost for migrating 256MB of data.
>>
>> Modern systems are equipped with powerful DMA engines for bulk data
>> copying, GPUs, and high CPU core counts. Leveraging these hardware
>> capabilities becomes essential for systems where frequent page promotion
>> and demotion occur - from large-scale tiered-memory systems with CXL nodes
>> to CPU-GPU coherent system with GPU memory exposed as NUMA nodes.
>>
>
> Indeed, there are multiple use cases for optimizations in this area.  With
> the ramp of memory tiered systems, I think there will be an even greater
> reliance on memory migration going forward.
>
> Do you have numbers to share on how offloading, even as a proof of
> concept, moves the needle compared to traditional and sequential memory
> migration?

For multithreaded page migration, see my RFC patchset [1]:

on NVIDIA Grace:

The 32-thread copy throughput can be up to 10x that of the single-threaded
serial folio copy. Batching folio copies benefits not only huge pages but
also base pages; a simplified sketch of the chunked copy is included after
the numbers below.

64KB (GB/s):

| -------- | ------- | ---- | ----- | ----- | ----- | ----- | ----- |
| nr_pages | vanilla | mt_1 | mt_2  | mt_4  | mt_8  | mt_16 | mt_32 |
| -------- | ------- | ---- | ----- | ----- | ----- | ----- | ----- |
| 32       | 5.43    | 4.90 | 5.65  | 7.31  | 7.60  | 8.61  | 6.43  |
| 256      | 6.95    | 6.89 | 9.28  | 14.67 | 22.41 | 23.39 | 23.93 |
| 512      | 7.88    | 7.26 | 10.15 | 17.53 | 27.82 | 27.88 | 33.93 |
| 768      | 7.65    | 7.42 | 10.46 | 18.59 | 28.65 | 29.67 | 30.76 |
| 1024     | 7.46    | 8.01 | 10.90 | 17.77 | 27.04 | 32.18 | 38.80 |

2MB mTHP (GB/s):

| -------- | ------- | ---- | ----- | ----- | ----- | ----- | ----- |
| nr_pages | vanilla | mt_1 | mt_2  | mt_4  | mt_8  | mt_16 | mt_32 |
| -------- | ------- | ---- | ----- | ----- | ----- | ----- | ----- |
| 1        | 5.94    | 2.90 | 6.90  | 8.56  | 11.16 | 8.76  | 6.41  |
| 2        | 7.67    | 5.57 | 7.11  | 12.48 | 17.37 | 15.68 | 14.10 |
| 4        | 8.01    | 6.04 | 10.25 | 20.14 | 22.52 | 27.79 | 25.28 |
| 8        | 8.42    | 7.00 | 11.41 | 24.73 | 33.96 | 32.62 | 39.55 |
| 16       | 9.41    | 6.91 | 12.23 | 27.51 | 43.95 | 49.15 | 51.38 |
| 32       | 10.23   | 7.15 | 13.03 | 29.52 | 49.49 | 69.98 | 71.51 |
| 64       | 9.40    | 7.37 | 13.88 | 30.38 | 52.00 | 76.89 | 79.41 |
| 128      | 8.59    | 7.23 | 14.20 | 28.39 | 49.98 | 78.27 | 90.18 |
| 256      | 8.43    | 7.16 | 14.59 | 28.14 | 48.78 | 76.88 | 92.28 |
| 512      | 8.31    | 7.78 | 14.40 | 26.20 | 43.31 | 63.91 | 75.21 |
| 768      | 8.30    | 7.86 | 14.83 | 27.41 | 46.25 | 69.85 | 81.31 |
| 1024     | 8.31    | 7.90 | 14.96 | 27.62 | 46.75 | 71.76 | 83.84 |


I also ran it on a two-socket Xeon E5-2650 v4:


4KB (GB/s)

| -------- | ------- | ---- | ---- | ---- | ---- | ----- |
| nr_pages | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
| -------- | ------- | ---- | ---- | ---- | ---- | ----- |
| 512      | 1.12    | 1.19 | 1.20 | 1.26 | 1.27 | 1.35  |
| 768      | 1.29    | 1.14 | 1.28 | 1.40 | 1.39 | 1.46  |
| 1024     | 1.19    | 1.25 | 1.34 | 1.51 | 1.52 | 1.53  |
| 2048     | 1.14    | 1.12 | 1.44 | 1.61 | 1.73 | 1.71  |
| 4096     | 1.09    | 1.14 | 1.46 | 1.64 | 1.81 | 1.78  |



2MB (GB/s)
| -------- | ------- | ---- | ---- | ----- | ----- | ----- |
| nr_pages | vanilla | mt_1 | mt_2 | mt_4  | mt_8  | mt_16 |
| -------- | ------- | ---- | ---- | ----- | ----- | ----- |
| 1        | 2.03    | 2.21 | 2.69 | 2.93  | 3.17  | 3.14  |
| 2        | 2.28    | 2.13 | 3.54 | 4.50  | 4.72  | 4.72  |
| 4        | 2.92    | 2.93 | 4.44 | 6.50  | 7.24  | 7.06  |
| 8        | 2.29    | 2.37 | 3.21 | 6.86  | 8.83  | 8.44  |
| 16       | 2.10    | 2.09 | 4.57 | 8.06  | 8.32  | 9.70  |
| 32       | 2.22    | 2.21 | 4.43 | 8.96  | 9.37  | 11.54 |
| 64       | 2.35    | 2.35 | 3.15 | 7.77  | 10.77 | 13.61 |
| 128      | 2.48    | 2.53 | 5.12 | 8.18  | 11.01 | 15.62 |
| 256      | 2.55    | 2.53 | 5.44 | 8.25  | 12.73 | 16.49 |
| 512      | 2.61    | 2.52 | 5.73 | 11.26 | 17.18 | 16.97 |
| 768      | 2.55    | 2.53 | 5.90 | 11.41 | 14.86 | 17.15 |
| 1024     | 2.56    | 2.52 | 5.99 | 11.46 | 16.77 | 17.25 |



Shivank ran it on an AMD EPYC Zen 5 system, after some tuning (spreading the
copy threads across different CCDs):

2MB pages (GB/s):
nr_pages            vanilla   mt:0      mt:1      mt:2      mt:4      mt:8      mt:16     mt:32
1                   10.74     11.04     4.68      8.17      6.47      6.09      3.97      6.20
2                   12.44     4.90      11.19     14.10     15.33     8.45      10.09     9.97
4                   14.82     9.80      11.93     18.35     21.82     17.09     10.53     7.51
8                   16.13     9.91      15.26     11.85     26.53     13.09     12.71     13.75
16                  15.99     8.81      13.84     22.43     33.89     11.91     12.30     13.26
32                  14.03     11.37     17.54     23.96     57.07     18.78     19.51     21.29
64                  15.79     9.55      22.19     33.17     57.18     65.51     55.39     62.53
128                 18.22     16.65     21.49     30.73     52.99     61.05     58.44     60.38
256                 19.78     20.56     24.72     34.94     56.73     71.11     61.83     62.77
512                 20.27     21.40     27.47     39.23     65.72     67.97     70.48     71.39
1024                20.48     21.48     27.48     38.30     68.62     77.94     78.00     78.95
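
The core of the multithreaded path is just chunking one large copy across
worker threads. A simplified sketch of the idea (not the code from the RFC;
copy_folio_mt() and the use of the system workqueue here are only
placeholders) would look like:

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/workqueue.h>

/* Sketch only: split one big copy into per-thread chunks and hand them
 * to the unbound system workqueue.
 */
struct copy_chunk {
	struct work_struct work;
	void *dst;
	const void *src;
	size_t len;
};

static void copy_chunk_fn(struct work_struct *work)
{
	struct copy_chunk *c = container_of(work, struct copy_chunk, work);

	memcpy(c->dst, c->src, c->len);
}

static void copy_folio_mt(void *dst, const void *src, size_t size,
			  int nr_threads)
{
	size_t chunk = size / nr_threads;
	struct copy_chunk *chunks;
	int i;

	chunks = kcalloc(nr_threads, sizeof(*chunks), GFP_KERNEL);
	if (!chunks) {
		memcpy(dst, src, size);	/* fall back to a serial copy */
		return;
	}

	for (i = 0; i < nr_threads; i++) {
		chunks[i].dst = dst + i * chunk;
		chunks[i].src = src + i * chunk;
		chunks[i].len = (i == nr_threads - 1) ?
				size - i * chunk : chunk;
		INIT_WORK(&chunks[i].work, copy_chunk_fn);
		queue_work(system_unbound_wq, &chunks[i].work);
	}

	for (i = 0; i < nr_threads; i++)
		flush_work(&chunks[i].work);

	kfree(chunks);
}

Batching builds on top of this: instead of invoking the copy once per folio,
the migration code can gather the src/dst pairs of a whole batch and split
the combined work across the same set of threads.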



>
>> Existing page migration performs sequential page copying, underutilizing
>> modern CPU architectures and high-bandwidth memory subsystems.
>>
>> We have proposed and posted RFCs to enhance page migration through three
>> key techniques:
>> 1. Batching migration operations for bulk copying data [1]
>> 2. Multi-threaded folio copying [2]
>> 3. DMA offloading to hardware accelerators [1]
>>
>
> Curious: does memory migration of pages that are actively undergoing DMA
> with hardware assist fit into any of these?

It should be similar to 3, but in our case DMA is used to copy pages
between NUMA nodes, whereas traditional DMA-based page migration copies
pages between the host and a device.
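
For illustration, offloading one such copy through the dmaengine API would
look roughly like the sketch below (not the code from the posted RFCs; DMA
mapping of src/dst, batching of descriptors, and most error handling are
omitted):

#include <linux/dmaengine.h>

/* Sketch only: submit a single memcpy to a dmaengine channel and wait
 * for it synchronously.  A real implementation would queue many
 * descriptors per batch and use completion callbacks instead of
 * polling.
 */
static int dma_copy_one(struct dma_chan *chan, dma_addr_t dst,
			dma_addr_t src, size_t len)
{
	struct dma_async_tx_descriptor *tx;
	dma_cookie_t cookie;

	tx = dmaengine_prep_dma_memcpy(chan, dst, src, len,
				       DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
	if (!tx)
		return -EIO;

	cookie = dmaengine_submit(tx);
	if (dma_submit_error(cookie))
		return -EIO;

	dma_async_issue_pending(chan);

	if (dma_sync_wait(chan, cookie) != DMA_COMPLETE)
		return -EIO;

	return 0;
}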

>
>> By employing batching and multi-threaded folio copying, we are able to
>> achieve significant improvements in page migration throughput for large
>> pages.
>>
>> Discussion points:
>> 1. Performance:
>>    a. Policy decision for DMA and CPU selection
>>    b. Platform-specific scheduling of folio-copy worker threads for better
>>       bandwidth utilization
>
> Why platform specific?  I *assume* this means a generic framework that can
> optimize for scheduling based on the underlying hardware and not specific
> implementations that can only be used on AMD, for example.  Is that the
> case?

I think the framework will be generic, but the CPU scheduling (which cores
to choose for page copying) will differ from vendor to vendor.

Because of differences in CPU topology, such as chiplet designs, a single
CPU scheduling algorithm does not fit CPUs from all vendors. For example, on
NVIDIA Grace you can use any CPUs to copy pages and always achieve high
page copy throughput, but on AMD CPUs with multiple CCDs, spreading the copy
threads across different CCDs achieves much higher page copy throughput
than putting all threads on a single CCD. I assume Intel CPUs with a chiplet
design would see the same result.
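
One vendor-neutral way to express that policy would be to pick at most one
CPU per last-level-cache domain (which maps to a CCD on AMD). The sketch
below only illustrates the idea; pick_copy_cpus() is a made-up helper, not
a proposed interface:

#include <linux/cpumask.h>
#include <linux/sched/topology.h>

/* Sketch only: choose up to nr_threads CPUs such that no two of them
 * share a last-level cache, so copy threads spread across CCDs on
 * chiplet CPUs while the policy degenerates gracefully on CPUs with a
 * single large LLC.
 */
static void pick_copy_cpus(struct cpumask *picked, int nr_threads)
{
	int cpu, chosen;

	cpumask_clear(picked);

	for_each_online_cpu(cpu) {
		bool shares_llc = false;

		if (cpumask_weight(picked) >= nr_threads)
			break;

		for_each_cpu(chosen, picked) {
			if (cpus_share_cache(cpu, chosen)) {
				shares_llc = true;
				break;
			}
		}

		if (!shares_llc)
			cpumask_set_cpu(cpu, picked);
	}
}

If there are fewer LLC domains than requested threads, the remaining threads
would have to double up within domains; how to handle that, and whether to
also weigh NUMA distance to the source/destination nodes, is part of the
scheduling policy question above.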

>
>>    c. Using Non-temporal instructions for CPU-based memcpy
>>    d. Upscaling/downscaling worker threads based on migration size, CPU
>>       availability (system load), bandwidth saturation, etc.
>> 2. Interface requirements with DMA hardware:
>>    a. Standardizing APIs for DMA drivers and support for different DMA
>>       drivers
>>    b. Enhancing DMA drivers for bulk copying (e.g., SDXi Engine)
>> 3. Resources Accounting:
>>    a. CPU cgroups accounting and fairness [3]
>>    b. Who bears migration cost? - (Migration cost attribution)
>>
>> References:
>> [1] https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
>> [2] https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
>> [3] https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com
>>

[1] https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com/
--
Best Regards,
Yan, Zi
