Message-ID: <633F4EFC-13A9-40DF-A27D-DBBDD0AF44F3@nvidia.com>
Date: Tue, 23 Sep 2025 23:22:03 -0400
From: Zi Yan <ziy@...dia.com>
To: Shivank Garg <shivankg@....com>
Cc: akpm@...ux-foundation.org, david@...hat.com, willy@...radead.org,
matthew.brost@...el.com, joshua.hahnjy@...il.com, rakie.kim@...com,
byungchul@...com, gourry@...rry.net, ying.huang@...ux.alibaba.com,
apopple@...dia.com, lorenzo.stoakes@...cle.com, Liam.Howlett@...cle.com,
vbabka@...e.cz, rppt@...nel.org, surenb@...gle.com, mhocko@...e.com,
vkoul@...nel.org, lucas.demarchi@...el.com, rdunlap@...radead.org,
jgg@...pe.ca, kuba@...nel.org, justonli@...omium.org, ivecera@...hat.com,
dave.jiang@...el.com, Jonathan.Cameron@...wei.com, dan.j.williams@...el.com,
rientjes@...gle.com, Raghavendra.KodsaraThimmappa@....com, bharata@....com,
alirad.malek@...corp.com, yiannis@...corp.com, weixugc@...gle.com,
linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [RFC V3 0/9] Accelerate page migration with batch copying and
hardware offload
On 23 Sep 2025, at 13:47, Shivank Garg wrote:
> This is the third RFC of the patchset to enhance page migration by batching
> folio-copy operations and enabling acceleration via multi-threaded CPU or
> DMA offload.
>
> Single-threaded, folio-by-folio copying bottlenecks page migration
> in modern systems with deep memory hierarchies, especially for large
> folios where copy overhead dominates, leaving significant hardware
> potential untapped.
>
> By batching the copy phase, we create an opportunity for significant
> hardware acceleration. This series builds a framework for this acceleration
> and provides two initial offload driver implementations: one using multiple
> CPU threads (mtcopy) and another leveraging the DMAEngine subsystem (dcbm).
>
> This version incorporates significant feedback to improve correctness,
> robustness, and the efficiency of the DMA offload path.
>
> Changelog since V2:
>
> 1. DMA Engine Rewrite:
> - Switched from per-folio dma_map_page() to batch dma_map_sgtable()
> - Single completion interrupt per batch (reduced overhead)
> - Order of magnitude improvement in setup time for large batches
> 2. Code cleanups and refactoring
> 3. Rebased on latest mainline (6.17-rc6+)
Thanks for working on this.
It is better to rebase on top of Andrew’s mm-new tree.
I have a version at: https://github.com/x-y-z/linux-dev/tree/batched_page_migration_copy_amd_v3-mm-everything-2025-09-23-00-13.
The difference is that I changed Patch 6 to use padata_do_multithreaded()
instead of my own implementation, since padata is a nice framework
for doing multithreaded jobs. The downside is that your patch 9
no longer applies and you will need to hack kernel/padata.c to
achieve the same thing.
I also tried to attribute page copy kthread time back to the initiating
thread, so that page copy time does not disappear when the copy is
parallelized across CPU threads. It is currently a hack in the last patch
of the above repo. With that patch, the system time of a page migration
process with multithreaded page copy looks almost the same as without it,
while the wall-clock time is smaller. But I have not yet found time to ask
the scheduler people about a proper implementation.
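
For reference, the padata-based copy path is shaped roughly like the
sketch below (condensed for illustration, not the actual patch; the
copy_ctx wrapper and the min_chunk/max_threads values are made up):

#include <linux/mm.h>
#include <linux/padata.h>

/* Illustrative context; the real code carries the src/dst folio arrays. */
struct copy_ctx {
	struct folio **src;
	struct folio **dst;
};

/* Worker: copy folios [start, end) of the batch on a padata thread. */
static void copy_chunk(unsigned long start, unsigned long end, void *arg)
{
	struct copy_ctx *ctx = arg;
	unsigned long i;

	for (i = start; i < end; i++)
		folio_copy(ctx->dst[i], ctx->src[i]);
}

static void folios_copy_mt(struct folio **dst, struct folio **src,
			   unsigned long nr_folios)
{
	struct copy_ctx ctx = { .src = src, .dst = dst };
	struct padata_mt_job job = {
		.thread_fn	= copy_chunk,
		.fn_arg		= &ctx,
		.start		= 0,
		.size		= nr_folios,
		.align		= 1,
		.min_chunk	= 1,	/* illustrative; worth tuning */
		.max_threads	= 8,	/* illustrative; worth tuning */
	};

	padata_do_multithreaded(&job);	/* returns when all chunks are done */
}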
>
> MOTIVATION:
> -----------
>
> Current Migration Flow:
> [ move_pages(), Compaction, Tiering, etc. ]
> |
> v
> [ migrate_pages() ] // Common entry point
> |
> v
> [ migrate_pages_batch() ] // NR_MAX_BATCHED_MIGRATION (512) folios at a time
> |
> |--> [ migrate_folio_unmap() ]
> |
> |--> [ try_to_unmap_flush() ] // Perform a single, batched TLB flush
> |
> |--> [ migrate_folios_move() ] // Bottleneck: Interleaved copy
> - For each folio:
> - Metadata prep: Copy flags, mappings, etc.
> - folio_copy() <-- Single-threaded, serial data copy.
> - Update PTEs & finalize for that single folio.
>
> Understanding overheads in page migration (move_pages() syscall):
>
> Total move_pages() overheads = folio_copy() + Other overheads
> 1. folio_copy() is the core copy operation that interests us.
> 2. The other overheads come from user/kernel transitions, page table walks,
>    locking, folio unmap, dst folio allocation, TLB flushes, copying flags,
>    updating mappings and PTEs, etc.
>
> Percentage of folio_copy() overheads in move_pages(N pages) syscall time,
> by number of pages being migrated and folio size:
>
>                       4KB       2MB
>   1 page              <1%      ~66%
>   512 pages          ~35%      ~97%
>
> Based on Amdahl's Law, optimizing folio_copy() for large pages offers a
> substantial performance opportunity.
>
> move_pages() syscall speedup = 1 / ((1 - F) + (F / S))
> Where F is the fraction of time spent in folio_copy() and S is the speedup of
> folio_copy().
>
> For 4KB folios, folio copy overheads in single-page migrations are too
> small to impact overall speedup; even for 512 pages, the maximum
> theoretical speedup is limited to ~1.54x with infinite folio_copy()
> speedup.
>
> For 2MB THPs, folio copy overheads are significant even in single-page
> migrations, giving a theoretical speedup of ~3x with infinite
> folio_copy() speedup, and up to ~33x for 512 pages.
>
> A realistic value of S (speedup of folio_copy()) is 7.5x for DMA offload,
> based on my measurements for copying 512 2MB pages. With F = 0.97, this
> gives move_pages() a practical speedup of
> 1 / ((1 - 0.97) + (0.97 / 7.5)) ≈ 6.3x for 512 2MB pages (also observed
> in the experiments below).
>
> DESIGN: A Pluggable Migrator Framework
> ---------------------------------------
>
> Introduce migrate_folios_batch_move():
>
> [ migrate_pages_batch() ]
> |
> |--> migrate_folio_unmap()
> |
> |--> try_to_unmap_flush()
> |
> +--> [ migrate_folios_batch_move() ] // new batched design
> |
> |--> Metadata migration
> | - Metadata prep: Copy flags, mappings, etc.
> | - Use MIGRATE_NO_COPY to skip the actual data copy.
> |
> |--> Batch copy folio data
> | - Migrator is configurable at runtime via sysfs.
> |
> | static_call(_folios_copy) // Pluggable migrators
> | / | \
> | v v v
> | [ Default ] [ MT CPU copy ] [ DMA Offload ]
> |
> +--> Update PTEs to point to dst folios and complete migration.
>
>
> User Control of Migrator:
>
> # echo 1 > /sys/kernel/dcbm/offloading
> |
> +--> Driver's sysfs handler
> |
> +--> calls start_offloading(&cpu_migrator)
> |
> +--> calls offc_update_migrator()
> |
> +--> static_call_update(_folios_copy, mig->migrate_offc)
>
> Later, During Migration ...
> migrate_folios_batch_move()
> |
> +--> static_call(_folios_copy) // Now dispatches to the selected migrator
> |
> +-> [ mtcopy | dcbm | kernel_default ]
>
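To make the dispatch concrete, here is a sketch of how the static_call
plumbing could look. It assumes the names from the diagrams above and a
list_head-based copy signature; the real definitions live in
include/linux/migrate_offc.h and mm/migrate_offc.c of this series.

#include <linux/mm.h>
#include <linux/static_call.h>

/* Default migrator: plain kernel copy, walking both lists in lockstep. */
static void folios_copy_default(struct list_head *dst_list,
				struct list_head *src_list)
{
	struct folio *src, *dst;

	dst = list_first_entry(dst_list, struct folio, lru);
	list_for_each_entry(src, src_list, lru) {
		folio_copy(dst, src);
		dst = list_next_entry(dst, lru);
	}
}

DEFINE_STATIC_CALL(_folios_copy, folios_copy_default);

/* Driver sysfs handlers funnel into this to swap the active migrator;
 * struct migrator and its migrate_offc member are assumed from the
 * diagram above.
 */
void offc_update_migrator(struct migrator *mig)
{
	static_call_update(_folios_copy,
			   mig ? mig->migrate_offc : folios_copy_default);
}

Later, migrate_folios_batch_move() would dispatch with:

	static_call(_folios_copy)(&dst_folios, &src_folios);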
>
> PERFORMANCE RESULTS:
> --------------------
>
> System Info: AMD Zen 3 EPYC server (2 sockets, 32 cores, SMT enabled),
> 1 NUMA node per socket, Linux kernel 6.16.0-rc6, DVFS governor set to
> performance, PTDMA hardware.
>
> Benchmark: Use move_pages() syscall to move pages between two NUMA nodes.
>
> 1. Moving folios of different sizes (4KB, 16KB, ..., 2MB) such that the
>    total transfer size is constant (1GB), with different numbers of
>    parallel threads/channels.
> Metric: Throughput is reported in GB/s.
>
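For context, the measurement is essentially the classic move_pages()
pattern. Below is a simplified userspace sketch (not the actual
benchmark): it assumes the buffer starts on the source node (e.g. run
under numactl --membind=0), moves 1GB of base pages to node 1, and times
the syscall; link with -lnuma.

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	long page_sz = sysconf(_SC_PAGESIZE);
	unsigned long count = (1UL << 30) / page_sz;	/* 1GB of pages */
	char *buf = aligned_alloc(page_sz, count * page_sz);
	void **pages = malloc(count * sizeof(*pages));
	int *nodes = malloc(count * sizeof(*nodes));
	int *status = malloc(count * sizeof(*status));
	struct timespec t0, t1;
	unsigned long i;

	memset(buf, 1, count * page_sz);	/* fault pages in on src node */
	for (i = 0; i < count; i++) {
		pages[i] = buf + i * page_sz;
		nodes[i] = 1;			/* destination NUMA node */
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (move_pages(0, count, pages, nodes, status, MPOL_MF_MOVE) < 0)
		perror("move_pages");
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) +
		      (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.2f GB/s\n", 1.0 / secs);	/* 1GB moved / elapsed time */
	return 0;
}
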
> a. Baseline (Vanilla kernel, single-threaded, folio-by-folio migration):
>
> Folio size|4K | 16K | 64K | 128K | 256K | 512K | 1M | 2M |
> ===============================================================================================================
> Tput(GB/s)|3.73±0.33| 5.53±0.36 | 5.90±0.56 | 6.34±0.08 | 6.50±0.05 | 6.86±0.61 | 6.92±0.71 | 10.67±0.36 |
>
> b. Multi-threaded CPU copy offload (mtcopy driver, N parallel threads = 1, 2, 4, 8, 12, 16):
>
> Thread | 4K | 16K | 64K | 128K | 256K | 512K | 1M | 2M |
> ===============================================================================================================
> 1 | 3.84±0.10 | 5.23±0.31 | 6.01±0.55 | 6.34±0.60 | 7.16±1.00 | 7.12±0.78 | 7.10±0.86 | 10.94±0.13 |
> 2 | 4.04±0.19 | 6.72±0.38 | 7.68±0.12 | 8.15±0.06 | 8.45±0.09 | 9.29±0.17 | 9.87±1.01 | 17.80±0.12 |
> 4 | 4.72±0.21 | 8.41±0.70 | 10.08±1.67 | 11.44±2.42 | 10.45±0.17 | 12.60±1.97 | 12.38±1.73 | 31.41±1.14 |
> 8 | 4.91±0.28 | 8.62±0.13 | 10.40±0.20 | 13.94±3.75 | 11.03±0.61 | 14.96±3.29 | 12.84±0.63 | 33.50±3.29 |
> 12 | 4.84±0.24 | 8.75±0.08 | 10.16±0.26 | 10.92±0.22 | 11.72±0.14 | 14.02±2.51 | 14.09±2.65 | 34.70±2.38 |
> 16 | 4.77±0.22 | 8.95±0.69 | 10.36±0.26 | 11.03±0.22 | 11.58±0.30 | 13.88±2.71 | 13.00±0.75 | 35.89±2.07 |
>
> c. DMA offload (dcbm driver, N DMA channels = 1, 2, 4, 8, 12, 16):
>
> Chan Cnt| 4K | 16K | 64K | 128K | 256K | 512K | 1M | 2M |
> ===============================================================================================================
> 1 | 2.75±0.19 | 2.86±0.13 | 3.28±0.20 | 4.57±0.72 | 5.03±0.62 | 4.69±0.25 | 4.78±0.34 | 12.50±0.24 |
> 2 | 3.35±0.19 | 4.57±0.19 | 5.35±0.55 | 6.71±0.71 | 7.40±1.07 | 7.38±0.61 | 7.21±0.73 | 14.23±0.34 |
> 4 | 4.01±0.17 | 6.36±0.26 | 7.71±0.89 | 9.40±1.35 | 10.27±1.96 | 10.60±1.42 | 12.35±2.64 | 26.84±0.91 |
> 8 | 4.46±0.16 | 7.74±0.13 | 9.72±1.29 | 10.88±0.16 | 12.12±2.54 | 15.62±3.96 | 13.29±2.65 | 45.27±2.60 |
> 12 | 4.60±0.22 | 8.90±0.84 | 11.26±2.19 | 16.00±4.41 | 14.90±4.38 | 14.57±2.84 | 13.79±3.18 | 59.94±4.19 |
> 16 | 4.61±0.25 | 9.08±0.79 | 11.14±1.75 | 13.95±3.85 | 13.69±3.39 | 15.47±3.44 | 15.44±4.65 | 63.69±5.01 |
>
> - Throughput increases with folio size. Larger folios benefit more from DMA.
> - Scaling shows diminishing returns beyond 8-12 threads/channels.
> - Multi-threading and DMA offloading both provide significant gains - up to 3.4x and 6x respectively.
>
> 2. Varying the total move size (folio count = 1, 8, ..., 8192) for a fixed
>    folio size of 2MB, using only a single thread/channel.
>    Metric: Throughput is reported in GB/s.
>
> folio_cnt | Baseline | MTCPU | DMA
> ====================================================
> 1 | 7.96±2.22 | 6.43±0.66 | 6.52±0.45 |
> 8 | 8.20±0.75 | 8.82±1.10 | 8.88±0.54 |
> 16 | 7.54±0.61 | 9.06±0.95 | 9.03±0.62 |
> 32 | 8.68±0.77 | 10.11±0.42 | 10.17±0.50 |
> 64 | 9.08±1.03 | 10.12±0.44 | 11.21±0.24 |
> 256 | 10.53±0.39 | 10.77±0.28 | 12.43±0.12 |
> 512 | 10.59±0.29 | 10.81±0.19 | 12.61±0.07 |
> 2048 | 10.86±0.26 | 11.05±0.05 | 12.75±0.03 |
> 8192 | 10.84±0.18 | 11.12±0.05 | 12.81±0.02 |
>
> - Throughput increases with folio count but plateaus after a threshold.
>   (migrate_pages() processes at most NR_MAX_BATCHED_MIGRATION (512)
>   folios at a time.)
>
> Performance Analysis (V2 vs V3):
>
> The new SG-based DMA driver dramatically reduces software overhead. By
> switching from per-folio dma_map_page() to batch dma_map_sgtable(), setup
> time improves by an order of magnitude for large batches.
> This is most visible with 4KB folios, making DMA viable even for smaller
> page sizes. For 2MB THP migrations, where hardware transfer time is more
> dominant, the gains are more modest.
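
The core of that change, condensed into a sketch (not dcbm.c itself:
sg_table construction from the folio lists, cookie error checks, and
unwinding of in-flight descriptors are omitted, and the mapping
directions are my assumption):

#include <linux/completion.h>
#include <linux/dmaengine.h>
#include <linux/scatterlist.h>

static void batch_done(void *arg)
{
	complete(arg);
}

/* Map the whole batch once per side and raise a single completion
 * interrupt at the end, instead of per-folio dma_map_page() calls.
 * Assumes src and dst tables have the same segment layout.
 */
static int copy_batch_sg(struct dma_chan *chan, struct sg_table *src_sgt,
			 struct sg_table *dst_sgt)
{
	struct device *dev = dmaengine_get_dma_device(chan);
	struct dma_async_tx_descriptor *tx;
	DECLARE_COMPLETION_ONSTACK(done);
	struct scatterlist *ss, *ds;
	int i, ret = -EIO;

	if (dma_map_sgtable(dev, src_sgt, DMA_TO_DEVICE, 0))
		return -EIO;
	if (dma_map_sgtable(dev, dst_sgt, DMA_FROM_DEVICE, 0))
		goto unmap_src;

	ds = dst_sgt->sgl;
	for_each_sgtable_dma_sg(src_sgt, ss, i) {
		bool last = (i == src_sgt->nents - 1);

		tx = dmaengine_prep_dma_memcpy(chan, sg_dma_address(ds),
					       sg_dma_address(ss),
					       sg_dma_len(ss),
					       last ? DMA_PREP_INTERRUPT : 0);
		if (!tx)
			goto unmap_dst;
		if (last) {	/* one interrupt for the whole batch */
			tx->callback = batch_done;
			tx->callback_param = &done;
		}
		dmaengine_submit(tx);
		ds = sg_next(ds);
	}

	dma_async_issue_pending(chan);
	wait_for_completion(&done);
	ret = 0;
unmap_dst:
	dma_unmap_sgtable(dev, dst_sgt, DMA_FROM_DEVICE, 0);
unmap_src:
	dma_unmap_sgtable(dev, src_sgt, DMA_TO_DEVICE, 0);
	return ret;
}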
>
> OPEN QUESTIONS:
> ---------------
>
> User Interface:
>
> 1. Control Interface Design:
>    The current interface creates separate sysfs files for each driver,
>    which can be confusing. Should we implement a unified interface
>    (/sys/kernel/mm/migration/offload_migrator) that accepts the name of
>    the desired migrator ("kernel", "mtcopy", "dcbm")? This would ensure
>    only one migrator is active at a time. Is this the right approach?
>
> 2. Dynamic Migrator Selection:
>    Currently, the active migrator is global state, and only one can be
>    active at a time. A more flexible approach might be for the caller of
>    migrate_pages() to specify or hint which offload mechanism to use, if
>    any. This would allow a CXL driver to explicitly request DMA while a
>    GPU driver might prefer multi-threaded CPU copy.
>
> 3. Tuning Parameters:
>    Expose parameters like the number of threads/channels, batch size, and
>    thresholds for using migrators. Who should own these parameters?
>
> 4. Resource Accounting [3]:
>    a. CPU cgroups accounting and fairness
>    b. Migration cost attribution
>
> FUTURE WORK:
> ------------
>
> 1. Enhance DMA drivers for bulk copying (e.g., SDXi Engine).
> 2. Enhance multi-threaded CPU copying with platform-specific scheduling of
>    worker threads to optimize bandwidth utilization; explore sched-ext for
>    this. [2]
> 3. Enable kpromoted [4] to use the migration offload infrastructure.
>
> EARLIER POSTINGS:
> -----------------
>
> - RFC V2: https://lore.kernel.org/all/20250319192211.10092-1-shivankg@amd.com
> - RFC V1: https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
>
> REFERENCES:
> -----------
>
> [1] RFC from Zi Yan: https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
> [2] LSFMM: https://lore.kernel.org/all/cf6fc05d-c0b0-4de3-985e-5403977aa3aa@amd.com
> [3] https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com
> [4] https://lore.kernel.org/all/20250910144653.212066-1-bharata@amd.com
>
> Mike Day (1):
>   mm: add support for copy offload for folio migration
>
> Shivank Garg (4):
> mm: Introduce folios_mc_copy() for batch copying folios
> mm/migrate: add migrate_folios_batch_move to batch the folio move
> operations
> dcbm: add dma core batch migrator for batch page offloading
> mtcopy: spread threads across die for testing
>
> Zi Yan (4):
> mm/migrate: factor out code in move_to_new_folio() and
> migrate_folio_move()
> mm/migrate: revive MIGRATE_NO_COPY in migrate_mode
> mtcopy: introduce multi-threaded page copy routine
> adjust NR_MAX_BATCHED_MIGRATION for testing
>
> drivers/Kconfig | 2 +
> drivers/Makefile | 3 +
> drivers/migoffcopy/Kconfig | 17 +
> drivers/migoffcopy/Makefile | 2 +
> drivers/migoffcopy/dcbm/Makefile | 1 +
> drivers/migoffcopy/dcbm/dcbm.c | 415 +++++++++++++++++++++++++
> drivers/migoffcopy/mtcopy/Makefile | 1 +
> drivers/migoffcopy/mtcopy/copy_pages.c | 397 +++++++++++++++++++++++
> include/linux/migrate_mode.h | 2 +
> include/linux/migrate_offc.h | 34 ++
> include/linux/mm.h | 2 +
> mm/Kconfig | 8 +
> mm/Makefile | 1 +
> mm/migrate.c | 358 ++++++++++++++++++---
> mm/migrate_offc.c | 58 ++++
> mm/util.c | 29 ++
> 16 files changed, 1284 insertions(+), 46 deletions(-)
> create mode 100644 drivers/migoffcopy/Kconfig
> create mode 100644 drivers/migoffcopy/Makefile
> create mode 100644 drivers/migoffcopy/dcbm/Makefile
> create mode 100644 drivers/migoffcopy/dcbm/dcbm.c
> create mode 100644 drivers/migoffcopy/mtcopy/Makefile
> create mode 100644 drivers/migoffcopy/mtcopy/copy_pages.c
> create mode 100644 include/linux/migrate_offc.h
> create mode 100644 mm/migrate_offc.c
>
> --
> 2.43.0
Best Regards,
Yan, Zi