Message-ID: <20250923174752.35701-1-shivankg@amd.com>
Date: Tue, 23 Sep 2025 17:47:35 +0000
From: Shivank Garg <shivankg@....com>
To: <akpm@...ux-foundation.org>, <david@...hat.com>
CC: <ziy@...dia.com>, <willy@...radead.org>, <matthew.brost@...el.com>,
<joshua.hahnjy@...il.com>, <rakie.kim@...com>, <byungchul@...com>,
<gourry@...rry.net>, <ying.huang@...ux.alibaba.com>, <apopple@...dia.com>,
<lorenzo.stoakes@...cle.com>, <Liam.Howlett@...cle.com>, <vbabka@...e.cz>,
<rppt@...nel.org>, <surenb@...gle.com>, <mhocko@...e.com>,
<vkoul@...nel.org>, <lucas.demarchi@...el.com>, <rdunlap@...radead.org>,
<jgg@...pe.ca>, <kuba@...nel.org>, <justonli@...omium.org>,
<ivecera@...hat.com>, <dave.jiang@...el.com>, <Jonathan.Cameron@...wei.com>,
<dan.j.williams@...el.com>, <rientjes@...gle.com>,
<Raghavendra.KodsaraThimmappa@....com>, <bharata@....com>,
<shivankg@....com>, <alirad.malek@...corp.com>, <yiannis@...corp.com>,
<weixugc@...gle.com>, <linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>
Subject: [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload
This is the third RFC of the patchset to enhance page migration by batching
folio-copy operations and enabling acceleration via multi-threaded CPU or
DMA offload.
Single-threaded, folio-by-folio copying bottlenecks page migration
in modern systems with deep memory hierarchies, especially for large
folios where copy overhead dominates, leaving significant hardware
potential untapped.
By batching the copy phase, we create an opportunity for significant
hardware acceleration. This series builds a framework for this acceleration
and provides two initial offload driver implementations: one using multiple
CPU threads (mtcopy) and another leveraging the DMAEngine subsystem (dcbm).
This version incorporates significant feedback to improve correctness,
robustness, and the efficiency of the DMA offload path.
Changelog since V2:
1. DMA Engine Rewrite:
- Switched from per-folio dma_map_page() to batch dma_map_sgtable()
- Single completion interrupt per batch (reduced overhead)
- Order of magnitude improvement in setup time for large batches
2. Code cleanups and refactoring
3. Rebased on latest mainline (6.17-rc6+)
MOTIVATION:
-----------
Current Migration Flow:
[ move_pages(), Compaction, Tiering, etc. ]
|
v
[ migrate_pages() ] // Common entry point
|
v
[ migrate_pages_batch() ] // NR_MAX_BATCHED_MIGRATION (512) folios at a time
|
|--> [ migrate_folio_unmap() ]
|
|--> [ try_to_unmap_flush() ] // Perform a single, batched TLB flush
|
|--> [ migrate_folios_move() ] // Bottleneck: Interleaved copy
- For each folio:
- Metadata prep: Copy flags, mappings, etc.
- folio_copy() <-- Single-threaded, serial data copy.
- Update PTEs & finalize for that single folio.
Understanding overheads in page migration (move_pages() syscall):
Total move_pages() overheads = folio_copy() + Other overheads
1. folio_copy() is the core copy operation that interests us.
2. The rest comes from user/kernel transitions, page table walks,
   locking, folio unmap, dst folio allocation, TLB flushes, copying
   flags, and updating mappings and PTEs.
Percentage of folio_copy() overhead in move_pages(N pages) syscall time,
by number of pages migrated and folio size:

              4KB      2MB
  1 page      <1%      ~66%
  512 pages   ~35%     ~97%
Based on Amdahl's Law, optimizing folio_copy() for large pages offers a
substantial performance opportunity.
move_pages() syscall speedup = 1 / ((1 - F) + (F / S))
Where F is the fraction of time spent in folio_copy() and S is the speedup of
folio_copy().
For 4KB folios, folio-copy overhead in single-page migrations is too
small to impact overall speedup; even for 512 pages, the maximum
theoretical speedup is limited to ~1.54x with infinite folio_copy()
speedup.
For 2MB THPs, folio-copy overhead is significant even for single-page
migrations, giving a theoretical speedup of ~3x with infinite
folio_copy() speedup and up to ~33x for 512 pages.

A realistic value of S (speedup of folio_copy()) is 7.5x for DMA
offload, based on my measurements for copying 512 2MB pages. This
gives move_pages() a practical speedup of ~6.3x for 512 2MB pages
(also observed in the experiments below).
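
These numbers can be sanity-checked with a trivial standalone
calculator (not part of the patchset):

  #include <stdio.h>

  /* speedup = 1 / ((1 - F) + (F / S)); F = folio_copy() fraction of
   * syscall time, S = folio_copy() speedup */
  static double amdahl(double f, double s)
  {
          return 1.0 / ((1.0 - f) + (f / s));
  }

  int main(void)
  {
          printf("4KB, 512 pages, S=inf: %.2fx\n", 1.0 / (1.0 - 0.35)); /* ~1.54x */
          printf("2MB, 1 page,    S=inf: %.2fx\n", 1.0 / (1.0 - 0.66)); /* ~2.94x */
          printf("2MB, 512 pages, S=inf: %.2fx\n", 1.0 / (1.0 - 0.97)); /* ~33.3x */
          printf("2MB, 512 pages, S=7.5: %.2fx\n", amdahl(0.97, 7.5));  /* ~6.3x */
          return 0;
  }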
DESIGN: A Pluggable Migrator Framework
---------------------------------------
Introduce migrate_folios_batch_move():
[ migrate_pages_batch() ]
|
|--> migrate_folio_unmap()
|
|--> try_to_unmap_flush()
|
+--> [ migrate_folios_batch_move() ] // new batched design
|
|--> Metadata migration
| - Metadata prep: Copy flags, mappings, etc.
| - Use MIGRATE_NO_COPY to skip the actual data copy.
|
|--> Batch copy folio data
| - Migrator is configurable at runtime via sysfs.
|
| static_call(_folios_copy) // Pluggable migrators
| / | \
| v v v
| [ Default ] [ MT CPU copy ] [ DMA Offload ]
|
+--> Update PTEs to point to dst folios and complete migration.
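
A minimal sketch of how this dispatch can be wired up with the
static_call() API. This is illustrative only: _folios_copy and
migrate_offc appear in the flows in this letter, while the struct
migrator layout, callback signature, and the kernel_folios_copy()
default name are assumptions, not the exact patch code.

  #include <linux/list.h>
  #include <linux/static_call.h>

  struct migrator {
          const char *name;
          int (*migrate_offc)(struct list_head *dst_list,
                              struct list_head *src_list);
  };

  /* Default target: the kernel's serial folio-by-folio copy path. */
  static int kernel_folios_copy(struct list_head *dst_list,
                                struct list_head *src_list)
  {
          /* ... copy each dst/src folio pair in turn ... */
          return 0;
  }

  DEFINE_STATIC_CALL(_folios_copy, kernel_folios_copy);

  /* Switch the active migrator (cf. offc_update_migrator() above). */
  static void offc_update_migrator(struct migrator *mig)
  {
          static_call_update(_folios_copy, mig->migrate_offc);
  }

  /* Hot path in migrate_folios_batch_move():
   *         static_call(_folios_copy)(dst_list, src_list);
   */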
User Control of Migrator:
# echo 1 > /sys/kernel/dcbm/offloading
|
+--> Driver's sysfs handler
|
+--> calls start_offloading(&cpu_migrator)
|
+--> calls offc_update_migrator()
|
+--> static_call_update(_folios_copy, mig->migrate_offc)
Later, During Migration ...
migrate_folios_batch_move()
|
+--> static_call(_folios_copy) // Now dispatches to the selected migrator
|
+-> [ mtcopy | dcbm | kernel_default ]
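
A hedged sketch of the driver-side sysfs handler implied by this flow.
start_offloading() and &cpu_migrator come from the diagram above;
stop_offloading() and the attribute boilerplate are assumptions for the
sketch, not the exact driver code.

  #include <linux/kobject.h>
  #include <linux/kstrtox.h>

  static ssize_t offloading_store(struct kobject *kobj,
                                  struct kobj_attribute *attr,
                                  const char *buf, size_t count)
  {
          bool enable;

          if (kstrtobool(buf, &enable))
                  return -EINVAL;

          if (enable)
                  start_offloading(&cpu_migrator); /* framework entry point */
          else
                  stop_offloading();               /* assumed counterpart */

          return count;
  }
  static struct kobj_attribute offloading_attr = __ATTR_WO(offloading);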
PERFORMANCE RESULTS:
--------------------
System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
1 NUMA node per socket, Linux Kernel 6.16.0-rc6, DVFS set to Performance,
PTDMA hardware.
Benchmark: Use move_pages() syscall to move pages between two NUMA nodes.
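
For reference, a minimal userspace sketch of this methodology (not the
actual harness behind the numbers below; error handling trimmed, build
with gcc -O2 bench.c -lnuma):

  #include <numaif.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <time.h>
  #include <unistd.h>

  int main(void)
  {
          long psz = sysconf(_SC_PAGESIZE);
          unsigned long count = (1UL << 30) / psz;       /* 1GB total */
          void **pages = malloc(count * sizeof(*pages));
          int *nodes = malloc(count * sizeof(*nodes));
          int *status = malloc(count * sizeof(*status));
          char *buf = mmap(NULL, count * psz, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          struct timespec t0, t1;
          unsigned long i;
          long ret;

          /* For 2MB folio runs, madvise(MADV_HUGEPAGE) before faulting. */
          memset(buf, 1, count * psz);    /* fault in on the local node */
          for (i = 0; i < count; i++) {
                  pages[i] = buf + i * psz;
                  nodes[i] = 1;           /* destination NUMA node */
          }

          clock_gettime(CLOCK_MONOTONIC, &t0);
          ret = move_pages(0, count, pages, nodes, status, MPOL_MF_MOVE);
          clock_gettime(CLOCK_MONOTONIC, &t1);
          if (ret)
                  fprintf(stderr, "move_pages returned %ld\n", ret);

          double s = (t1.tv_sec - t0.tv_sec) +
                     (t1.tv_nsec - t0.tv_nsec) / 1e9;
          printf("throughput: %.2f GB/s\n", 1.0 / s);
          return 0;
  }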
1. Moving folios of different sizes (4KB, 16KB, ..., 2MB) such that the
   total transfer size is constant (1GB), with varying numbers of
   parallel threads/channels.
Metric: Throughput is reported in GB/s.
a. Baseline (Vanilla kernel, single-threaded, folio-by-folio migration):
Folio size|4K | 16K | 64K | 128K | 256K | 512K | 1M | 2M |
===============================================================================================================
Tput(GB/s)|3.73±0.33| 5.53±0.36 | 5.90±0.56 | 6.34±0.08 | 6.50±0.05 | 6.86±0.61 | 6.92±0.71 | 10.67±0.36 |
b. Multi-threaded CPU copy offload (mtcopy driver, use N Parallel Threads=1,2,4,8,12,16):
Thread | 4K | 16K | 64K | 128K | 256K | 512K | 1M | 2M |
===============================================================================================================
1 | 3.84±0.10 | 5.23±0.31 | 6.01±0.55 | 6.34±0.60 | 7.16±1.00 | 7.12±0.78 | 7.10±0.86 | 10.94±0.13 |
2 | 4.04±0.19 | 6.72±0.38 | 7.68±0.12 | 8.15±0.06 | 8.45±0.09 | 9.29±0.17 | 9.87±1.01 | 17.80±0.12 |
4 | 4.72±0.21 | 8.41±0.70 | 10.08±1.67 | 11.44±2.42 | 10.45±0.17 | 12.60±1.97 | 12.38±1.73 | 31.41±1.14 |
8 | 4.91±0.28 | 8.62±0.13 | 10.40±0.20 | 13.94±3.75 | 11.03±0.61 | 14.96±3.29 | 12.84±0.63 | 33.50±3.29 |
12 | 4.84±0.24 | 8.75±0.08 | 10.16±0.26 | 10.92±0.22 | 11.72±0.14 | 14.02±2.51 | 14.09±2.65 | 34.70±2.38 |
16 | 4.77±0.22 | 8.95±0.69 | 10.36±0.26 | 11.03±0.22 | 11.58±0.30 | 13.88±2.71 | 13.00±0.75 | 35.89±2.07 |
c. DMA offload (dcbm driver, use N DMA Channels=1,2,4,8,12,16):
Chan Cnt| 4K | 16K | 64K | 128K | 256K | 512K | 1M | 2M |
===============================================================================================================
1 | 2.75±0.19 | 2.86±0.13 | 3.28±0.20 | 4.57±0.72 | 5.03±0.62 | 4.69±0.25 | 4.78±0.34 | 12.50±0.24 |
2 | 3.35±0.19 | 4.57±0.19 | 5.35±0.55 | 6.71±0.71 | 7.40±1.07 | 7.38±0.61 | 7.21±0.73 | 14.23±0.34 |
4 | 4.01±0.17 | 6.36±0.26 | 7.71±0.89 | 9.40±1.35 | 10.27±1.96 | 10.60±1.42 | 12.35±2.64 | 26.84±0.91 |
8 | 4.46±0.16 | 7.74±0.13 | 9.72±1.29 | 10.88±0.16 | 12.12±2.54 | 15.62±3.96 | 13.29±2.65 | 45.27±2.60 |
12 | 4.60±0.22 | 8.90±0.84 | 11.26±2.19 | 16.00±4.41 | 14.90±4.38 | 14.57±2.84 | 13.79±3.18 | 59.94±4.19 |
16 | 4.61±0.25 | 9.08±0.79 | 11.14±1.75 | 13.95±3.85 | 13.69±3.39 | 15.47±3.44 | 15.44±4.65 | 63.69±5.01 |
- Throughput increases with folio size. Larger folios benefit more from DMA.
- Scaling shows diminishing returns beyond 8-12 threads/channels.
- Multi-threading and DMA offloading both provide significant gains:
  up to 3.4x and 6x respectively (for 2MB folios).
2. Varying the total move size (folio count = 1, 8, ..., 8192) for a
   fixed folio size of 2MB, using only a single thread/channel.
   Throughput in GB/s:
folio_cnt | Baseline | MTCPU | DMA
====================================================
1 | 7.96±2.22 | 6.43±0.66 | 6.52±0.45 |
8 | 8.20±0.75 | 8.82±1.10 | 8.88±0.54 |
16 | 7.54±0.61 | 9.06±0.95 | 9.03±0.62 |
32 | 8.68±0.77 | 10.11±0.42 | 10.17±0.50 |
64 | 9.08±1.03 | 10.12±0.44 | 11.21±0.24 |
256 | 10.53±0.39 | 10.77±0.28 | 12.43±0.12 |
512 | 10.59±0.29 | 10.81±0.19 | 12.61±0.07 |
2048 | 10.86±0.26 | 11.05±0.05 | 12.75±0.03 |
8192 | 10.84±0.18 | 11.12±0.05 | 12.81±0.02 |
- Throughput increases with folio count but plateaus beyond a threshold
  (migrate_pages() uses a folio batch size of 512,
  NR_MAX_BATCHED_MIGRATION).
Performance Analysis (V2 vs V3):
The new SG-based DMA driver dramatically reduces software overhead. By
switching from per-folio dma_map_page() to batch dma_map_sgtable(), setup
time improves by an order of magnitude for large batches.
This is most visible with 4KB folios, making DMA viable even for smaller
page sizes. For 2MB THP migrations, where hardware transfer time is more
dominant, the gains are more modest.
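
For illustration, a hedged sketch of the batched source-side mapping
described above (not the actual dcbm code; the device handle and folio
array are assumptions for the sketch):

  #include <linux/dma-mapping.h>
  #include <linux/scatterlist.h>

  /* Build one scatterlist covering every source folio in the batch and
   * map it with a single dma_map_sgtable() call, instead of issuing one
   * dma_map_page() per folio. */
  static int map_src_batch(struct device *dev, struct folio **folios,
                           unsigned int nr, struct sg_table *sgt)
  {
          struct scatterlist *sg;
          unsigned int i;
          int ret;

          ret = sg_alloc_table(sgt, nr, GFP_KERNEL);
          if (ret)
                  return ret;

          for_each_sgtable_sg(sgt, sg, i)
                  sg_set_page(sg, folio_page(folios[i], 0),
                              folio_size(folios[i]), 0);

          ret = dma_map_sgtable(dev, sgt, DMA_TO_DEVICE, 0);
          if (ret)
                  sg_free_table(sgt);
          return ret;
  }

The destination folios would get a second sg_table mapped with
DMA_FROM_DEVICE, and the copies would then be issued through the
DMAEngine API with a single completion interrupt for the whole batch.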
OPEN QUESTIONS:
---------------
User-Interface:
1. Control Interface Design:
   The current interface creates a separate sysfs file for each driver,
   which can be confusing. Should we instead implement a unified
   interface (/sys/kernel/mm/migration/offload_migrator) that accepts
   the name of the desired migrator ("kernel", "mtcopy", "dcbm")?
   This would ensure only one migrator is active at a time. Is this
   the right approach? (A rough sketch follows this list.)
2. Dynamic Migrator Selection:
   Currently, the active migrator is global state, and only one can be
   active at a time. A more flexible approach might be for the caller
   of migrate_pages() to specify or hint which offload mechanism to
   use, if any. This would allow a CXL driver to explicitly request
   DMA while a GPU driver might prefer multi-threaded CPU copy.
3. Tuning Parameters:
   Expose parameters like the number of threads/channels, batch size,
   and thresholds for using migrators. Who should own these parameters?
4. Resource Accounting [3]:
   a. CPU cgroups accounting and fairness
   b. Migration cost attribution
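
For open question 1, a rough sketch of what the unified interface's
store handler could look like (purely illustrative; find_migrator()
and the registration scheme behind it are assumptions, while
offc_update_migrator() comes from the flow described earlier):

  /* /sys/kernel/mm/migration/offload_migrator: accepts "kernel",
   * "mtcopy" or "dcbm" and makes that migrator the only active one. */
  static ssize_t offload_migrator_store(struct kobject *kobj,
                                        struct kobj_attribute *attr,
                                        const char *buf, size_t count)
  {
          /* assumed lookup, e.g. sysfs_streq() over registered migrators */
          struct migrator *mig = find_migrator(buf);

          if (!mig)
                  return -EINVAL;

          offc_update_migrator(mig);  /* switch the static_call target */
          return count;
  }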
FUTURE WORK:
------------
1. Enhance DMA drivers for bulk copying (e.g., SDXi Engine).
2. Enhance multi-threaded CPU copying with platform-specific scheduling
   of worker threads to optimize bandwidth utilization; explore
   sched-ext for this [2].
3. Enable kpromoted [4] to use the migration offload infrastructure.
EARLIER POSTINGS:
-----------------
- RFC V2: https://lore.kernel.org/all/20250319192211.10092-1-shivankg@amd.com
- RFC V1: https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
REFERENCES:
-----------
[1] RFC from Zi Yan: https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
[2] LSFMM: https://lore.kernel.org/all/cf6fc05d-c0b0-4de3-985e-5403977aa3aa@amd.com
[3] https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com
[4] https://lore.kernel.org/all/20250910144653.212066-1-bharata@amd.com
Mike Day (1):
mm: add support for copy offload for folio migration
Shivank Garg (4):
mm: Introduce folios_mc_copy() for batch copying folios
mm/migrate: add migrate_folios_batch_move to batch the folio move
operations
dcbm: add dma core batch migrator for batch page offloading
mtcopy: spread threads across die for testing
Zi Yan (4):
mm/migrate: factor out code in move_to_new_folio() and
migrate_folio_move()
mm/migrate: revive MIGRATE_NO_COPY in migrate_mode
mtcopy: introduce multi-threaded page copy routine
adjust NR_MAX_BATCHED_MIGRATION for testing
drivers/Kconfig | 2 +
drivers/Makefile | 3 +
drivers/migoffcopy/Kconfig | 17 +
drivers/migoffcopy/Makefile | 2 +
drivers/migoffcopy/dcbm/Makefile | 1 +
drivers/migoffcopy/dcbm/dcbm.c | 415 +++++++++++++++++++++++++
drivers/migoffcopy/mtcopy/Makefile | 1 +
drivers/migoffcopy/mtcopy/copy_pages.c | 397 +++++++++++++++++++++++
include/linux/migrate_mode.h | 2 +
include/linux/migrate_offc.h | 34 ++
include/linux/mm.h | 2 +
mm/Kconfig | 8 +
mm/Makefile | 1 +
mm/migrate.c | 358 ++++++++++++++++++---
mm/migrate_offc.c | 58 ++++
mm/util.c | 29 ++
16 files changed, 1284 insertions(+), 46 deletions(-)
create mode 100644 drivers/migoffcopy/Kconfig
create mode 100644 drivers/migoffcopy/Makefile
create mode 100644 drivers/migoffcopy/dcbm/Makefile
create mode 100644 drivers/migoffcopy/dcbm/dcbm.c
create mode 100644 drivers/migoffcopy/mtcopy/Makefile
create mode 100644 drivers/migoffcopy/mtcopy/copy_pages.c
create mode 100644 include/linux/migrate_offc.h
create mode 100644 mm/migrate_offc.c
--
2.43.0