linux-kernel - Re: [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload

Message-ID: <C8E561B3-B9DB-4F58-A2C7-4EE17E08A993@nvidia.com>
Date: Tue, 23 Sep 2025 22:03:18 -0400
From: Zi Yan <ziy@...dia.com>
To: "Huang, Ying" <ying.huang@...ux.alibaba.com>
Cc: Shivank Garg <shivankg@....com>, akpm@...ux-foundation.org,
 david@...hat.com, willy@...radead.org, matthew.brost@...el.com,
 joshua.hahnjy@...il.com, rakie.kim@...com, byungchul@...com,
 gourry@...rry.net, apopple@...dia.com, lorenzo.stoakes@...cle.com,
 Liam.Howlett@...cle.com, vbabka@...e.cz, rppt@...nel.org, surenb@...gle.com,
 mhocko@...e.com, vkoul@...nel.org, lucas.demarchi@...el.com,
 rdunlap@...radead.org, jgg@...pe.ca, kuba@...nel.org, justonli@...omium.org,
 ivecera@...hat.com, dave.jiang@...el.com, Jonathan.Cameron@...wei.com,
 dan.j.williams@...el.com, rientjes@...gle.com,
 Raghavendra.KodsaraThimmappa@....com, bharata@....com,
 alirad.malek@...corp.com, yiannis@...corp.com, weixugc@...gle.com,
 linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [RFC V3 0/9] Accelerate page migration with batch copying and
 hardware offload

On 23 Sep 2025, at 21:49, Huang, Ying wrote:

> Hi, Shivank,
>
> Thanks for working on this!
>
> Shivank Garg <shivankg@....com> writes:
>
>> This is the third RFC of the patchset to enhance page migration by batching
>> folio-copy operations and enabling acceleration via multi-threaded CPU or
>> DMA offload.
>>
>> Single-threaded, folio-by-folio copying bottlenecks page migration
>> in modern systems with deep memory hierarchies, especially for large
>> folios where copy overhead dominates, leaving significant hardware
>> potential untapped.
>>
>> By batching the copy phase, we create an opportunity for significant
>> hardware acceleration. This series builds a framework for this acceleration
>> and provides two initial offload driver implementations: one using multiple
>> CPU threads (mtcopy) and another leveraging the DMAEngine subsystem (dcbm).
>>
>> This version incorporates significant feedback to improve correctness,
>> robustness, and the efficiency of the DMA offload path.
>>
>> Changelog since V2:
>>
>> 1. DMA Engine Rewrite:
>>    - Switched from per-folio dma_map_page() to batch dma_map_sgtable()
>>    - Single completion interrupt per batch (reduced overhead)
>>    - Order of magnitude improvement in setup time for large batches
>> 2. Code cleanups and refactoring
>> 3. Rebased on latest mainline (6.17-rc6+)
>>
>> MOTIVATION:
>> -----------
>>
>> Current Migration Flow:
>> [ move_pages(), Compaction, Tiering, etc. ]
>>               |
>>               v
>>      [ migrate_pages() ] // Common entry point
>>               |
>>               v
>>     [ migrate_pages_batch() ] // NR_MAX_BATCHED_MIGRATION (512) folios at a time
>>       |
>>       |--> [ migrate_folio_unmap() ]
>>       |
>>       |--> [ try_to_unmap_flush() ] // Perform a single, batched TLB flush
>>       |
>>       |--> [ migrate_folios_move() ] // Bottleneck: Interleaved copy
>>            - For each folio:
>>              - Metadata prep: Copy flags, mappings, etc.
>>              - folio_copy()  <-- Single-threaded, serial data copy.
>>              - Update PTEs & finalize for that single folio.
>>
>> Understanding overheads in page migration (move_pages() syscall):
>>
>> Total move_pages() overheads = folio_copy() + Other overheads
>> 1. folio_copy() is the core copy operation that interests us.
>> 2. The remaining overheads come from user/kernel transitions, page table
>>    walks, locking, folio unmap, dst folio alloc, TLB flush, copying flags,
>>    updating mappings and PTEs, etc.
>>
>> Percentage of move_pages(N pages) syscall time spent in folio_copy(),
>> by number of pages migrated (rows) and folio size (columns):
>>
>>              4KB     2MB
>> 1 page       <1%     ~66%
>> 512 pages    ~35%    ~97%
>>
>> Based on Amdahl's Law, optimizing folio_copy() for large pages offers a
>> substantial performance opportunity.
>>
>> move_pages() syscall speedup = 1 / ((1 - F) + (F / S))
>> Where F is the fraction of time spent in folio_copy() and S is the speedup of
>> folio_copy().
>>
>> For 4KB folios, folio copy overhead is too small in single-page migrations
>> to affect overall speedup. Even for 512 pages, the maximum theoretical
>> speedup is limited to ~1.54x with infinite folio_copy() speedup.
>>
>> For 2MB THPs, folio copy overheads are significant even in single page
>> migrations, with a theoretical speedup of ~3x with infinite folio_copy()
>> speedup and up to ~33x for 512 pages.
>>
>> A realistic value of S (speedup of folio_copy()) is 7.5x for DMA offload
>> based on my measurements for copying 512 2MB pages.
>> This gives move_pages() a practical speedup of 6.3x for 512 2MB pages (also
>> observed in the experiments below).
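
The figures quoted above can be checked with a few lines of arithmetic. This is only a sketch of the Amdahl's Law formula from the cover letter; the function name `amdahl_speedup` is made up for illustration, and the fractions F are the approximate values from the table, not new measurements.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the runtime is sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

INF = float("inf")  # models an infinitely fast folio_copy()

# Approximate fraction of move_pages() time spent in folio_copy(),
# taken from the table in the cover letter.
cases = {
    "512 x 4KB": 0.35,
    "1 x 2MB": 0.66,
    "512 x 2MB": 0.97,
}

for name, f in cases.items():
    print(f"{name}: upper bound {amdahl_speedup(f, INF):.2f}x")

# S = 7.5 is the DMA offload copy speedup reported in the cover letter.
print(f"512 x 2MB with S=7.5: {amdahl_speedup(0.97, 7.5):.2f}x")
```

The upper bounds come out at roughly 1.54x, 2.9x and 33x, and the S=7.5 case at roughly 6.3x, which matches the numbers in the quoted text.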
>>
>> DESIGN: A Pluggable Migrator Framework
>> ---------------------------------------
>>
>> Introduce migrate_folios_batch_move():
>>
>> [ migrate_pages_batch() ]
>>     |
>>     |--> migrate_folio_unmap()
>>     |
>>     |--> try_to_unmap_flush()
>>     |
>>     +--> [ migrate_folios_batch_move() ] // new batched design
>>             |
>>             |--> Metadata migration
>>             |    - Metadata prep: Copy flags, mappings, etc.
>>             |    - Use MIGRATE_NO_COPY to skip the actual data copy.
>>             |
>>             |--> Batch copy folio data
>>             |    - Migrator is configurable at runtime via sysfs.
>>             |
>>             |          static_call(_folios_copy) // Pluggable migrators
>>             |          /          |            \
>>             |         v           v             v
>>             | [ Default ]  [ MT CPU copy ]  [ DMA Offload ]
>>             |
>>             +--> Update PTEs to point to dst folios and complete migration.
>>
>
> I am just jumping into the discussion, so this may have been discussed
> already.  Sorry if so.  Why not
>
> migrate_folios_unmap()
> try_to_unmap_flush()
> copy folios in parallel if possible
> migrate_folios_move(): with MIGRATE_NO_COPY?

Because move_to_new_folio() performs various migration preparation steps
that can fail. Copying folios before those steps might lead to some
unnecessary work. What is your take on this?
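
To make the trade-off concrete, here is a toy model, not kernel code. It assumes each folio in a batch has a preparation step that either succeeds or fails, and counts the copies that get thrown away under each ordering. The names `copies_done` and `wasted_copies` and the example batch are hypothetical.

```python
def copies_done(prep_ok, copy_before_prep):
    """Number of folio copies performed for one batch.

    prep_ok[i] is True when the preparation step for folio i succeeds.
    copy_before_prep=True models copying every unmapped folio first;
    copy_before_prep=False models preparing first and copying the survivors.
    """
    if copy_before_prep:
        return len(prep_ok)
    return sum(1 for ok in prep_ok if ok)


def wasted_copies(prep_ok, copy_before_prep):
    """Copies whose destination folio is never used because preparation failed."""
    useful = sum(1 for ok in prep_ok if ok)
    return copies_done(prep_ok, copy_before_prep) - useful


# Hypothetical batch of five folios where preparation fails for two of them.
prep_ok = [True, False, True, True, False]
print(wasted_copies(prep_ok, copy_before_prep=True))   # 2
print(wasted_copies(prep_ok, copy_before_prep=False))  # 0
```

How much this matters depends on how often preparation fails in practice, which this sketch does not try to estimate.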

>
>> User Control of Migrator:
>>
>> # echo 1 > /sys/kernel/dcbm/offloading
>>    |
>>    +--> Driver's sysfs handler
>>         |
>>         +--> calls start_offloading(&cpu_migrator)
>>               |
>>               +--> calls offc_update_migrator()
>>                     |
>>                     +--> static_call_update(_folios_copy, mig->migrate_offc)
>>
>> Later, during migration ...
>> migrate_folios_batch_move()
>>     |
>>     +--> static_call(_folios_copy) // Now dispatches to the selected migrator
>>           |
>>           +-> [ mtcopy | dcbm | kernel_default ]
>>
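
The dispatch flow above can be modelled in user space as a single replaceable function pointer. The sketch below is a Python analogy under stated assumptions, not the series' implementation: `_folios_copy`, `offc_update_migrator`, `migrate_folios_batch_move`, `mtcopy` and `kernel_default` borrow names from the diagram, while the function bodies, the `calls` list and the bytearray "folios" are invented for illustration.

```python
calls = []  # records which migrator handled each batch (illustration only)


def kernel_default(pairs):
    """Stand-in for the default path: copy folios one by one."""
    calls.append("kernel_default")
    for src, dst in pairs:
        dst[:] = src


def mtcopy(pairs):
    """Stand-in for an offload migrator; a real one would copy in parallel."""
    calls.append("mtcopy")
    for src, dst in pairs:
        dst[:] = src


_folios_copy = kernel_default  # current dispatch target


def offc_update_migrator(copy_fn):
    """Models static_call_update(_folios_copy, ...): swap the dispatch target."""
    global _folios_copy
    _folios_copy = copy_fn


def migrate_folios_batch_move(pairs):
    """Only the batch-copy step; metadata and PTE updates are omitted."""
    _folios_copy(pairs)


src = [bytearray(b"folio-a"), bytearray(b"folio-b")]
dst = [bytearray(len(s)) for s in src]

migrate_folios_batch_move(list(zip(src, dst)))  # handled by the default
offc_update_migrator(mtcopy)                    # what the sysfs write triggers
migrate_folios_batch_move(list(zip(src, dst)))  # now handled by mtcopy
print(calls)  # ['kernel_default', 'mtcopy']
```

The point of the analogy is that callers never change: only the target behind the single call site is swapped when a migrator is selected.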
>
> [snip]
>
> ---
> Best Regards,
> Huang, Ying


Best Regards,
Yan, Zi