Message-ID: <87tt0sfst3.fsf@DESKTOP-5N7EMDA>
Date: Wed, 24 Sep 2025 11:11:36 +0800
From: "Huang, Ying" <ying.huang@...ux.alibaba.com>
To: Zi Yan <ziy@...dia.com>
Cc: Shivank Garg <shivankg@....com>, akpm@...ux-foundation.org,
david@...hat.com, willy@...radead.org, matthew.brost@...el.com,
joshua.hahnjy@...il.com, rakie.kim@...com, byungchul@...com,
gourry@...rry.net, apopple@...dia.com, lorenzo.stoakes@...cle.com,
Liam.Howlett@...cle.com, vbabka@...e.cz, rppt@...nel.org,
surenb@...gle.com, mhocko@...e.com, vkoul@...nel.org,
lucas.demarchi@...el.com, rdunlap@...radead.org, jgg@...pe.ca,
kuba@...nel.org, justonli@...omium.org, ivecera@...hat.com,
dave.jiang@...el.com, Jonathan.Cameron@...wei.com,
dan.j.williams@...el.com, rientjes@...gle.com,
Raghavendra.KodsaraThimmappa@....com, bharata@....com,
alirad.malek@...corp.com, yiannis@...corp.com, weixugc@...gle.com,
linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [RFC V3 0/9] Accelerate page migration with batch copying and
hardware offload
Zi Yan <ziy@...dia.com> writes:
> On 23 Sep 2025, at 21:49, Huang, Ying wrote:
>
>> Hi, Shivank,
>>
>> Thanks for working on this!
>>
>> Shivank Garg <shivankg@....com> writes:
>>
>>> This is the third RFC of the patchset to enhance page migration by batching
>>> folio-copy operations and enabling acceleration via multi-threaded CPU or
>>> DMA offload.
>>>
>>> Single-threaded, folio-by-folio copying bottlenecks page migration
>>> in modern systems with deep memory hierarchies, especially for large
>>> folios where copy overhead dominates, leaving significant hardware
>>> potential untapped.
>>>
>>> By batching the copy phase, we create an opportunity for significant
>>> hardware acceleration. This series builds a framework for this acceleration
>>> and provides two initial offload driver implementations: one using multiple
>>> CPU threads (mtcopy) and another leveraging the DMAEngine subsystem (dcbm).
>>>
>>> This version incorporates significant feedback to improve correctness,
>>> robustness, and the efficiency of the DMA offload path.
>>>
>>> Changelog since V2:
>>>
>>> 1. DMA Engine Rewrite:
>>> - Switched from per-folio dma_map_page() to batch dma_map_sgtable()
>>> - Single completion interrupt per batch (reduced overhead)
>>> - Order of magnitude improvement in setup time for large batches
>>> 2. Code cleanups and refactoring
>>> 3. Rebased on latest mainline (6.17-rc6+)
>>>
>>> MOTIVATION:
>>> -----------
>>>
>>> Current Migration Flow:
>>> [ move_pages(), Compaction, Tiering, etc. ]
>>> |
>>> v
>>> [ migrate_pages() ] // Common entry point
>>> |
>>> v
>>> [ migrate_pages_batch() ] // NR_MAX_BATCHED_MIGRATION (512) folios at a time
>>> |
>>> |--> [ migrate_folio_unmap() ]
>>> |
>>> |--> [ try_to_unmap_flush() ] // Perform a single, batched TLB flush
>>> |
>>> |--> [ migrate_folios_move() ] // Bottleneck: Interleaved copy
>>>      - For each folio:
>>>        - Metadata prep: Copy flags, mappings, etc.
>>>        - folio_copy() <-- Single-threaded, serial data copy.
>>>        - Update PTEs & finalize for that single folio.
>>>
>>> Understanding overheads in page migration (move_pages() syscall):
>>>
>>> Total move_pages() overheads = folio_copy() + Other overheads
>>> 1. folio_copy() is the core copy operation that interests us.
>>> 2. The remaining operations are user/kernel transitions, page table walks,
>>> locking, folio unmap, dst folio alloc, TLB flush, copying flags, updating
>>> mappings and PTEs etc. that contribute to the remaining overheads.
>>>
>>> Percentage of folio_copy() overheads in move_pages(N pages) syscall time,
>>> by number of pages being migrated and folio size:
>>>
>>>               4KB     2MB
>>> 1 page        <1%     ~66%
>>> 512 pages     ~35%    ~97%
>>>
>>> Based on Amdahl's Law, optimizing folio_copy() for large pages offers a
>>> substantial performance opportunity.
>>>
>>> move_pages() syscall speedup = 1 / ((1 - F) + (F / S))
>>> Where F is the fraction of time spent in folio_copy() and S is the speedup of
>>> folio_copy().
>>>
>>> For 4KB folios, folio copy overheads in single-page migrations are too
>>> small to affect the overall speedup. Even for 512 pages, the maximum
>>> theoretical speedup is limited to ~1.54x with infinite folio_copy() speedup.
>>>
>>> For 2MB THPs, folio copy overheads are significant even in single page
>>> migrations, with a theoretical speedup of ~3x with infinite folio_copy()
>>> speedup and up to ~33x for 512 pages.
>>>
>>> A realistic value of S (speedup of folio_copy()) is 7.5x for DMA offload
>>> based on my measurements for copying 512 2MB pages.
>>> This gives move_pages() a practical speedup of ~6.3x for 512 2MB pages (also
>>> observed in the experiments below).
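The figures quoted above can be re-derived from the formula. A minimal sketch follows; the function names are illustrative only and do not come from the patchset, and the inputs are the F and S values quoted in the cover letter.

```c
#include <assert.h>

/*
 * Amdahl's Law: overall speedup when a fraction f of the runtime
 * is accelerated by a factor s.
 */
double amdahl_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}

/*
 * Upper bound on the overall speedup as s grows without limit,
 * i.e. when the accelerated fraction costs nothing.
 */
double amdahl_limit(double f)
{
    assert(f >= 0.0 && f < 1.0);
    return 1.0 / (1.0 - f);
}
```

With F = 0.35, 0.66 and 0.97, amdahl_limit() gives roughly 1.54, 2.9 and 33, and amdahl_speedup(0.97, 7.5) gives roughly 6.3, which is consistent with the numbers in the quoted text.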
>>>
>>> DESIGN: A Pluggable Migrator Framework
>>> ---------------------------------------
>>>
>>> Introduce migrate_folios_batch_move():
>>>
>>> [ migrate_pages_batch() ]
>>> |
>>> |--> migrate_folio_unmap()
>>> |
>>> |--> try_to_unmap_flush()
>>> |
>>> +--> [ migrate_folios_batch_move() ] // new batched design
>>> |
>>> |--> Metadata migration
>>> | - Metadata prep: Copy flags, mappings, etc.
>>> | - Use MIGRATE_NO_COPY to skip the actual data copy.
>>> |
>>> |--> Batch copy folio data
>>> | - Migrator is configurable at runtime via sysfs.
>>> |
>>> |        static_call(_folios_copy)   // Pluggable migrators
>>> |        /           |             \
>>> |      v             v               v
>>> | [ Default ] [ MT CPU copy ] [ DMA Offload ]
>>> |
>>> +--> Update PTEs to point to dst folios and complete migration.
>>>
>>
>> I am just jumping into the discussion, so this may have been discussed
>> already. Sorry if so. Why not
>>
>> migrate_folios_unmap()
>> try_to_unmap_flush()
>> copy folios in parallel if possible
>> migrate_folios_move(): with MIGRATE_NO_COPY?
>
> Because move_to_new_folio() performs various migration preparation steps
> that can fail, copying folios regardless might lead to some unnecessary
> work. What is your take on this?
Good point, we should skip copying folios that fail the checks.
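To illustrate that ordering, here is a rough user-space sketch, not kernel code. All toy_* names are hypothetical, and toy_prepare() merely stands in for whatever checks in move_to_new_folio() can fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a src/dst folio pair; not a kernel type. */
struct toy_folio {
    const char *src;
    char *dst;
    size_t len;
    bool prep_ok;   /* result of the pre-copy checks */
    bool copied;    /* set once the data copy has been done */
};

/* Stand-in for the preparation steps that may fail. */
static bool toy_prepare(const struct toy_folio *f)
{
    return f->src != NULL && f->dst != NULL && f->len > 0;
}

/* Returns the number of folios whose data was actually copied. */
size_t toy_batch_move(struct toy_folio *batch, size_t n)
{
    size_t copied = 0;

    /* Pass 1: run the checks for every folio and record the result. */
    for (size_t i = 0; i < n; i++)
        batch[i].prep_ok = toy_prepare(&batch[i]);

    /*
     * Pass 2: copy only the folios that passed. This loop is the part
     * that a multi-threaded or DMA migrator could take over.
     */
    for (size_t i = 0; i < n; i++) {
        batch[i].copied = false;
        if (!batch[i].prep_ok)
            continue;
        memcpy(batch[i].dst, batch[i].src, batch[i].len);
        batch[i].copied = true;
        copied++;
    }
    return copied;
}
```

Folios that fail in the first pass never reach the copy loop, so no copy work is wasted on them.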
>>
>>> User Control of Migrator:
>>>
>>> # echo 1 > /sys/kernel/dcbm/offloading
>>> |
>>> +--> Driver's sysfs handler
>>> |
>>> +--> calls start_offloading(&cpu_migrator)
>>> |
>>> +--> calls offc_update_migrator()
>>> |
>>> +--> static_call_update(_folios_copy, mig->migrate_offc)
>>>
>>> Later, during migration ...
>>> migrate_folios_batch_move()
>>> |
>>> +--> static_call(_folios_copy) // Now dispatches to the selected migrator
>>> |
>>> +-> [ mtcopy | dcbm | kernel_default ]
>>>
>>
>> [snip]
---
Best Regards,
Huang, Ying