linux-kernel - Re: [LSF/MM/BPF TOPIC] Enhancements to Page Migration with Multi-threading and Batch Offloading to DMA

Message-ID: <3b59ea3e-04db-ad38-97b1-20cff0f8f17c@google.com>
Date: Sun, 26 Jan 2025 22:55:48 -0800 (PST)
From: David Rientjes <rientjes@...gle.com>
To: Shivank Garg <shivankg@....com>
cc: akpm@...ux-foundation.org, lsf-pc@...ts.linux-foundation.org, 
    linux-mm@...ck.org, ziy@...dia.com, AneeshKumar.KizhakeVeetil@....com, 
    baolin.wang@...ux.alibaba.com, bharata@....com, david@...hat.com, 
    gregory.price@...verge.com, honggyu.kim@...com, jane.chu@...cle.com, 
    jhubbard@...dia.com, jon.grimm@....com, k.shutemov@...il.com, 
    leesuyeon0506@...il.com, leillc@...gle.com, liam.howlett@...cle.com, 
    linux-kernel@...r.kernel.org, mel.gorman@...il.com, Michael.Day@....com, 
    Raghavendra.KodsaraThimmappa@....com, riel@...riel.com, 
    santosh.shukla@....com, shy828301@...il.com, sj@...nel.org, 
    wangkefeng.wang@...wei.com, weixugc@...gle.com, willy@...radead.org, 
    ying.huang@...ux.alibaba.com
Subject: Re: [LSF/MM/BPF TOPIC] Enhancements to Page Migration with
 Multi-threading and Batch Offloading to DMA

On Thu, 23 Jan 2025, Shivank Garg wrote:

> Hi all,
> 
> Zi Yan and I would like to propose the topic: Enhancements to Page
> Migration with Multi-threading and Batch Offloading to DMA.
> 

I think this would be a very useful topic to discuss, thanks for proposing 
it.

> Page migration is a critical operation in NUMA systems that can incur
> significant overheads, affecting memory management performance across
> various workloads. For example, copying folios between DRAM NUMA nodes
> can take ~25% of the total migration cost for migrating 256MB of data.
> 
> Modern systems are equipped with powerful DMA engines for bulk data
> copying, GPUs, and high CPU core counts. Leveraging these hardware
> capabilities becomes essential for systems where frequent page promotion
> and demotion occur - from large-scale tiered-memory systems with CXL nodes
> to CPU-GPU coherent systems with GPU memory exposed as NUMA nodes.
> 

Indeed, there are multiple use cases for optimizations in this area.  With 
the growing adoption of tiered-memory systems, I think there will be an 
even greater reliance on memory migration going forward.

Do you have numbers to share on how offloading, even as a proof of 
concept, moves the needle compared to traditional, sequential memory 
migration?

> Existing page migration performs sequential page copying, underutilizing
> modern CPU architectures and high-bandwidth memory subsystems.
> 
> We have proposed and posted RFCs to enhance page migration through three
> key techniques:
> 1. Batching migration operations for bulk data copying [1]
> 2. Multi-threaded folio copying [2]
> 3. DMA offloading to hardware accelerators [1]
> 
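
As a rough illustration of techniques 1 and 2 above, here is a minimal
userspace sketch. It is an analogy only, not code from either RFC: one
contiguous buffer stands in for a batch of folios, and POSIX threads stand
in for kernel copy workers. The names parallel_copy, copy_job and
MAX_COPY_THREADS are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical upper bound on workers; not a value from the RFCs. */
#define MAX_COPY_THREADS 16

/* One contiguous slice of the batch, copied by a single worker. */
struct copy_job {
    char *dst;
    const char *src;
    size_t len;
};

static void *copy_worker(void *arg)
{
    struct copy_job *job = arg;

    memcpy(job->dst, job->src, job->len);
    return NULL;
}

/*
 * Copy len bytes from src to dst, split across up to nthreads workers.
 * The data is always fully copied. Returns 0 if every slice was copied
 * by a worker thread, or -1 if at least one slice had to be copied on
 * the calling thread because pthread_create() failed.
 */
static int parallel_copy(void *dst, const void *src, size_t len,
                         unsigned int nthreads)
{
    pthread_t tids[MAX_COPY_THREADS];
    struct copy_job jobs[MAX_COPY_THREADS];
    unsigned int started = 0;
    size_t chunk;
    int ret = 0;

    assert(dst && src);
    if (nthreads == 0)
        nthreads = 1;
    if (nthreads > MAX_COPY_THREADS)
        nthreads = MAX_COPY_THREADS;

    chunk = len / nthreads;

    for (unsigned int i = 0; i < nthreads; i++) {
        size_t off = (size_t)i * chunk;
        struct copy_job *job = &jobs[started];

        job->dst = (char *)dst + off;
        job->src = (const char *)src + off;
        /* The last slice also takes the remainder. */
        job->len = (i == nthreads - 1) ? len - off : chunk;

        if (pthread_create(&tids[started], NULL, copy_worker, job) != 0) {
            /* Fall back to copying this slice on the calling thread. */
            memcpy(job->dst, job->src, job->len);
            ret = -1;
            continue;
        }
        started++;
    }

    /* Wait for all workers before the batch is considered copied. */
    for (unsigned int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    return ret;
}
```

The point of batching first is that the split happens over one large unit
of work, so the thread setup cost is paid once per batch rather than once
per page.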

Curious: does memory migration of pages that are actively undergoing DMA 
with hardware assist fit into any of these?

> By employing batching and multi-threaded folio copying, we are able to
> achieve significant improvements in page migration throughput for large
> pages.
> 
> Discussion points:
> 1. Performance:
>    a. Policy decision for DMA and CPU selection
>    b. Platform-specific scheduling of folio-copy worker threads for better
>       bandwidth utilization

Why platform specific?  I *assume* this means a generic framework that can 
optimize for scheduling based on the underlying hardware and not specific 
implementations that can only be used on AMD, for example.  Is that the 
case?

>    c. Using non-temporal instructions for CPU-based memcpy
>    d. Upscaling/downscaling worker threads based on migration size, CPU
>       availability (system load), bandwidth saturation, etc.
> 2. Interface requirements with DMA hardware:
>    a. Standardizing APIs for DMA drivers and support for different DMA
>       drivers
>    b. Enhancing DMA drivers for bulk copying (e.g., SDXI engine)
> 3. Resource Accounting:
>    a. CPU cgroups accounting and fairness [3]
>    b. Who bears the migration cost? (Migration cost attribution)
> 
> References:
> [1] https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
> [2] https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
> [3] https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com
>
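
To make discussion point 1d concrete, one possible shape for a worker
scaling policy is sketched below. This is purely an assumption for
discussion: pick_copy_workers, MIN_BYTES_PER_WORKER and MAX_COPY_WORKERS
are hypothetical names and values, and the sketch ignores bandwidth
saturation, which the proposal also lists as an input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tunables; neither value comes from the RFCs. */
#define MIN_BYTES_PER_WORKER (2UL << 20) /* 2 MiB of copy work per worker */
#define MAX_COPY_WORKERS 16U

/*
 * Choose how many copy workers to use for one migration batch.
 * Scale up with the amount of data, but never exceed the number of
 * idle CPUs or the global cap, and always use at least one worker.
 */
static unsigned int pick_copy_workers(size_t migrate_bytes,
                                      unsigned int idle_cpus)
{
    size_t by_size = migrate_bytes / MIN_BYTES_PER_WORKER;
    unsigned int n;

    /* Small migrations are not worth splitting. */
    if (by_size == 0)
        return 1;

    n = by_size > MAX_COPY_WORKERS ? MAX_COPY_WORKERS
                                   : (unsigned int)by_size;
    if (n > idle_cpus)
        n = idle_cpus;
    if (n == 0)
        n = 1; /* No idle CPU: copy on a single worker anyway. */

    assert(n >= 1 && n <= MAX_COPY_WORKERS);
    return n;
}
```

With these assumed values, a 256MB migration on a system with 8 idle CPUs
would use 8 workers, while a single 4KB page would always be copied by one
worker.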