Message-ID: <03b8be5c-64e0-4577-b4f0-0d505eff04bf@amd.com>
Date: Wed, 12 Nov 2025 20:10:58 +0800
From: "Honglei1.Huang@....com" <honghuan@....com>
To: Christian König <christian.koenig@....com>
Cc: Felix.Kuehling@....com, alexander.deucher@....com, Ray.Huang@....com,
 dmitry.osipenko@...labora.com, Xinhui.Pan@....com, airlied@...il.com,
 daniel@...ll.ch, amd-gfx@...ts.freedesktop.org,
 dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
 linux-mm@...ck.org, akpm@...ux-foundation.org,
 Honglei Huang <honglei1.huang@....com>
Subject: Re: [RFC PATCH 0/5] drm/amdkfd: Add batch SVM range registration
 support

Hi Christian,

Thank you very much for the detailed feedback and insights. Your
comments are very helpful and clear.

On 2025/11/12 16:34, Christian König wrote:
> Hi,
> 
> On 11/12/25 08:29, Honglei Huang wrote:
>> Hi all,
>>
>> This RFC patch series introduces a new mechanism for batch registration of
>> multiple non-contiguous SVM (Shared Virtual Memory) ranges in a single ioctl
>> call. The primary goal of this series is to start a discussion about the best
>> approach to handle scattered user memory allocations in GPU workloads.
>>
>> Background and Motivation
>> ==========================
>>
>> Current applications using ROCm/HSA often need to register many scattered
>> memory buffers (e.g., multiple malloc() allocations) for GPU access. With the
>> existing AMDKFD_IOC_SVM ioctl, each range must be registered individually,
>> leading to:
>> - Blocking issue in some special use cases with many memory ranges
>> - High system call overhead when dealing with dozens or hundreds of ranges
>> - Inefficient resource management
>> - Complexity in userspace applications
>>
>> Use Case Example
>> ================
>>
>> Consider a typical ML/HPC workload that allocates 100+ small buffers across
>> different parts of the address space. Currently, this requires 100+ separate
>> ioctl calls. The proposed batch interface reduces this to a single call.
> 
> Yeah, that's an intentional limitation.
> 
> In an IOCTL interface you usually need to guarantee that the operation either completes or fails in a transactional manner.
> 
> It is possible to implement this, but usually rather tricky if you do multiple operations in a single IOCTL. So you really need a good use case to justify the added complexity.
> 

You're absolutely right about the transactional complexity. This 
operation indeed requires proper rollback mechanisms and error handling 
to maintain atomicity.
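
To make the all-or-nothing requirement concrete, here is a minimal
userspace sketch of the rollback pattern. It is not the code from the
series: struct demo_range, demo_register_one(), demo_unregister_one()
and demo_register_batch() are hypothetical names used only for
illustration, and the per-range "registration" is a stand-in that only
validates alignment and sets a flag.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u

/* Hypothetical range descriptor; NOT the uAPI layout from the series. */
struct demo_range {
	uint64_t addr;       /* start address, must be page aligned */
	uint64_t size;       /* length in bytes, non-zero and page aligned */
	int registered;      /* set once this range has been registered */
};

/* Stand-in for the real per-range work (pinning, mapping, ...). */
int demo_register_one(struct demo_range *r)
{
	if (r->size == 0 || r->addr % DEMO_PAGE_SIZE || r->size % DEMO_PAGE_SIZE)
		return -EINVAL;
	r->registered = 1;
	return 0;
}

/* Undo a successful demo_register_one(). */
void demo_unregister_one(struct demo_range *r)
{
	assert(r->registered);
	r->registered = 0;
}

/*
 * All-or-nothing batch: if range i fails, ranges [0, i) are undone in
 * reverse order and the first error is returned, so the caller never
 * observes a partially registered batch.
 */
int demo_register_batch(struct demo_range *ranges, size_t n)
{
	size_t i;
	int ret = 0;

	for (i = 0; i < n; i++) {
		ret = demo_register_one(&ranges[i]);
		if (ret)
			goto rollback;
	}
	return 0;

rollback:
	while (i-- > 0)
		demo_unregister_one(&ranges[i]);
	return ret;
}
```

With a batch whose third range is misaligned, demo_register_batch()
returns -EINVAL and leaves the first two ranges unregistered again. A
real implementation would also have to cope with concurrent VMA changes
and partially pinned pages, which this sketch ignores.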


>> Paravirtualized environments exacerbate this issue, as KVM's memory backing
>> is often non-contiguous at the host level. In virtualized environments, guest
>> physical memory appears contiguous to the VM but is actually scattered across
>> host memory pages. This fragmentation means that what appears as a single
>> large allocation in the guest may require multiple discrete SVM registrations
>> to properly handle the underlying host memory layout, further multiplying the
>> number of required ioctl calls.
> SVM with dynamic migration under KVM is most likely a dead end to begin with.
> 
> The only possibility to implement it is with memory pinning which is basically userptr.
> 
> Or a rather slow client side IOMMU emulation to catch concurrent DMA transfers to get the necessary information onto the host side.
> 
> Intel calls this approach colIOMMU: https://www.usenix.org/system/files/atc20-paper236-slides-tian.pdf
> 

This is very helpful context. Your confirmation that memory pinning
(userptr-style) is the practical approach helps me understand that what
I initially saw as a "workaround" is actually the intended solution for
this use case.
I'll study colIOMMU to better understand the alternatives and their
trade-offs.

>> Current Implementation - A Workaround Approach
>> ===============================================
>>
>> This patch series implements a WORKAROUND solution that pins user pages in
>> memory to enable batch registration. While functional, this approach has
>> several significant limitations:
>>
>> **Major Concern: Memory Pinning**
>> - The implementation uses pin_user_pages_fast() to lock pages in RAM
>> - This defeats the purpose of SVM's on-demand paging mechanism
>> - Prevents memory oversubscription and dynamic migration
>> - May cause memory pressure on systems with limited RAM
>> - Goes against the fundamental design philosophy of HMM-based SVM
> 
> That again is perfectly intentional. Any other mode doesn't really make sense with KVM.
> 
>> **Known Limitations:**
>> 1. Increased memory footprint due to pinned pages
>> 2. Potential for memory fragmentation
>> 3. No support for transparent huge pages in pinned regions
>> 4. Limited interaction with memory cgroups and resource controls
>> 5. Complexity in handling VMA operations and lifecycle management
>> 6. May interfere with NUMA optimization and page migration
>>
>> Why Submit This RFC?
>> ====================
>>
>> Despite the limitations above, I am submitting this series to:
>>
>> 1. **Start the Discussion**: I want community feedback on whether batch
>>     registration is a useful feature worth pursuing.
>>
>> 2. **Explore Better Alternatives**: Is there a way to achieve batch
>>     registration without pinning? Could I extend HMM to better support
>>     this use case?
> 
> There is an ongoing unification project between KFD and KGD, we are currently looking into the SVM part on a weekly basis.
> 
> That said, we probably need a really good justification to add new features to the KFD interfaces, because this is going to delay the unification.
> 
> Regards,
> Christian.

Thank you for sharing this critical information. Is there a public 
discussion forum or mailing list for the KFD/KGD unification where I 
could follow progress and understand the design direction?

Regarding the use case justification: I need to be honest here - the
primary driver for this feature is indeed KVM/virtualized environments.
The scattered allocation problem exists in native environments too, but
the overhead is tolerable there. However, I do want to raise one 
consideration for the unified interface design:

GPU computing in virtualized/cloud environments is growing rapidly:
major cloud providers (AWS, Azure) now offer GPU instances, and ROCm in
containers/VMs is becoming more common. So while my current use case is
specific to KVM, the virtualized GPU workload pattern may become more
prevalent.

During the unified interface design, please keep the door open for
batch-style operations if they don't complicate the core design.

I really appreciate your time and guidance on this.

Regards,
Honglei



> 
>>
>> 3. **Understand Trade-offs**: For some workloads, the performance benefit
>>     of batch registration might outweigh the drawbacks of pinning. I'd
>>     like to understand where the balance lies.
>>
>> Questions for the Community
>> ============================
>>
>> 1. Are there existing mechanisms in HMM or mm that could support batch
>>     operations without pinning?
>>
>> 2. Would a different approach (e.g., async registration, delayed validation)
>>     be more acceptable?
>>
>> Alternative Approaches Considered
>> ==================================
>>
>> I've considered several alternatives:
>>
>> A) **Pure HMM approach**: Register ranges without pinning, rely entirely on
>>
>> B) **Userspace batching library**: Hide multiple ioctls behind a library.
>>
>> Patch Series Overview
>> =====================
>>
>> Patch 1: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute type
>> Patch 2: Define data structures for batch SVM range registration
>> Patch 3: Add new AMDKFD_IOC_SVM_RANGES ioctl command
>> Patch 4: Implement page pinning mechanism for scattered ranges
>> Patch 5: Wire up the ioctl handler and attribute processing
>>
>> Testing
>> =======
>>
>> The series has been tested with:
>> - Multiple scattered malloc() allocations (2-2000+ ranges)
>> - Various allocation sizes (4KB to 1G+)
>> - GPU compute workloads using the registered ranges
>> - Memory pressure scenarios
>> - OpenCL CTS in KVM guest environment
>> - HIP catch tests in KVM guest environment
>> - Some AI applications like Stable Diffusion, ComfyUI, 3B LLM models based
>>    on HuggingFace transformers
>>
>> I understand this approach is not ideal and am committed to working on a
>> better solution based on community feedback. This RFC is the starting point
>> for that discussion.
>>
>> Thank you for your time and consideration.
>>
>> Best regards,
>> Honglei Huang
>>
>> ---
>>
>> Honglei Huang (5):
>>    drm/amdkfd: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute
>>    drm/amdkfd: Add SVM ranges data structures
>>    drm/amdkfd: Add AMDKFD_IOC_SVM_RANGES ioctl command
>>    drm/amdkfd: Add support for pinned user pages in SVM ranges
>>    drm/amdkfd: Wire up SVM ranges ioctl handler
>>
>>   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c |  67 +++++++++++
>>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c     | 232 +++++++++++++++++++++++++++++--
>>   drivers/gpu/drm/amd/amdkfd/kfd_svm.h     |   3 +
>>   include/uapi/linux/kfd_ioctl.h           |  52 +++++++-
>>   4 files changed, 348 insertions(+), 6 deletions(-)
> 

