Message-ID: <20251112073549.3717925-1-honglei1.huang@amd.com>
Date: Wed, 12 Nov 2025 15:35:49 +0800
From: Honglei Huang <honglei1.huang@....com>
To: <Felix.Kuehling@....com>, <alexander.deucher@....com>,
<christian.koenig@....com>, <Ray.Huang@....com>
CC: <dmitry.osipenko@...labora.com>, <Xinhui.Pan@....com>,
<airlied@...il.com>, <daniel@...ll.ch>, <amd-gfx@...ts.freedesktop.org>,
<dri-devel@...ts.freedesktop.org>, <linux-kernel@...r.kernel.org>,
<linux-mm@...ck.org>, <akpm@...ux-foundation.org>, <honghuan@....com>,
Honglei Huang <Honglei1.Huang@....com>
Subject: [RFC PATCH 0/5] drm/amdkfd: Add batch SVM range registration support
From: Honglei Huang <Honglei1.Huang@....com>
Hi all,
This RFC patch series introduces a new mechanism for batch registration of
multiple non-contiguous SVM (Shared Virtual Memory) ranges in a single ioctl
call. The primary goal of this series is to start a discussion about the best
approach to handle scattered user memory allocations in GPU workloads.
Background and Motivation
==========================
Current applications using ROCm/HSA often need to register many scattered
memory buffers (e.g., multiple malloc() allocations) for GPU access. With the
existing AMDKFD_IOC_SVM ioctl, each range must be registered individually,
leading to:
- Blocking issues in use cases that involve many memory ranges
- High system call overhead when dealing with dozens or hundreds of ranges
- Inefficient resource management
- Complexity in userspace applications
Use Case Example
================
Consider a typical ML/HPC workload that allocates 100+ small buffers across
different parts of the address space. Currently, this requires 100+ separate
ioctl calls. The proposed batch interface reduces this to a single call.
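To make the call-count difference concrete, here is a minimal userspace
sketch. It is not code from this series: struct svm_range_desc, struct
call_counter and fake_register() are hypothetical stand-ins that only count
simulated kernel entries, and the descriptor layout is an assumption rather
than the uAPI defined in patch 2.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range descriptor; NOT the uAPI layout from this series. */
struct svm_range_desc {
    uint64_t start_addr; /* start address of the range */
    uint64_t size;       /* length in bytes */
};

/* Stand-in for the kernel: counts entries instead of issuing ioctl(2). */
struct call_counter {
    size_t calls;       /* simulated kernel entries */
    size_t ranges_seen; /* ranges handed over across all entries */
};

/* Simulated kernel entry point taking one or more ranges per call. */
static int fake_register(struct call_counter *c,
                         const struct svm_range_desc *r, size_t n)
{
    if (!r || n == 0)
        return -1;
    c->calls++;
    c->ranges_seen += n;
    return 0;
}

/* Existing model: one kernel entry per range. */
int register_one_by_one(struct call_counter *c,
                        const struct svm_range_desc *r, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int ret = fake_register(c, &r[i], 1);
        if (ret)
            return ret;
    }
    return 0;
}

/* Proposed model: the whole array goes down in a single kernel entry. */
int register_batch(struct call_counter *c,
                   const struct svm_range_desc *r, size_t n)
{
    return fake_register(c, r, n);
}
```

With 100 descriptors, the first helper enters the simulated kernel 100 times
and the second once; the number of ranges processed is the same in both cases.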
Paravirtualized environments exacerbate this issue, as KVM's memory backing
is often non-contiguous at the host level. Guest physical memory appears
contiguous to the VM but is actually scattered across
host memory pages. This fragmentation means that what appears as a single
large allocation in the guest may require multiple discrete SVM registrations
to properly handle the underlying host memory layout, further multiplying the
number of required ioctl calls.
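The fragmentation effect can be sketched as follows. This is an illustration
only, under the assumption that a table of host page frame numbers backing
consecutive guest pages is available; host_pfn, struct host_run and
split_into_host_runs() are hypothetical names, and the series itself may
derive its ranges differently.

```c
#include <stddef.h>
#include <stdint.h>

/* One run of host pages that are physically consecutive (hypothetical). */
struct host_run {
    uint64_t first_pfn; /* first host page frame number of the run */
    uint64_t npages;    /* number of consecutive host pages */
};

/*
 * host_pfn[i] is the host frame backing the i-th page of a region that is
 * contiguous in the guest. Consecutive frames are merged into one run.
 * Returns the number of runs written to out; stops early if out is full.
 */
size_t split_into_host_runs(const uint64_t *host_pfn, size_t npages,
                            struct host_run *out, size_t max_out)
{
    size_t nruns = 0;

    for (size_t i = 0; i < npages; i++) {
        /* Extend the current run when this frame directly follows it. */
        if (nruns > 0 &&
            host_pfn[i] == out[nruns - 1].first_pfn + out[nruns - 1].npages) {
            out[nruns - 1].npages++;
            continue;
        }
        if (nruns == max_out)
            break;
        out[nruns].first_pfn = host_pfn[i];
        out[nruns].npages = 1;
        nruns++;
    }
    return nruns;
}
```

For a six-page guest allocation backed by host frames 100, 101, 102, 500,
501, 7, the function reports three runs, so a single guest-side allocation
would turn into three discrete ranges in this model.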
Current Implementation - A Workaround Approach
===============================================
This patch series implements a WORKAROUND solution that pins user pages in
memory to enable batch registration. While functional, this approach has
several significant limitations:
**Major Concern: Memory Pinning**
- The implementation uses pin_user_pages_fast() to lock pages in RAM
- This defeats the purpose of SVM's on-demand paging mechanism
- Prevents memory oversubscription and dynamic migration
- May cause memory pressure on systems with limited RAM
- Goes against the fundamental design philosophy of HMM-based SVM
**Known Limitations:**
1. Increased memory footprint due to pinned pages
2. Potential for memory fragmentation
3. No support for transparent huge pages in pinned regions
4. Limited interaction with memory cgroups and resource controls
5. Complexity in handling VMA operations and lifecycle management
6. May interfere with NUMA optimization and page migration
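As a rough illustration of limitation 1: pinning works at page granularity,
so a small buffer that straddles a page boundary holds down every page it
touches. The helpers below are a sketch, assuming 4 KiB pages and ranges
rounded out to page boundaries before pinning; SKETCH_PAGE_SIZE,
pinned_pages() and pinned_bytes() are hypothetical names, not part of this
series.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ull /* assumption: 4 KiB base pages */

/* Number of pages touched by the byte range [addr, addr + len). */
uint64_t pinned_pages(uint64_t addr, uint64_t len)
{
    if (len == 0)
        return 0;

    uint64_t first = addr / SKETCH_PAGE_SIZE;
    uint64_t last = (addr + len - 1) / SKETCH_PAGE_SIZE;

    return last - first + 1;
}

/* Bytes of RAM held resident if every touched page is pinned. */
uint64_t pinned_bytes(uint64_t addr, uint64_t len)
{
    return pinned_pages(addr, len) * SKETCH_PAGE_SIZE;
}
```

For example, a 32-byte buffer at address 0x1ff0 crosses from one page into
the next, so pinned_bytes(0x1ff0, 32) is 8192 in this model.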
Why Submit This RFC?
====================
Despite the limitations above, I am submitting this series to:
1. **Start the Discussion**: I want community feedback on whether batch
registration is a useful feature worth pursuing.
2. **Explore Better Alternatives**: Is there a way to achieve batch
registration without pinning? Could I extend HMM to better support
this use case?
3. **Understand Trade-offs**: For some workloads, the performance benefit
of batch registration might outweigh the drawbacks of pinning. I'd
like to understand where the balance lies.
Questions for the Community
============================
1. Are there existing mechanisms in HMM or mm that could support batch
operations without pinning?
2. Would a different approach (e.g., async registration, delayed validation)
be more acceptable?
Alternative Approaches Considered
==================================
I've considered several alternatives:
A) **Pure HMM approach**: Register ranges without pinning, rely entirely on
B) **Userspace batching library**: Hide multiple ioctls behind a library.
Patch Series Overview
=====================
Patch 1: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute type
Patch 2: Define data structures for batch SVM range registration
Patch 3: Add new AMDKFD_IOC_SVM_RANGES ioctl command
Patch 4: Implement page pinning mechanism for scattered ranges
Patch 5: Wire up the ioctl handler and attribute processing
Testing
=======
The series has been tested with:
- Multiple scattered malloc() allocations (2-2000+ ranges)
- Various allocation sizes (4 KB to 1 GB+)
- GPU compute workloads using the registered ranges
- Memory pressure scenarios
- OpenCL CTS in KVM guest environment
- HIP catch tests in KVM guest environment
- Some AI applications like Stable Diffusion, ComfyUI, 3B LLM models based
on HuggingFace transformers
I understand this approach is not ideal and am committed to working on a
better solution based on community feedback. This RFC is the starting point
for that discussion.
Thank you for your time and consideration.
Best regards,
Honglei Huang
Honglei Huang (5):
drm/amdkfd: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute
drm/amdkfd: Add SVM ranges data structures
drm/amdkfd: Add AMDKFD_IOC_SVM_RANGES ioctl command
drm/amdkfd: Add support for pinned user pages in SVM ranges
drm/amdkfd: Wire up SVM ranges ioctl handler
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 67 +++++++
drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 232 ++++++++++++++++++++++-
drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 3 +
include/uapi/linux/kfd_ioctl.h | 52 ++++-
4 files changed, 345 insertions(+), 9 deletions(-)
--
2.34.1