Message-ID: <22fe9000-117a-4b14-a51b-1349d01772f0@amd.com>
Date: Tue, 13 Jan 2026 21:40:10 +0800
From: Honglei Huang <honghuan@....com>
To: Felix Kuehling <felix.kuehling@....com>
Cc: dmitry.osipenko@...labora.com, Xinhui.Pan@....com, airlied@...il.com,
daniel@...ll.ch, amd-gfx@...ts.freedesktop.org,
dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, akpm@...ux-foundation.org,
Honglei Huang <honglei1.huang@....com>, alexander.deucher@....com,
Ray.Huang@....com, Christian König
<christian.koenig@....com>
Subject: Re: [PATCH v2 0/4] drm/amdkfd: Add batch userptr allocation support
Hi Felix,
Thank you for the detailed technical guidance. You are absolutely right.
I will follow your suggestion and work on the DRM HMM integration
path you proposed.
That said, I believe moving towards DRM render node APIs will take a
very long time. DRM currently lacks SVM support. SVM is a critical
component that affects almost every aspect of GPU computing, and porting
it to DRM is a massive engineering effort. Realistically, KFD and DRM
will likely need to coexist for the foreseeable future.
For this reason, I will also continue to simplify the current KFD-based
implementation while I work on the long-term DRM solution.
Regards,
Honglei
On 2026/1/13 04:51, Felix Kuehling wrote:
>
> On 2026-01-12 06:55, Honglei Huang wrote:
>>
>> Hi Felix,
>>
>> Thank you for the clarification about the render node transition.
>>
>> I went back and checked the relevant DRM code, and I found that it is
>> missing some infrastructure; it seems that SVM is not supported in
>> DRM.
>>
>> Since most current hardware platforms use the KFD driver, we must
>> rely on the KFD infrastructure to enable this functionality. The DRM
>> stack currently lacks the SVM infrastructure, and building it from
>> scratch is not feasible for immediate deployment needs.
>
> As far as I can tell, you're not using any SVM infrastructure. In fact
> you specifically made the point that SVM wasn't suitable for your
> application because you wanted to map non-contiguous CPU address ranges
> into a contiguous GPU address range. So I don't understand what your
> dependency on SVM infrastructure is here.
>
> The DRM stack uses HMM under the hood for its userptr implementation,
> which should be quite similar to what KFD does. The difference is in the
> MMU notifier handling. I guess that's where some work would be needed so
> that amdgpu_mn_invalidate_range_start_gfx can invoke
> amdgpu_amdkfd_evict_userptr to stop usermode queues. Or maybe some
> allocation flag in the userptr BO that tells amdgpu_hmm_register to hook
> up the HSA MMU notifier.
>
> And then you'd need to add support to the
> amdgpu_amdkfd_restore_userptr_worker to validate and map userptr BOs
> managed through the GEM API.
>
> I'm not saying this is easy. I spent months trying to get this to work
> reliably for DMABuf imports a few years ago.
>
> Regards,
> Felix
>
>
>>
>> Therefore, I plan to continue with my previous direction to find a
>> "minimal impact" technical solution within KFD.
>> Regards,
>> Honglei
>>
>> On 2026/1/10 10:28, Honglei Huang wrote:
>>>
>>> Hi Felix,
>>>
>>> You're right - I understand now that the render node transition is
>>> already underway.
>>> Appreciate the clarification.
>>>
>>> Regards,
>>> Honglei
>>>
>>>
>>> On 2026/1/10 05:14, Kuehling, Felix wrote:
>>>> FWIW, ROCr already uses rendernode APIs for our implementation of
>>>> the CUDA VM API (DMABuf imports into rendernode contexts that share
>>>> the VA space with KFD and VA mappings with more flexibility than
>>>> what we have in the KFD API). So the transition to render node APIs
>>>> has already started, especially in the memory management area. It's
>>>> not some far-off future thing.
>>>>
>>>> Regards,
>>>> Felix
>>>>
>>>> On 2026-01-09 04:07, Christian König wrote:
>>>>> Hi Honglei,
>>>>>
>>>>> I have to agree with Felix. Adding such complexity to the KFD API
>>>>> is a clear no-go from my side.
>>>>>
>>>>> Just skimming over the patch, it's obvious that this isn't correctly
>>>>> implemented. You simply can't handle the MMU notifier ranges like this.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> On 1/9/26 08:55, Honglei Huang wrote:
>>>>>> Hi Felix,
>>>>>>
>>>>>> Thank you for the feedback. I understand your concern about API
>>>>>> maintenance.
>>>>>>
>>>>>> From what I can see, KFD is still the core driver for all GPU
>>>>>> compute workloads. The entire compute ecosystem is built on KFD's
>>>>>> infrastructure and continues to rely on it. While the unification
>>>>>> work is ongoing, any transition to DRM render node APIs would
>>>>>> naturally take considerable time, and KFD is expected to remain
>>>>>> the primary interface for compute for the foreseeable future. This
>>>>>> batch allocation issue is affecting performance in some specific
>>>>>> computing scenarios.
>>>>>>
>>>>>> You're absolutely right about the API proliferation concern. Based
>>>>>> on your feedback, I'd like to revise the approach for v3 to
>>>>>> minimize impact by reusing the existing ioctl instead of adding a
>>>>>> new API:
>>>>>>
>>>>>> - Reuse existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU ioctl
>>>>>> - Add one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>>>>>> - When flag is set, mmap_offset field points to range array
>>>>>> - No new ioctl command, no new structure
>>>>>>
>>>>>> This changes the API surface from adding a new ioctl to adding
>>>>>> just one flag.
>>>>>>
>>>>>> In fact, the implementation modifies DRM's GPU memory management
>>>>>> infrastructure in amdgpu_amdkfd_gpuvm.c. If DRM render nodes need
>>>>>> similar functionality later, these functions could be reused
>>>>>> directly.
>>>>>>
>>>>>> Would you be willing to review v3 with this approach?
>>>>>>
>>>>>> Regards,
>>>>>> Honglei Huang
>>>>>>
>>>>>> On 2026/1/9 03:46, Felix Kuehling wrote:
>>>>>>> I don't have time to review this in detail right now. I am
>>>>>>> concerned about adding new KFD API, when the trend is moving
>>>>>>> towards DRM render node APIs. This creates additional burden for
>>>>>>> ongoing support of these APIs in addition to the inevitable DRM
>>>>>>> render node duplicates we'll have in the future. Would it be
>>>>>>> possible to implement this batch userptr allocation in a render
>>>>>>> node API from the start?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Felix
>>>>>>>
>>>>>>>
>>>>>>> On 2026-01-04 02:21, Honglei Huang wrote:
>>>>>>>> From: Honglei Huang <honghuan@....com>
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> This is v2 of the patch series to support allocating multiple
>>>>>>>> non-contiguous
>>>>>>>> CPU virtual address ranges that map to a single contiguous GPU
>>>>>>>> virtual address.
>>>>>>>>
>>>>>>>> **Key improvements over v1:**
>>>>>>>> - NO memory pinning: uses HMM for page tracking, pages can be
>>>>>>>> swapped/migrated
>>>>>>>> - NO impact on SVM subsystem: avoids complexity during KFD/KGD
>>>>>>>> unification
>>>>>>>> - Better approach: userptr's VA remapping design is ideal for
>>>>>>>> scattered VA registration
>>>>>>>>
>>>>>>>> Based on community feedback, v2 takes a completely different
>>>>>>>> implementation
>>>>>>>> approach by leveraging the existing userptr infrastructure
>>>>>>>> rather than
>>>>>>>> introducing new SVM-based mechanisms that required memory pinning.
>>>>>>>>
>>>>>>>> Changes from v1
>>>>>>>> ===============
>>>>>>>>
>>>>>>>> v1 attempted to solve this problem through the SVM subsystem by:
>>>>>>>> - Adding a new AMDKFD_IOC_SVM_RANGES ioctl for batch SVM range
>>>>>>>> registration
>>>>>>>> - Introducing KFD_IOCTL_SVM_ATTR_MAPPED attribute for special
>>>>>>>> VMA handling
>>>>>>>> - Using pin_user_pages_fast() to pin scattered memory ranges
>>>>>>>> - Registering multiple SVM ranges with pinned pages
>>>>>>>>
>>>>>>>> This approach had significant drawbacks:
>>>>>>>> 1. Memory pinning defeated the purpose of HMM-based SVM's
>>>>>>>> on-demand paging
>>>>>>>> 2. Added complexity to the SVM subsystem
>>>>>>>> 3. Prevented memory oversubscription and dynamic migration
>>>>>>>> 4. Could cause memory pressure due to locked pages
>>>>>>>> 5. Interfered with NUMA optimization and page migration
>>>>>>>>
>>>>>>>> v2 Implementation Approach
>>>>>>>> ==========================
>>>>>>>>
>>>>>>>> 1. **No memory pinning required**
>>>>>>>> - Uses HMM (Heterogeneous Memory Management) for page tracking
>>>>>>>> - Pages are NOT pinned, can be swapped/migrated when not in
>>>>>>>> use
>>>>>>>> - Supports dynamic page eviction and on-demand restore like
>>>>>>>> standard userptr
>>>>>>>>
>>>>>>>> 2. **Zero impact on KFD SVM subsystem**
>>>>>>>> - Extends ALLOC_MEMORY_OF_GPU path, not SVM
>>>>>>>> - New ioctl: AMDKFD_IOC_ALLOC_MEMORY_OF_GPU_BATCH
>>>>>>>> - Zero changes to SVM code, limited scope of changes
>>>>>>>>
>>>>>>>> 3. **Perfect fit for non-contiguous VA registration**
>>>>>>>> - Userptr design naturally supports GPU VA != CPU VA mapping
>>>>>>>> - Multiple non-contiguous CPU VA ranges -> single
>>>>>>>> contiguous GPU VA
>>>>>>>> - Unlike KFD SVM, which maintains VA identity, userptr
>>>>>>>> allows remapping. This VA remapping capability makes
>>>>>>>> userptr ideal for scattered allocations
>>>>>>>>
>>>>>>>> **Implementation Details:**
>>>>>>>> - Each CPU VA range gets its own mmu_interval_notifier for
>>>>>>>> invalidation
>>>>>>>> - All ranges validated together and mapped to contiguous
>>>>>>>> GPU VA
>>>>>>>> - Single kgd_mem object with array of user_range_info
>>>>>>>> structures
>>>>>>>> - Unified eviction/restore path for all ranges in a batch
>>>>>>>>
>>>>>>>> Patch Series Overview
>>>>>>>> =====================
>>>>>>>>
>>>>>>>> Patch 1/4: Add AMDKFD_IOC_ALLOC_MEMORY_OF_GPU_BATCH ioctl and
>>>>>>>> data structures
>>>>>>>> - New ioctl command and kfd_ioctl_userptr_range structure
>>>>>>>> - UAPI for userspace to request batch userptr allocation
>>>>>>>>
>>>>>>>> Patch 2/4: Extend kgd_mem for batch userptr support
>>>>>>>> - Add user_range_info and associated fields to kgd_mem
>>>>>>>> - Data structures for tracking multiple ranges per allocation
>>>>>>>>
>>>>>>>> Patch 3/4: Implement batch userptr allocation and management
>>>>>>>> - Core functions: init_user_pages_batch(),
>>>>>>>> get_user_pages_batch()
>>>>>>>> - Per-range eviction/restore handlers with unified management
>>>>>>>> - Integration with existing userptr eviction/validation flows
>>>>>>>>
>>>>>>>> Patch 4/4: Wire up batch userptr ioctl handler
>>>>>>>> - Ioctl handler with input validation
>>>>>>>> - SVM conflict checking for GPU VA and CPU VA ranges
>>>>>>>> - Integration with kfd_process and process_device
>>>>>>>> infrastructure
>>>>>>>>
>>>>>>>> Performance Comparison
>>>>>>>> ======================
>>>>>>>>
>>>>>>>> Before implementing this patch, we attempted a userspace
>>>>>>>> solution that makes
>>>>>>>> multiple calls to the existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>>>>>>>> ioctl to
>>>>>>>> register non-contiguous VA ranges individually. This approach
>>>>>>>> resulted in
>>>>>>>> severe performance degradation:
>>>>>>>>
>>>>>>>> **Userspace Multiple ioctl Approach:**
>>>>>>>> - Benchmark score: ~80,000 (down from 200,000 on bare metal)
>>>>>>>> - Performance loss: 60% degradation
>>>>>>>>
>>>>>>>> **This Kernel Batch ioctl Approach:**
>>>>>>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>>>>>>> - Performance improvement: 2x-2.4x faster than userspace approach
>>>>>>>> - Achieves near-native performance in virtualized environments
>>>>>>>>
>>>>>>>> The batch registration in kernel avoids the repeated syscall
>>>>>>>> overhead and
>>>>>>>> enables efficient unified management of scattered VA ranges,
>>>>>>>> recovering most
>>>>>>>> of the performance lost to virtualization.
>>>>>>>>
>>>>>>>> Testing Results
>>>>>>>> ===============
>>>>>>>>
>>>>>>>> The series has been tested with:
>>>>>>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>>>>>>> - Various allocation sizes (4KB to 1G+ per range)
>>>>>>>> - GPU compute workloads using the batch-allocated ranges
>>>>>>>> - Memory pressure scenarios and eviction/restore cycles
>>>>>>>> - OpenCL CTS in KVM guest environment
>>>>>>>> - HIP catch tests in KVM guest environment
>>>>>>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized
>>>>>>>> environments
>>>>>>>> - Small LLM inference (3B-7B models) using HuggingFace transformers
>>>>>>>>
>>>>>>>> Corresponding userspace patch
>>>>>>>> =============================
>>>>>>>> Userspace ROCm changes for new ioctl:
>>>>>>>> - libhsakmt: https://github.com/ROCm/rocm-systems/commit/ac21716e5d6f68ec524e50eeef10d1d6ad7eae86
>>>>>>>>
>>>>>>>> Thank you for your review; I look forward to your feedback.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Honglei Huang
>>>>>>>>
>>>>>>>> Honglei Huang (4):
>>>>>>>> drm/amdkfd: Add batch userptr allocation UAPI
>>>>>>>> drm/amdkfd: Extend kgd_mem for batch userptr support
>>>>>>>> drm/amdkfd: Implement batch userptr allocation and management
>>>>>>>> drm/amdkfd: Wire up batch userptr ioctl handler
>>>>>>>>
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 21 +
>>>>>>>> .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 543 ++++++++++++++++++-
>>>>>>>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 159 +++++
>>>>>>>> include/uapi/linux/kfd_ioctl.h | 37 +-
>>>>>>>> 4 files changed, 740 insertions(+), 20 deletions(-)
>>>>>>>>
>>>
>>