Message-ID: <22fe9000-117a-4b14-a51b-1349d01772f0@amd.com>
Date: Tue, 13 Jan 2026 21:40:10 +0800
From: Honglei Huang <honghuan@....com>
To: Felix Kuehling <felix.kuehling@....com>
Cc: dmitry.osipenko@...labora.com, Xinhui.Pan@....com, airlied@...il.com,
daniel@...ll.ch, amd-gfx@...ts.freedesktop.org,
dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, akpm@...ux-foundation.org,
Honglei Huang <honglei1.huang@....com>, alexander.deucher@....com,
Ray.Huang@....com, Christian König
<christian.koenig@....com>
Subject: Re: [PATCH v2 0/4] drm/amdkfd: Add batch userptr allocation support
Hi Felix,
Thank you for the detailed technical guidance. You are absolutely right.
I will follow your suggestion and work on the DRM HMM integration
path you proposed.
That said, I believe moving towards DRM render node APIs will take a
very long time. DRM currently lacks SVM support. SVM is a critical
component that affects almost every aspect of GPU computing, and porting
it to DRM is a massive engineering effort. Realistically, KFD and DRM
will likely need to coexist for the foreseeable future.
For this reason, I will also continue to simplify the current KFD-based
implementation while I work on the long-term DRM solution.
Regards,
Honglei
On 2026/1/13 04:51, Felix Kuehling wrote:
>
> On 2026-01-12 06:55, Honglei Huang wrote:
>>
>> Hi Felix,
>>
>> Thank you for the clarification about the render node transition.
>>
>> I went back and checked the relevant DRM code, and I found that it is
>> missing some infrastructure; it seems that SVM is not supported in
>> DRM.
>>
>> Since most current hardware platforms use the KFD driver, we must
>> rely on the KFD infrastructure to enable this functionality. The DRM
>> stack currently lacks the SVM infrastructure, and building it from
>> scratch is not feasible for immediate deployment needs.
>
> As far as I can tell, you're not using any SVM infrastructure. In fact
> you specifically made the point that SVM wasn't suitable for your
> application because you wanted to map non-contiguous CPU address ranges
> into a contiguous GPU address range. So I don't understand what your
> dependency on SVM infrastructure is here.
>
> The DRM stack uses HMM under the hood for its userptr implementation,
> which should be quite similar to what KFD does. The difference is in the
> MMU notifier handling. I guess that's where some work would be needed so
> that amdgpu_mn_invalidate_range_start_gfx can invoke
> amdgpu_amdkfd_evict_userptr to stop usermode queues. Or maybe some
> allocation flag in the userptr BO that tells amdgpu_hmm_register to hook
> up the HSA MMU notifier.
>
> And then you'd need to add support to the
> amdgpu_amdkfd_restore_userptr_worker to validate and map userptr BOs
> managed through the GEM API.
>
> I'm not saying this is easy. I spent months trying to get this to work
> reliably for DMABuf imports a few years ago.
>
> Regards,
> Felix
>
>
>>
>> Therefore, I plan to continue with my previous direction to find a
>> "minimal impact" technical solution within KFD.
>> Regards,
>> Honglei
>>
>> On 2026/1/10 10:28, Honglei Huang wrote:
>>>
>>> Hi Felix,
>>>
>>> You're right - I understand now that the render node transition is
>>> already underway.
>>> Appreciate the clarification.
>>>
>>> Regards,
>>> Honglei
>>>
>>>
>>> On 2026/1/10 05:14, Kuehling, Felix wrote:
>>>> FWIW, ROCr already uses rendernode APIs for our implementation of
>>>> the CUDA VM API (DMABuf imports into rendernode contexts that share
>>>> the VA space with KFD and VA mappings with more flexibility than
>>>> what we have in the KFD API). So the transition to render node APIs
>>>> has already started, especially in the memory management area. It's
>>>> not some far-off future thing.
>>>>
>>>> Regards,
>>>> Felix
>>>>
>>>> On 2026-01-09 04:07, Christian König wrote:
>>>>> Hi Honglei,
>>>>>
>>>>> I have to agree with Felix. Adding such complexity to the KFD API
>>>>> is a clear no-go from my side.
>>>>>
>>>>> Just skimming over the patch, it's obvious that this isn't correctly
>>>>> implemented. You simply can't handle the MMU notifier ranges like this.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> On 1/9/26 08:55, Honglei Huang wrote:
>>>>>> Hi Felix,
>>>>>>
>>>>>> Thank you for the feedback. I understand your concern about API
>>>>>> maintenance.
>>>>>>
>>>>>> From what I can see, KFD is still the core driver for all GPU
>>>>>> compute workloads. The entire compute ecosystem is built on KFD's
>>>>>> infrastructure and continues to rely on it. While the unification
>>>>>> work is ongoing, any transition to DRM render node APIs would
>>>>>> naturally take considerable time, and KFD is expected to remain
>>>>>> the primary interface for compute for the foreseeable future. This
>>>>>> batch allocation issue is affecting performance in some specific
>>>>>> computing scenarios.
>>>>>>
>>>>>> You're absolutely right about the API proliferation concern. Based
>>>>>> on your feedback, I'd like to revise the approach for v3 to
>>>>>> minimize impact by reusing the existing ioctl instead of adding a
>>>>>> new API:
>>>>>>
>>>>>> - Reuse existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU ioctl
>>>>>> - Add one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>>>>>> - When flag is set, mmap_offset field points to range array
>>>>>> - No new ioctl command, no new structure
>>>>>>
>>>>>> This changes the API surface from adding a new ioctl to adding
>>>>>> just one flag.
>>>>>>
>>>>>> In fact, the implementation modifies DRM's GPU memory management
>>>>>> infrastructure in amdgpu_amdkfd_gpuvm.c. If DRM render nodes need
>>>>>> similar functionality later, these functions could be reused
>>>>>> directly.
>>>>>>
>>>>>> Would you be willing to review v3 with this approach?
>>>>>>
>>>>>> Regards,
>>>>>> Honglei Huang
>>>>>>
>>>>>> On 2026/1/9 03:46, Felix Kuehling wrote:
>>>>>>> I don't have time to review this in detail right now. I am
>>>>>>> concerned about adding new KFD API, when the trend is moving
>>>>>>> towards DRM render node APIs. This creates additional burden for
>>>>>>> ongoing support of these APIs in addition to the inevitable DRM
>>>>>>> render node duplicates we'll have in the future. Would it be
>>>>>>> possible to implement this batch userptr allocation in a render
>>>>>>> node API from the start?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Felix
>>>>>>>
>>>>>>>
>>>>>>> On 2026-01-04 02:21, Honglei Huang wrote:
>>>>>>>> From: Honglei Huang <honghuan@....com>
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> This is v2 of the patch series to support allocating multiple
>>>>>>>> non-contiguous
>>>>>>>> CPU virtual address ranges that map to a single contiguous GPU
>>>>>>>> virtual address.
>>>>>>>>
>>>>>>>> **Key improvements over v1:**
>>>>>>>> - NO memory pinning: uses HMM for page tracking, pages can be
>>>>>>>> swapped/migrated
>>>>>>>> - NO impact on SVM subsystem: avoids complexity during KFD/KGD
>>>>>>>> unification
>>>>>>>> - Better approach: userptr's VA remapping design is ideal for
>>>>>>>> scattered VA registration
>>>>>>>>
>>>>>>>> Based on community feedback, v2 takes a completely different
>>>>>>>> implementation
>>>>>>>> approach by leveraging the existing userptr infrastructure
>>>>>>>> rather than
>>>>>>>> introducing new SVM-based mechanisms that required memory pinning.
>>>>>>>>
>>>>>>>> Changes from v1
>>>>>>>> ===============
>>>>>>>>
>>>>>>>> v1 attempted to solve this problem through the SVM subsystem by:
>>>>>>>> - Adding a new AMDKFD_IOC_SVM_RANGES ioctl for batch SVM range
>>>>>>>> registration
>>>>>>>> - Introducing KFD_IOCTL_SVM_ATTR_MAPPED attribute for special
>>>>>>>> VMA handling
>>>>>>>> - Using pin_user_pages_fast() to pin scattered memory ranges
>>>>>>>> - Registering multiple SVM ranges with pinned pages
>>>>>>>>
>>>>>>>> This approach had significant drawbacks:
>>>>>>>> 1. Memory pinning defeated the purpose of HMM-based SVM's
>>>>>>>> on-demand paging
>>>>>>>> 2. Added complexity to the SVM subsystem
>>>>>>>> 3. Prevented memory oversubscription and dynamic migration
>>>>>>>> 4. Could cause memory pressure due to locked pages
>>>>>>>> 5. Interfered with NUMA optimization and page migration
>>>>>>>>
>>>>>>>> v2 Implementation Approach
>>>>>>>> ==========================
>>>>>>>>
>>>>>>>> 1. **No memory pinning required**
>>>>>>>> - Uses HMM (Heterogeneous Memory Management) for page tracking
>>>>>>>> - Pages are NOT pinned, can be swapped/migrated when not in
>>>>>>>> use
>>>>>>>> - Supports dynamic page eviction and on-demand restore like
>>>>>>>> standard userptr
>>>>>>>>
>>>>>>>> 2. **Zero impact on KFD SVM subsystem**
>>>>>>>> - Extends ALLOC_MEMORY_OF_GPU path, not SVM
>>>>>>>> - New ioctl: AMDKFD_IOC_ALLOC_MEMORY_OF_GPU_BATCH
>>>>>>>> - Zero changes to SVM code, limited scope of changes
>>>>>>>>
>>>>>>>> 3. **Perfect fit for non-contiguous VA registration**
>>>>>>>> - Userptr design naturally supports GPU VA != CPU VA mapping
>>>>>>>> - Multiple non-contiguous CPU VA ranges -> single
>>>>>>>> contiguous GPU VA
>>>>>>>> - Unlike KFD SVM, which maintains VA identity, userptr
>>>>>>>> allows remapping. This VA remapping capability makes
>>>>>>>> userptr ideal for scattered allocations
>>>>>>>>
>>>>>>>> **Implementation Details:**
>>>>>>>> - Each CPU VA range gets its own mmu_interval_notifier for
>>>>>>>> invalidation
>>>>>>>> - All ranges validated together and mapped to contiguous
>>>>>>>> GPU VA
>>>>>>>> - Single kgd_mem object with array of user_range_info
>>>>>>>> structures
>>>>>>>> - Unified eviction/restore path for all ranges in a batch
>>>>>>>>
>>>>>>>> Patch Series Overview
>>>>>>>> =====================
>>>>>>>>
>>>>>>>> Patch 1/4: Add AMDKFD_IOC_ALLOC_MEMORY_OF_GPU_BATCH ioctl and
>>>>>>>> data structures
>>>>>>>> - New ioctl command and kfd_ioctl_userptr_range structure
>>>>>>>> - UAPI for userspace to request batch userptr allocation
>>>>>>>>
>>>>>>>> Patch 2/4: Extend kgd_mem for batch userptr support
>>>>>>>> - Add user_range_info and associated fields to kgd_mem
>>>>>>>> - Data structures for tracking multiple ranges per allocation
>>>>>>>>
>>>>>>>> Patch 3/4: Implement batch userptr allocation and management
>>>>>>>> - Core functions: init_user_pages_batch(),
>>>>>>>> get_user_pages_batch()
>>>>>>>> - Per-range eviction/restore handlers with unified management
>>>>>>>> - Integration with existing userptr eviction/validation flows
>>>>>>>>
>>>>>>>> Patch 4/4: Wire up batch userptr ioctl handler
>>>>>>>> - Ioctl handler with input validation
>>>>>>>> - SVM conflict checking for GPU VA and CPU VA ranges
>>>>>>>> - Integration with kfd_process and process_device
>>>>>>>> infrastructure
>>>>>>>>
>>>>>>>> Performance Comparison
>>>>>>>> ======================
>>>>>>>>
>>>>>>>> Before implementing this patch, we attempted a userspace
>>>>>>>> solution that makes
>>>>>>>> multiple calls to the existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>>>>>>>> ioctl to
>>>>>>>> register non-contiguous VA ranges individually. This approach
>>>>>>>> resulted in
>>>>>>>> severe performance degradation:
>>>>>>>>
>>>>>>>> **Userspace Multiple ioctl Approach:**
>>>>>>>> - Benchmark score: ~80,000 (down from 200,000 on bare metal)
>>>>>>>> - Performance loss: 60% degradation
>>>>>>>>
>>>>>>>> **This Kernel Batch ioctl Approach:**
>>>>>>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>>>>>>> - Performance improvement: 2x-2.4x faster than userspace approach
>>>>>>>> - Achieves near-native performance in virtualized environments
>>>>>>>>
>>>>>>>> The batch registration in kernel avoids the repeated syscall
>>>>>>>> overhead and
>>>>>>>> enables efficient unified management of scattered VA ranges,
>>>>>>>> recovering most
>>>>>>>> of the performance lost to virtualization.
>>>>>>>>
>>>>>>>> Testing Results
>>>>>>>> ===============
>>>>>>>>
>>>>>>>> The series has been tested with:
>>>>>>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>>>>>>> - Various allocation sizes (4KB to 1G+ per range)
>>>>>>>> - GPU compute workloads using the batch-allocated ranges
>>>>>>>> - Memory pressure scenarios and eviction/restore cycles
>>>>>>>> - OpenCL CTS in KVM guest environment
>>>>>>>> - HIP catch tests in KVM guest environment
>>>>>>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized
>>>>>>>> environments
>>>>>>>> - Small LLM inference (3B-7B models) using HuggingFace transformers
>>>>>>>>
>>>>>>>> Corresponding userspace patch
>>>>>>>> =============================
>>>>>>>> Userspace ROCm changes for new ioctl:
>>>>>>>> - libhsakmt: https://github.com/ROCm/rocm-systems/commit/ac21716e5d6f68ec524e50eeef10d1d6ad7eae86
>>>>>>>>
>>>>>>>> Thank you for your review; I look forward to your feedback.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Honglei Huang
>>>>>>>>
>>>>>>>> Honglei Huang (4):
>>>>>>>> drm/amdkfd: Add batch userptr allocation UAPI
>>>>>>>> drm/amdkfd: Extend kgd_mem for batch userptr support
>>>>>>>> drm/amdkfd: Implement batch userptr allocation and management
>>>>>>>> drm/amdkfd: Wire up batch userptr ioctl handler
>>>>>>>>
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 21 +
>>>>>>>> .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 543 ++++++++++++++++++-
>>>>>>>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 159 +++++
>>>>>>>> include/uapi/linux/kfd_ioctl.h | 37 +-
>>>>>>>> 4 files changed, 740 insertions(+), 20 deletions(-)
>>>>>>>>
>>>
>>