Message-ID: <dc1f5de7-40c4-4649-8f2f-0fee4b540783@amd.com>
Date: Sat, 10 Jan 2026 10:28:49 +0800
From: Honglei Huang <honghuan@....com>
To: "Kuehling, Felix" <felix.kuehling@....com>,
 Christian König <christian.koenig@....com>
Cc: dmitry.osipenko@...labora.com, Xinhui.Pan@....com, airlied@...il.com,
 daniel@...ll.ch, amd-gfx@...ts.freedesktop.org,
 dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
 linux-mm@...ck.org, akpm@...ux-foundation.org,
 Honglei Huang <honglei1.huang@....com>, alexander.deucher@....com,
 Ray.Huang@....com
Subject: Re: [PATCH v2 0/4] drm/amdkfd: Add batch userptr allocation support


Hi Felix,

You're right - I understand now that the render node transition is already
underway, especially in the memory management area. Appreciate the
clarification.

Regards,
Honglei


On 2026/1/10 05:14, Kuehling, Felix wrote:
> FWIW, ROCr already uses rendernode APIs for our implementation of the 
> CUDA VM API (DMABuf imports into rendernode contexts that share the VA 
> space with KFD and VA mappings with more flexibility than what we have 
> in the KFD API). So the transition to render node APIs has already 
> started, especially in the memory management area. It's not some far-off 
> future thing.
> 
> Regards,
>    Felix
> 
> On 2026-01-09 04:07, Christian König wrote:
>> Hi Honglei,
>>
>> I have to agree with Felix. Adding such complexity to the KFD API is a 
>> clear no-go from my side.
>>
>> Just skimming over the patch it's obvious that this isn't correctly
>> implemented. You simply can't use the MMU notifier ranges like this.
>>
>> Regards,
>> Christian.
>>
>> On 1/9/26 08:55, Honglei Huang wrote:
>>> Hi Felix,
>>>
>>> Thank you for the feedback. I understand your concern about API 
>>> maintenance.
>>>
>>> From what I can see, KFD is still the core driver for all GPU
>>> compute workloads. The entire compute ecosystem is built on KFD's 
>>> infrastructure and continues to rely on it. While the unification 
>>> work is ongoing, any transition to DRM render node APIs would 
>>> naturally take considerable time, and KFD is expected to remain the 
>>> primary interface for compute for the foreseeable future. This batch 
>>> allocation issue is affecting performance in some specific computing 
>>> scenarios.
>>>
>>> You're absolutely right about the API proliferation concern. Based on 
>>> your feedback, I'd like to revise the approach for v3 to minimize 
>>> impact by reusing the existing ioctl instead of adding a new API:
>>>
>>> - Reuse existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU ioctl
>>> - Add one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>>> - When flag is set, mmap_offset field points to range array
>>> - No new ioctl command, no new structure
>>>
>>> This changes the API surface from adding a new ioctl to adding just 
>>> one flag.
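>>>
>>> To make the proposed flow concrete, here is a rough sketch of the v3
>>> userspace call (everything except the new flag already exists in the
>>> UAPI; how the range count is conveyed is still an open detail):
>>>
>>>     /* ranges[] is the per-range array described above */
>>>     struct kfd_ioctl_alloc_memory_of_gpu_args args = {0};
>>>
>>>     args.va_addr = gpu_va;          /* one contiguous GPU VA */
>>>     args.size = total_size;         /* sum of all range sizes */
>>>     args.gpu_id = gpu_id;
>>>     args.flags = KFD_IOC_ALLOC_MEM_FLAGS_USERPTR |
>>>                  KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH; /* new flag */
>>>     /* with the batch flag set, mmap_offset carries a pointer to the
>>>      * range array instead of the single userptr address */
>>>     args.mmap_offset = (uint64_t)(uintptr_t)ranges;
>>>
>>>     ret = ioctl(kfd_fd, AMDKFD_IOC_ALLOC_MEMORY_OF_GPU, &args);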
>>>
>>> In fact, the implementation modifies DRM's GPU memory management
>>> infrastructure in amdgpu_amdkfd_gpuvm.c, so if the DRM render node
>>> needs similar functionality later, these functions could be reused
>>> directly.
>>>
>>> Would you be willing to review v3 with this approach?
>>>
>>> Regards,
>>> Honglei Huang
>>>
>>> On 2026/1/9 03:46, Felix Kuehling wrote:
>>>> I don't have time to review this in detail right now. I am concerned 
>>>> about adding new KFD API, when the trend is moving towards DRM 
>>>> render node APIs. This creates additional burden for ongoing support 
>>>> of these APIs in addition to the inevitable DRM render node 
>>>> duplicates we'll have in the future. Would it be possible to 
>>>> implement this batch userptr allocation in a render node API from 
>>>> the start?
>>>>
>>>> Regards,
>>>>     Felix
>>>>
>>>>
>>>> On 2026-01-04 02:21, Honglei Huang wrote:
>>>>> From: Honglei Huang <honghuan@....com>
>>>>>
>>>>> Hi all,
>>>>>
>>>>> This is v2 of the patch series to support allocating multiple
>>>>> non-contiguous CPU virtual address ranges that map to a single
>>>>> contiguous GPU virtual address.
>>>>>
>>>>> **Key improvements over v1:**
>>>>> - NO memory pinning: uses HMM for page tracking, pages can be
>>>>> swapped/migrated
>>>>> - NO impact on SVM subsystem: avoids complexity during KFD/KGD 
>>>>> unification
>>>>> - Better approach: userptr's VA remapping design is ideal for 
>>>>> scattered VA registration
>>>>>
>>>>> Based on community feedback, v2 takes a completely different 
>>>>> implementation
>>>>> approach by leveraging the existing userptr infrastructure rather than
>>>>> introducing new SVM-based mechanisms that required memory pinning.
>>>>>
>>>>> Changes from v1
>>>>> ===============
>>>>>
>>>>> v1 attempted to solve this problem through the SVM subsystem by:
>>>>> - Adding a new AMDKFD_IOC_SVM_RANGES ioctl for batch SVM range 
>>>>> registration
>>>>> - Introducing KFD_IOCTL_SVM_ATTR_MAPPED attribute for special VMA 
>>>>> handling
>>>>> - Using pin_user_pages_fast() to pin scattered memory ranges
>>>>> - Registering multiple SVM ranges with pinned pages
>>>>>
>>>>> This approach had significant drawbacks:
>>>>> 1. Memory pinning defeated the purpose of HMM-based SVM's on-demand 
>>>>> paging
>>>>> 2. Added complexity to the SVM subsystem
>>>>> 3. Prevented memory oversubscription and dynamic migration
>>>>> 4. Could cause memory pressure due to locked pages
>>>>> 5. Interfered with NUMA optimization and page migration
>>>>>
>>>>> v2 Implementation Approach
>>>>> ==========================
>>>>>
>>>>> 1. **No memory pinning required**
>>>>>      - Uses HMM (Heterogeneous Memory Management) for page tracking
>>>>>      - Pages are NOT pinned, can be swapped/migrated when not in use
>>>>>      - Supports dynamic page eviction and on-demand restore like 
>>>>> standard userptr
>>>>>
>>>>> 2. **Zero impact on KFD SVM subsystem**
>>>>>      - Extends ALLOC_MEMORY_OF_GPU path, not SVM
>>>>>      - New ioctl: AMDKFD_IOC_ALLOC_MEMORY_OF_GPU_BATCH
>>>>>      - Zero changes to SVM code, limited scope of changes
>>>>>
>>>>> 3. **Perfect fit for non-contiguous VA registration**
>>>>>      - Userptr design naturally supports GPU VA != CPU VA mapping
>>>>>      - Multiple non-contiguous CPU VA ranges -> single contiguous
>>>>> GPU VA
>>>>>      - Unlike KFD SVM, which maintains VA identity, userptr allows
>>>>> remapping; this VA remapping capability makes userptr ideal for
>>>>> scattered allocations
>>>>>
>>>>> **Implementation Details:**
>>>>>      - Each CPU VA range gets its own mmu_interval_notifier for 
>>>>> invalidation
>>>>>      - All ranges validated together and mapped to contiguous GPU VA
>>>>>      - Single kgd_mem object with array of user_range_info structures
>>>>>      - Unified eviction/restore path for all ranges in a batch
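>>>>>
>>>>> To make the per-range tracking concrete, a condensed sketch follows
>>>>> (field and helper names are simplified from the actual patches;
>>>>> batch_userptr_ops stands in for the per-range notifier ops):
>>>>>
>>>>>     /* one entry per scattered CPU VA range */
>>>>>     struct user_range_info {
>>>>>             u64 start;                             /* CPU VA, page aligned */
>>>>>             u64 npages;
>>>>>             struct mmu_interval_notifier notifier; /* per-range invalidation */
>>>>>             unsigned long *pfns;                   /* HMM pfns, never pinned */
>>>>>     };
>>>>>
>>>>>     /* registration: one interval notifier per range, all owned by
>>>>>      * the single kgd_mem that backs the whole batch */
>>>>>     static int register_batch_ranges(struct kgd_mem *mem,
>>>>>                                      struct mm_struct *mm)
>>>>>     {
>>>>>             int i, ret;
>>>>>
>>>>>             for (i = 0; i < mem->n_ranges; i++) {
>>>>>                     struct user_range_info *r = &mem->ranges[i];
>>>>>
>>>>>                     ret = mmu_interval_notifier_insert(&r->notifier, mm,
>>>>>                                     r->start, r->npages << PAGE_SHIFT,
>>>>>                                     &batch_userptr_ops);
>>>>>                     if (ret)
>>>>>                             goto unwind;
>>>>>             }
>>>>>             return 0;
>>>>>     unwind:
>>>>>             while (--i >= 0)
>>>>>                     mmu_interval_notifier_remove(&mem->ranges[i].notifier);
>>>>>             return ret;
>>>>>     }
>>>>>
>>>>> Validation then follows the standard HMM pattern for every range:
>>>>> hmm_range_fault() under mmap_read_lock(), map all ranges into the one
>>>>> contiguous GPU VA, and redo the batch if mmu_interval_read_retry()
>>>>> shows that any range was invalidated in the meantime.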
>>>>>
>>>>> Patch Series Overview
>>>>> =====================
>>>>>
>>>>> Patch 1/4: Add AMDKFD_IOC_ALLOC_MEMORY_OF_GPU_BATCH ioctl and data
>>>>> structures (UAPI sketched after this list)
>>>>>       - New ioctl command and kfd_ioctl_userptr_range structure
>>>>>       - UAPI for userspace to request batch userptr allocation
>>>>>
>>>>> Patch 2/4: Extend kgd_mem for batch userptr support
>>>>>       - Add user_range_info and associated fields to kgd_mem
>>>>>       - Data structures for tracking multiple ranges per allocation
>>>>>
>>>>> Patch 3/4: Implement batch userptr allocation and management
>>>>>       - Core functions: init_user_pages_batch(), 
>>>>> get_user_pages_batch()
>>>>>       - Per-range eviction/restore handlers with unified management
>>>>>       - Integration with existing userptr eviction/validation flows
>>>>>
>>>>> Patch 4/4: Wire up batch userptr ioctl handler
>>>>>       - Ioctl handler with input validation
>>>>>       - SVM conflict checking for GPU VA and CPU VA ranges
>>>>>       - Integration with kfd_process and process_device infrastructure
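>>>>>
>>>>> For reference, roughly how the new UAPI (patch 1/4) and the handler
>>>>> validation (patch 4/4) fit together; struct names, field layout and
>>>>> limits here are illustrative, not the final UAPI:
>>>>>
>>>>>     struct kfd_ioctl_userptr_range {        /* layout illustrative */
>>>>>             __u64 start_addr;               /* CPU VA, page aligned */
>>>>>             __u64 size;                     /* bytes, page aligned */
>>>>>     };
>>>>>
>>>>>     /* handler skeleton following the kfd_chardev.c conventions */
>>>>>     static int kfd_ioctl_alloc_memory_of_gpu_batch(struct file *filep,
>>>>>                             struct kfd_process *p, void *data)
>>>>>     {
>>>>>             struct kfd_ioctl_alloc_memory_of_gpu_batch_args *args = data;
>>>>>             struct kfd_ioctl_userptr_range *ranges;
>>>>>             int err;
>>>>>
>>>>>             if (!args->n_ranges || args->n_ranges > KFD_MAX_BATCH_RANGES)
>>>>>                     return -EINVAL;
>>>>>
>>>>>             ranges = memdup_array_user(u64_to_user_ptr(args->ranges_ptr),
>>>>>                                        args->n_ranges, sizeof(*ranges));
>>>>>             if (IS_ERR(ranges))
>>>>>                     return PTR_ERR(ranges);
>>>>>
>>>>>             /* per-range alignment checks and SVM VA-conflict checks
>>>>>              * go here, then the gpuvm batch allocation path */
>>>>>             err = batch_alloc_and_map(p, args, ranges); /* illustrative */
>>>>>
>>>>>             kfree(ranges);
>>>>>             return err;
>>>>>     }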
>>>>>
>>>>> Performance Comparison
>>>>> ======================
>>>>>
>>>>> Before implementing this patch, we attempted a userspace solution 
>>>>> that makes
>>>>> multiple calls to the existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU ioctl to
>>>>> register non-contiguous VA ranges individually. This approach 
>>>>> resulted in
>>>>> severe performance degradation:
>>>>>
>>>>> **Userspace Multiple ioctl Approach:**
>>>>> - Benchmark score: ~80,000 (down from 200,000 on bare metal)
>>>>> - Performance loss: 60% degradation
>>>>>
>>>>> **This Kernel Batch ioctl Approach:**
>>>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>>>> - Performance improvement: 2x-2.4x faster than userspace approach
>>>>> - Achieves near-native performance in virtualized environments
>>>>>
>>>>> The batch registration in kernel avoids the repeated syscall 
>>>>> overhead and
>>>>> enables efficient unified management of scattered VA ranges, 
>>>>> recovering most
>>>>> of the performance lost to virtualization.
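>>>>>
>>>>> In code terms the difference is N user/kernel transitions versus one
>>>>> (and in a guest each ioctl also crosses the virtualization boundary):
>>>>>
>>>>>     /* before: one ioctl per scattered range */
>>>>>     for (i = 0; i < n_ranges; i++)
>>>>>             ioctl(kfd_fd, AMDKFD_IOC_ALLOC_MEMORY_OF_GPU, &args[i]);
>>>>>
>>>>>     /* after: a single ioctl registers every range and maps them to
>>>>>      * one contiguous GPU VA; with thousands of ranges the saved
>>>>>      * round trips dominate */
>>>>>     ioctl(kfd_fd, AMDKFD_IOC_ALLOC_MEMORY_OF_GPU_BATCH, &batch_args);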
>>>>>
>>>>> Testing Results
>>>>> ===============
>>>>>
>>>>> The series has been tested with:
>>>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>>>> - Various allocation sizes (4KB to 1G+ per range)
>>>>> - GPU compute workloads using the batch-allocated ranges
>>>>> - Memory pressure scenarios and eviction/restore cycles
>>>>> - OpenCL CTS in KVM guest environment
>>>>> - HIP catch tests in KVM guest environment
>>>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>>>>> - Small LLM inference (3B-7B models) using HuggingFace transformers
>>>>>
>>>>> Corresponding userspace patch
>>>>> ==============================
>>>>> Userspace ROCm changes for new ioctl:
>>>>> - libhsakmt: https://github.com/ROCm/rocm-systems/commit/ac21716e5d6f68ec524e50eeef10d1d6ad7eae86
>>>>>
>>>>> Thank you for the review; I look forward to your feedback.
>>>>>
>>>>> Best regards,
>>>>> Honglei Huang
>>>>>
>>>>> Honglei Huang (4):
>>>>>     drm/amdkfd: Add batch userptr allocation UAPI
>>>>>     drm/amdkfd: Extend kgd_mem for batch userptr support
>>>>>     drm/amdkfd: Implement batch userptr allocation and management
>>>>>     drm/amdkfd: Wire up batch userptr ioctl handler
>>>>>
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |  21 +
>>>>>    .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 543 +++++++++++++++++-
>>>>>    drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 159 +++++
>>>>>    include/uapi/linux/kfd_ioctl.h                |  37 +-
>>>>>    4 files changed, 740 insertions(+), 20 deletions(-)
>>>>>

