linux-kernel - Re: [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support

Message-ID: <38264429-a256-4c2f-bcfd-8a021d9603b2@amd.com>
Date: Mon, 9 Feb 2026 20:52:36 +0800
From: Honglei Huang <honghuan@....com>
To: Christian König <christian.koenig@....com>
Cc: Felix.Kuehling@....com, Philip.Yang@....com, Ray.Huang@....com,
 alexander.deucher@....com, dmitry.osipenko@...labora.com,
 Xinhui.Pan@....com, airlied@...il.com, daniel@...ll.ch,
 amd-gfx@...ts.freedesktop.org, dri-devel@...ts.freedesktop.org,
 linux-kernel@...r.kernel.org, linux-mm@...ck.org, akpm@...ux-foundation.org
Subject: Re: [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support


DRM GPU SVM does use hmm_range_fault(); see drm_gpusvm_get_pages().

My implementation follows the same pattern. A detailed comparison of
the invalidation paths is in the second half of my previous mail.

On 2026/2/9 18:16, Christian König wrote:
> On 2/9/26 07:14, Honglei Huang wrote:
>>
>> I've reworked the implementation in v4. The fix is actually inspired
>> by the DRM GPU SVM framework (drivers/gpu/drm/drm_gpusvm.c).
>>
>> DRM GPU SVM uses wide notifiers (recommended 512M or larger) to track
>> multiple user virtual address ranges under a single mmu_interval_notifier,
>> and these ranges can be non-contiguous. That is essentially the same
>> problem that batch userptr needs to solve: one BO backed by multiple
>> non-contiguous CPU VA ranges sharing one notifier.
> 
> That still doesn't solve the sequencing problem.
> 
> As far as I can see you can't use hmm_range_fault with this approach or it would just not be very valuable.
> 
> So how should that work with your patch set?
> 
> Regards,
> Christian.
> 
>>
>> The wide notifier is created in drm_gpusvm_notifier_alloc:
>>    notifier->itree.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
>>    notifier->itree.last = ALIGN(fault_addr + 1, gpusvm->notifier_size) - 1;
>> The Xe driver passes
>>    xe_modparam.svm_notifier_size * SZ_1M in xe_svm_init
>> as the notifier_size, so one notifier can cover many MB of VA space
>> containing multiple non-contiguous ranges.
>>
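As a worked example of the notifier bounds computed above, here is a
small userspace sketch. ALIGN_DOWN_POW2 and ALIGN_UP_POW2 are stand-ins
defined here for the kernel's ALIGN_DOWN()/ALIGN(), assuming a
power-of-two notifier size; struct notifier_span, notifier_span_for()
and the example values are illustrative and not taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's ALIGN_DOWN()/ALIGN();
 * only valid when 'a' is a power of two. */
#define ALIGN_DOWN_POW2(x, a) ((x) & ~((uint64_t)(a) - 1))
#define ALIGN_UP_POW2(x, a)   (((x) + ((uint64_t)(a) - 1)) & ~((uint64_t)(a) - 1))

/* Hypothetical container for the inclusive [start, last] notifier span. */
struct notifier_span {
	uint64_t start;
	uint64_t last;
};

/* Mirrors the two assignments quoted above for a given fault address. */
static struct notifier_span notifier_span_for(uint64_t fault_addr,
					      uint64_t notifier_size)
{
	struct notifier_span s;

	/* The bit tricks below require a non-zero power-of-two size. */
	assert(notifier_size != 0 &&
	       (notifier_size & (notifier_size - 1)) == 0);

	s.start = ALIGN_DOWN_POW2(fault_addr, notifier_size);
	s.last = ALIGN_UP_POW2(fault_addr + 1, notifier_size) - 1;
	return s;
}
```

With a 512 MiB notifier size, a fault at 0x7f0012345000 gives the span
[0x7f0000000000, 0x7f001fffffff], so every range that falls inside that
512 MiB window shares the one notifier.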
>> DRM GPU SVM solves the per-range validity problem with flag-based
>> validation instead of seq-based validation:
>>    - drm_gpusvm_pages_valid() checks
>>        flags.has_dma_mapping
>>      not notifier_seq. The comment explicitly states:
>>        "This is akin to a notifier seqno check in the HMM documentation
>>         but due to wider notifiers (i.e., notifiers which span multiple
>>         ranges) this function is required for finer grained checking"
>>    - __drm_gpusvm_unmap_pages() clears
>>        flags.has_dma_mapping = false  under notifier_lock
>>    - drm_gpusvm_get_pages() sets
>>        flags.has_dma_mapping = true  under notifier_lock
>> I adopted the same approach.
>>
>> DRM GPU SVM:
>>    drm_gpusvm_notifier_invalidate()
>>      down_write(&gpusvm->notifier_lock);
>>      mmu_interval_set_seq(mni, cur_seq);
>>      gpusvm->ops->invalidate()
>>        -> xe_svm_invalidate()
>>           drm_gpusvm_for_each_range()
>>             -> __drm_gpusvm_unmap_pages()
>>                WRITE_ONCE(flags.has_dma_mapping = false);  // clear flag
>>      up_write(&gpusvm->notifier_lock);
>>
>> KFD batch userptr:
>>    amdgpu_amdkfd_evict_userptr_batch()
>>      mutex_lock(&process_info->notifier_lock);
>>      mmu_interval_set_seq(mni, cur_seq);
>>      discard_invalid_ranges()
>>        interval_tree_iter_first/next()
>>          range_info->valid = false;          // clear flag
>>      mutex_unlock(&process_info->notifier_lock);
>>
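To make the per-range flag handling concrete, here is a minimal
userspace model of the invalidation step. It is only a sketch: the
sim_* names are hypothetical, a linear scan stands in for the interval
tree, the invalidated span uses an inclusive last address, and the
caller is assumed to already hold the notifier lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-range state; field names are illustrative only. */
struct sim_range {
	uint64_t start;	/* first byte of the CPU VA range */
	uint64_t last;	/* last byte of the range, inclusive */
	bool valid;	/* per-range flag, cleared on invalidation */
};

struct sim_batch {
	struct sim_range *ranges;
	size_t nr_ranges;
	unsigned long notifier_seq;	/* shared by every range in the batch */
};

/*
 * Models the body of the invalidation callback. The caller is assumed
 * to hold the notifier lock. Returns how many ranges overlap the
 * invalidated span [inv_start, inv_last].
 */
static size_t sim_invalidate(struct sim_batch *b, uint64_t inv_start,
			     uint64_t inv_last, unsigned long cur_seq)
{
	size_t i, hit = 0;

	assert(inv_start <= inv_last);

	/* Stands in for mmu_interval_set_seq(). */
	b->notifier_seq = cur_seq;

	/* Linear scan here; the real code walks an interval tree. */
	for (i = 0; i < b->nr_ranges; i++) {
		struct sim_range *r = &b->ranges[i];

		if (r->start <= inv_last && inv_start <= r->last) {
			r->valid = false;	/* only overlapping ranges */
			hit++;
		}
	}
	return hit;
}
```

The shared sequence number changes on every invalidation inside the
notifier span, but only ranges that actually overlap the invalidated
span lose their valid flag.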
>> Both implementations:
>>    - Acquire notifier_lock FIRST, before any flag changes
>>    - Call mmu_interval_set_seq() under the lock
>>    - Use interval tree to find affected ranges within the wide notifier
>>    - Mark per-range flag as invalid/valid under the lock
>>
>> The page fault path and final validation path also follow the same
>> pattern as DRM GPU SVM: fault outside the lock, set/check per-range
>> flag under the lock.
>>
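The "fault outside the lock, check under the lock" flow can be sketched
in userspace as a retry loop. This is a simplified model, not the driver
code: the sim_* names are hypothetical, the sequence comparison stands
in for mmu_interval_read_begin()/mmu_interval_read_retry(), the lock is
only marked by comments, and the fault hook is a test helper that
simulates a racing invalidation.

```c
#include <assert.h>
#include <stdbool.h>

struct sim_notifier {
	unsigned long seq;	/* bumped by every simulated invalidation */
};

struct sim_range_state {
	bool valid;		/* per-range flag */
};

/* Hypothetical hook standing in for the page fault step. */
typedef void (*sim_fault_fn)(struct sim_notifier *n, void *ctx);

/*
 * Returns the number of attempts needed to validate the range, or -1
 * if max_tries attempts all raced with an invalidation.
 */
static int sim_validate_range(struct sim_notifier *n,
			      struct sim_range_state *r,
			      sim_fault_fn fault, void *ctx, int max_tries)
{
	int tries;

	assert(max_tries > 0);

	for (tries = 1; tries <= max_tries; tries++) {
		/* Snapshot the sequence, like mmu_interval_read_begin(). */
		unsigned long seq = n->seq;

		/* Fault pages outside the lock; this step may race. */
		fault(n, ctx);

		/* The notifier lock would be taken here. */
		if (n->seq == seq) {
			/* No invalidation raced: mark the range valid. */
			r->valid = true;
			return tries;
		}
		/* Lock dropped; an invalidation raced, so retry. */
	}
	return -1;
}

/* Example hook: races with an invalidation for the first *ctx calls. */
static void sim_fault_with_races(struct sim_notifier *n, void *ctx)
{
	int *races_left = ctx;

	if (*races_left > 0) {
		(*races_left)--;
		n->seq += 2;
	}
}
```

With one simulated race the range is validated on the second attempt.
If every attempt races, the flag stays false and the caller sees -1.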
>> Regards,
>> Honglei
>>
>>
>> On 2026/2/6 21:56, Christian König wrote:
>>> On 2/6/26 07:25, Honglei Huang wrote:
>>>> From: Honglei Huang <honghuan@....com>
>>>>
>>>> Hi all,
>>>>
>>>> This is v3 of the patch series to support allocating multiple non-contiguous
>>>> CPU virtual address ranges that map to a single contiguous GPU virtual address.
>>>>
>>>> v3:
>>>> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>>>>      - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>>>
>>> That is most likely not the best approach, but Felix or Philip need to comment here since I don't know such IOCTLs well either.
>>>
>>>>      - When flag is set, mmap_offset field points to range array
>>>>      - Minimal API surface change
>>>
>>> Why range of VA space for each entry?
>>>
>>>> 2. Improved MMU notifier handling:
>>>>      - Single mmu_interval_notifier covering the VA span [va_min, va_max]
>>>>      - Interval tree for efficient lookup of affected ranges during invalidation
>>>>      - Avoids per-range notifier overhead mentioned in v2 review
>>>
>>> That won't work unless you also modify hmm_range_fault() to take multiple VA addresses (or ranges) at the same time.
>>>
>>> The problem is that we must rely on hmm_range.notifier_seq to detect changes to the page tables in question, but that in turn works only if you have one hmm_range structure and not multiple.
>>>
>>> What might work is doing an XOR or CRC over all hmm_range.notifier_seq you have, but that is a bit flaky.
>>>
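A small illustration of why folding several sequence numbers with XOR
is fragile: two different sets of values can fold to the same result.
The helper xor_fold_seqs() and the values below are contrived for
illustration and do not come from the patch set or from the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Folds an array of sequence numbers into one value with XOR. */
static unsigned long xor_fold_seqs(const unsigned long *seqs, size_t n)
{
	unsigned long acc = 0;
	size_t i;

	assert(seqs != NULL || n == 0);

	for (i = 0; i < n; i++)
		acc ^= seqs[i];
	return acc;
}
```

For example, the pair {2, 4} and the pair {8, 14} both fold to 6, so a
check that compares only the folded value would miss that both
sequence numbers advanced in between.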
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> 3. Better code organization: Split into 8 focused patches for easier review
>>>>
>>>> v2:
>>>>      - Each CPU VA range gets its own mmu_interval_notifier for invalidation
>>>>      - All ranges validated together and mapped to contiguous GPU VA
>>>>      - Single kgd_mem object with array of user_range_info structures
>>>>      - Unified eviction/restore path for all ranges in a batch
>>>>
>>>> Current Implementation Approach
>>>> ===============================
>>>>
>>>> This series implements a practical solution within existing kernel constraints:
>>>>
>>>> 1. Single MMU notifier for VA span: Register one notifier covering the
>>>>      entire range from lowest to highest address in the batch
>>>>
>>>> 2. Interval tree filtering: Use interval tree to efficiently identify
>>>>      which specific ranges are affected during invalidation callbacks,
>>>>      avoiding unnecessary processing for unrelated address changes
>>>>
>>>> 3. Unified eviction/restore: All ranges in a batch share eviction and
>>>>      restore paths, maintaining consistency with existing userptr handling
>>>>
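A minimal sketch of how the notifier span in point 1 relates to the
ranges it covers. The struct and function names are hypothetical, and
va_max is assumed to be the inclusive last byte of the highest range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one CPU VA range in a batch. */
struct va_range {
	uint64_t start;
	uint64_t size;
};

struct va_span {
	uint64_t va_min;	/* lowest start address in the batch */
	uint64_t va_max;	/* highest last byte, inclusive */
	uint64_t covered;	/* bytes actually backed by ranges */
};

/* Computes the single span one notifier would have to cover. */
static struct va_span batch_span(const struct va_range *r, size_t n)
{
	struct va_span s = { UINT64_MAX, 0, 0 };
	size_t i;

	assert(r != NULL && n > 0);

	for (i = 0; i < n; i++) {
		uint64_t last;

		assert(r[i].size > 0);
		last = r[i].start + r[i].size - 1;

		if (r[i].start < s.va_min)
			s.va_min = r[i].start;
		if (last > s.va_max)
			s.va_max = last;
		s.covered += r[i].size;
	}
	return s;
}
```

For scattered allocations the span can be far larger than the bytes it
actually covers, which is why the invalidation callback filters by
range before doing any work.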
>>>> Patch Series Overview
>>>> =====================
>>>>
>>>> Patch 1/8: Add userptr batch allocation UAPI structures
>>>>       - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
>>>>       - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures
>>>>
>>>> Patch 2/8: Add user_range_info infrastructure to kgd_mem
>>>>       - user_range_info structure for per-range tracking
>>>>       - Fields for batch allocation in kgd_mem
>>>>
>>>> Patch 3/8: Implement interval tree for userptr ranges
>>>>       - Interval tree for efficient range lookup during invalidation
>>>>       - mark_invalid_ranges() function
>>>>
>>>> Patch 4/8: Add batch MMU notifier support
>>>>       - Single notifier for entire VA span
>>>>       - Invalidation callback using interval tree filtering
>>>>
>>>> Patch 5/8: Implement batch userptr page management
>>>>       - get_user_pages_batch() and set_user_pages_batch()
>>>>       - Per-range page array management
>>>>
>>>> Patch 6/8: Add batch allocation function and export API
>>>>       - init_user_pages_batch() main initialization
>>>>       - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
>>>>
>>>> Patch 7/8: Unify userptr cleanup and update paths
>>>>       - Shared eviction/restore handling for batch allocations
>>>>       - Integration with existing userptr validation flows
>>>>
>>>> Patch 8/8: Wire up batch allocation in ioctl handler
>>>>       - Input validation and range array parsing
>>>>       - Integration with existing alloc_memory_of_gpu path
>>>>
>>>> Testing
>>>> =======
>>>>
>>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>>> - Various allocation sizes (4 KB to 1 GB+ per range)
>>>> - Memory pressure scenarios and eviction/restore cycles
>>>> - OpenCL CTS and HIP catch tests in KVM guest environment
>>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>>>> - Small LLM inference (3B-7B models)
>>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>>> - Performance improvement: 2x-2.4x faster than userspace approach
>>>>
>>>> Thank you for your review and feedback.
>>>>
>>>> Best regards,
>>>> Honglei Huang
>>>>
>>>> Honglei Huang (8):
>>>>     drm/amdkfd: Add userptr batch allocation UAPI structures
>>>>     drm/amdkfd: Add user_range_info infrastructure to kgd_mem
>>>>     drm/amdkfd: Implement interval tree for userptr ranges
>>>>     drm/amdkfd: Add batch MMU notifier support
>>>>     drm/amdkfd: Implement batch userptr page management
>>>>     drm/amdkfd: Add batch allocation function and export API
>>>>     drm/amdkfd: Unify userptr cleanup and update paths
>>>>     drm/amdkfd: Wire up batch allocation in ioctl handler
>>>>
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |  23 +
>>>>    .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 539 +++++++++++++++++-
>>>>    drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 128 ++++-
>>>>    include/uapi/linux/kfd_ioctl.h                |  31 +-
>>>>    4 files changed, 697 insertions(+), 24 deletions(-)
>>>>
>>>
>>
>