Message-ID: <cffdd191-2fdd-4b5d-abf2-4cf77b96b681@amd.com>
Date: Mon, 9 Feb 2026 23:46:46 +0800
From: Honglei Huang <honghuan@....com>
To: Christian König <christian.koenig@....com>
Cc: Felix.Kuehling@....com, Philip.Yang@....com, Ray.Huang@....com,
 alexander.deucher@....com, dmitry.osipenko@...labora.com,
 Xinhui.Pan@....com, airlied@...il.com, daniel@...ll.ch,
 amd-gfx@...ts.freedesktop.org, dri-devel@...ts.freedesktop.org,
 linux-kernel@...r.kernel.org, linux-mm@...ck.org, akpm@...ux-foundation.org
Subject: Re: [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support



I agree that with many ranges the probability of
cross-invalidation during sequential hmm_range_fault() calls
increases, and in an extreme scenario this could lead to excessive
retries. I had been focused on proving correctness and missed the
scalability aspect.

Here is my plan going forward:

In the short term, I will add a retry limit similar to what DRM GPU SVM
does with DRM_GPUSVM_MAX_RETRIES. This bounds the worst case and may be
enough to make the current batch userptr usable.
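
As a rough sketch of what I have in mind (the function and constant
names below are placeholders, not the final patch code):

    /* Illustrative only: names are placeholders, not the real amdkfd code. */
    #define KFD_USERPTR_BATCH_MAX_RETRIES    3

    static int kfd_userptr_batch_restore(struct kgd_mem *mem)
    {
        int retries = 0;
        int ret;

        do {
            /* per-range hmm_range_fault() + commit; -EAGAIN if mem->invalid moved */
            ret = kfd_userptr_batch_restore_once(mem);
        } while (ret == -EAGAIN &&
                 ++retries < KFD_USERPTR_BATCH_MAX_RETRIES);

        /*
         * Bounding the retries keeps concurrent invalidations from starving
         * the restore path; on failure the restore worker can simply be
         * rescheduled instead of spinning forever.
         */
        return ret;
    }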

Longer term, I agree that teaching walk_page_range() to handle
non-contiguous VA sets in a single walk is the proper solution, and
that work would benefit more than just KFD batch userptr. I will keep
working toward a better solution there.

Regards,
Honglei

On 2026/2/9 23:07, Christian König wrote:
> On 2/9/26 15:44, Honglei Huang wrote:
>> You said that DRM GPU SVM has the same pattern, but argued
>> that it is not designed for "batch userptr". However, this distinction
>> has no technical significance. The core problem is "multiple ranges
>> under one wide notifier doing per-range hmm_range_fault". Whether
>> these ranges are dynamically created by GPU page faults or
>> batch-specified via ioctl, the concurrency safety mechanism is
>> the same.
>>
>> You said "each hmm_range_fault() can invalidate the other ranges
>> while faulting them in". Yes, this can happen but this is precisely
>> the scenario that mem->invalid catches:
>>
>>    1. hmm_range_fault(A) succeeds
>>    2. hmm_range_fault(B) triggers reclaim → A's pages swapped out
>>       → MMU notifier callback:
>>         mutex_lock(notifier_lock)
>>           range_A->valid = false
>>           mem->invalid++
>>         mutex_unlock(notifier_lock)
>>    3. hmm_range_fault(B) completes
>>    4. Commit phase:
>>         mutex_lock(notifier_lock)
>>           mem->invalid != saved_invalid
>>           → return -EAGAIN, retry entire batch
>>         mutex_unlock(notifier_lock)
>>
>>   Invalid pages are never committed.
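
(For illustration, the commit-phase check in step 4 amounts to roughly
the following; the lock and field names follow this thread, the rest is
only a sketch.)

    /* Sketch of the commit phase, not the literal patch code. */
    mutex_lock(&process_info->notifier_lock);
    if (mem->invalid != saved_invalid) {
        /* some range was invalidated while another one was being faulted */
        mutex_unlock(&process_info->notifier_lock);
        return -EAGAIN;    /* caller retries the whole batch */
    }
    /* ... commit the faulted pages and mark the ranges valid ... */
    mutex_unlock(&process_info->notifier_lock);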
> 
> Once more that is not the problem. I completely agree that this is all correctly handled.
> 
> The problem is that the more hmm_ranges you get, the more likely it is that getting another pfn invalidates a pfn you previously acquired.
> 
> So this can end up in an endless loop, and that's why the GPUSVM code also has a timeout on the retry.
> 
> 
> What you need to figure out is how to teach hmm_range_fault() and the underlying walk_page_range() to skip entries which you are not interested in.
> 
> Just a trivial example, assuming you have the following VAs you want your userptr to be filled in with: 3, 1, 5, 8, 7, 2
> 
> To handle this case you need to build a data structure which tells you what is the smallest, the largest, and where each VA in the middle comes in. So you need something like: 1->1, 2->5, 3->0, 5->2, 7->4, 8->3
> 
> Then you would call walk_page_range(mm, 1, 8, ops, data); the pud walk decides if it needs to go into the pmd or eventually fault, the pmd walk decides if PTEs need to be filled in, etc...
> 
> The final pte handler then fills in the pfns linearly for the addresses you need.
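
(As a rough sketch of that shape; the struct and lookup helper below are
hypothetical, and real code would still have to drive faulting for
non-present entries, which is the hard part.)

    /*
     * Hypothetical sketch: map each wanted page-aligned VA to its slot in
     * the output pfn array, then let one page walk fill the slots and skip
     * everything else. Like any page walk this runs under the mmap lock.
     */
    struct batch_walk {
        unsigned long *vas;      /* sorted wanted VAs                  */
        unsigned long *slots;    /* output slot for each sorted VA     */
        unsigned long nr;
        unsigned long *pfns;     /* result array, one entry per VA     */
    };

    /* hypothetical helper: binary search over bw->vas */
    static bool batch_walk_lookup(struct batch_walk *bw, unsigned long addr,
                                  unsigned long *idx);

    static int batch_pte_entry(pte_t *pte, unsigned long addr,
                               unsigned long next, struct mm_walk *walk)
    {
        struct batch_walk *bw = walk->private;
        unsigned long idx;

        if (!batch_walk_lookup(bw, addr, &idx))
            return 0;                 /* not interested, keep walking */

        if (!pte_present(*pte))
            return -EBUSY;            /* would need to fault this one in */

        bw->pfns[bw->slots[idx]] = pte_pfn(*pte);
        return 0;
    }

    static const struct mm_walk_ops batch_walk_ops = {
        .pte_entry = batch_pte_entry,
    };

    /* walk_page_range(mm, lowest_va, highest_va + PAGE_SIZE, &batch_walk_ops, &bw); */

With the example above, the sorted vas would be {1, 2, 3, 5, 7, 8} and
slots {1, 5, 0, 2, 4, 3}.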
> 
> And yeah, I know perfectly well that this is horribly complicated, but as far as I can see everything else will just not scale.
> 
> Creating hundreds of separate userptrs only scales up to a few megabytes and then falls apart.
> 
> Regards,
> Christian.
> 
>>
>> Regards,
>> Honglei
>>
>>
>> On 2026/2/9 22:25, Christian König wrote:
>>> On 2/9/26 15:16, Honglei Huang wrote:
>>>> The case you described, one hmm_range_fault() invalidating another's
>>>> seq under the same notifier, is already handled in the implementation.
>>>>
>>>>    Example: suppose ranges A, B, and C share one notifier:
>>>>
>>>>     1. hmm_range_fault(A) succeeds, seq_A recorded
>>>>     2. External invalidation occurs, triggers callback:
>>>>        mutex_lock(notifier_lock)
>>>>          → mmu_interval_set_seq()
>>>>          → range_A->valid = false
>>>>          → mem->invalid++
>>>>        mutex_unlock(notifier_lock)
>>>>     3. hmm_range_fault(B) succeeds
>>>>     4. Commit phase:
>>>>        mutex_lock(notifier_lock)
>>>>          → check mem->invalid != saved_invalid
>>>>          → return -EAGAIN, retry the entire batch
>>>>        mutex_unlock(notifier_lock)
>>>>
>>>> All concurrent invalidations are caught by the mem->invalid counter.
>>>> Additionally, amdgpu_ttm_tt_get_user_pages_done() in confirm_valid_user_pages_locked
>>>> performs a per-range mmu_interval_read_retry() as a final safety check.
>>>>
>>>> DRM GPU SVM uses the same approach: drm_gpusvm_get_pages() also calls
>>>> hmm_range_fault() per-range independently; there is no array version
>>>> of hmm_range_fault() in DRM GPU SVM either. If you consider this approach
>>>> unworkable, then DRM GPU SVM would be unworkable too, yet it has been
>>>> accepted upstream.
>>>>
>>>> The number of batch ranges is controllable. And even if it
>>>> scales to thousands, DRM GPU SVM faces exactly the same situation:
>>>> it does not need an array version of hmm_range_fault either, which
>>>> shows this is a correctness question, not a performance one. For
>>>> correctness, I believe DRM GPU SVM already demonstrates the approach
>>>> is ok.
>>>
>>> Well yes, GPU SVM would have exactly the same problems. But it also doesn't have a bulk userptr creation interface.
>>>
>>> The implementation is simply not made for this use case, and as far as I know no current upstream implementation is.
>>>
>>>> For performance, I have tested with thousands of ranges present:
>>>> performance reaches 80%~95% of the native driver, and all OpenCL
>>>> and ROCr test suites pass with no correctness issues.
>>>
>>> Testing can only falsify a system and not verify it.
>>>
>>>> Here is how DRM GPU SVM handles correctness with multiple ranges
>>>> under one wide notifier doing per-range hmm_range_fault:
>>>>
>>>>     Invalidation: drm_gpusvm_notifier_invalidate()
>>>>       - Acquires notifier_lock
>>>>       - Calls mmu_interval_set_seq()
>>>>       - Iterates affected ranges via driver callback (xe_svm_invalidate)
>>>>       - Clears has_dma_mapping = false for each affected range (under lock)
>>>>       - Releases notifier_lock
>>>>
>>>>     Fault: drm_gpusvm_get_pages()  (called per-range independently)
>>>>       - mmu_interval_read_begin() to get seq
>>>>       - hmm_range_fault() outside lock
>>>>       - Acquires notifier_lock
>>>>       - mmu_interval_read_retry() → if stale, release lock and retry
>>>>       - DMA map pages + set has_dma_mapping = true (under lock)
>>>>       - Releases notifier_lock
>>>>
>>>>     Validation: drm_gpusvm_pages_valid()
>>>>       - Checks has_dma_mapping flag (under lock), NOT seq
>>>>
>>>> If invalidation occurs between two per-range faults, the flag is
>>>> cleared under lock, and either mmu_interval_read_retry catches it
>>>> in the current fault, or drm_gpusvm_pages_valid() catches it at
>>>> validation time. No stale pages are ever committed.
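
(For illustration, the flag-based check is conceptually just the
following; this is a sketch and the exact field layout in drm_gpusvm
differs.)

    /* Conceptual sketch; see drm_gpusvm_pages_valid() for the real code. */
    lockdep_assert_held(&gpusvm->notifier_lock);
    /* the per-range flag, not the notifier seq, decides validity */
    return range->flags.has_dma_mapping;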
>>>>
>>>> KFD batch userptr uses the same three-step pattern:
>>>>
>>>>     Invalidation: amdgpu_amdkfd_evict_userptr_batch()
>>>>       - Acquires notifier_lock
>>>>       - Calls mmu_interval_set_seq()
>>>>       - Iterates affected ranges via interval_tree
>>>>       - Sets range->valid = false for each affected range (under lock)
>>>>       - Increments mem->invalid (under lock)
>>>>       - Releases notifier_lock
>>>>
>>>>     Fault: update_invalid_user_pages()
>>>>       - Per-range hmm_range_fault() outside lock
>>>
>>> And here the idea falls apart. Each hmm_range_fault() can invalidate the other ranges while faulting them in.
>>>
>>> That is not fundamentally solvable, but moving the handling further into hmm_range_fault() makes it much less likely that something goes wrong.
>>>
>>> So once more: as long as this still uses this hacky approach, I will clearly reject this implementation.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>       - Acquires notifier_lock
>>>>       - Checks mem->invalid != saved_invalid → if changed, -EAGAIN retry
>>>>       - Sets range->valid = true for faulted ranges (under lock)
>>>>       - Releases notifier_lock
>>>>
>>>>     Validation: valid_user_pages_batch()
>>>>       - Checks range->valid flag
>>>>       - Calls amdgpu_ttm_tt_get_user_pages_done() (mmu_interval_read_retry)
>>>>
>>>> The logic is equivalent as far as I can see.
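
(As an illustration of the shared invalidation side; identifiers below
follow this thread, the body is a sketch, not the actual patch.)

    /* Sketch of the batch invalidation callback; illustrative only. */
    static bool batch_userptr_invalidate(struct mmu_interval_notifier *mni,
                                         const struct mmu_notifier_range *range,
                                         unsigned long cur_seq)
    {
        struct kgd_mem *mem = container_of(mni, struct kgd_mem, notifier);
        struct amdkfd_process_info *process_info = mem->process_info;
        struct interval_tree_node *node;

        /* real code must also handle the non-blockable case with a trylock */
        mutex_lock(&process_info->notifier_lock);
        mmu_interval_set_seq(mni, cur_seq);

        /* touch only the ranges overlapping the invalidated span */
        for (node = interval_tree_iter_first(&mem->range_tree,
                                             range->start, range->end - 1);
             node;
             node = interval_tree_iter_next(node, range->start,
                                            range->end - 1)) {
            struct user_range_info *ri =
                container_of(node, struct user_range_info, it_node);

            ri->valid = false;       /* per-range flag, under the lock */
        }
        mem->invalid++;              /* batch-wide retry counter       */

        mutex_unlock(&process_info->notifier_lock);
        return true;
    }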
>>>>
>>>> Regards,
>>>> Honglei
>>>>
>>>>
>>>>
>>>> On 2026/2/9 21:27, Christian König wrote:
>>>>> On 2/9/26 14:11, Honglei Huang wrote:
>>>>>>
>>>>>> So is DRM GPU SVM also a NAK?
>>>>>>
>>>>>> This code has passed local testing with OpenCL and ROCr, and I also provided a detailed code path and analysis.
>>>>>> You only stated the conclusion without providing any reasons or evidence, which makes it hard to be
>>>>>> convinced so far.
>>>>>
>>>>> That sounds like you don't understand what the issue here is, so I will try to explain it once more with pseudo-code.
>>>>>
>>>>> Page tables are updated without holding a lock, so when you want to grab physical addresses from them you need to use an opportunistic, retry-based approach to make sure that the data you got is still valid.
>>>>>
>>>>> In other words something like this here is needed:
>>>>>
>>>>> retry:
>>>>>       hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
>>>>>       hmm_range.hmm_pfns = kvmalloc_array(npages, ...);
>>>>> ...
>>>>>       while (true) {
>>>>>           mmap_read_lock(mm);
>>>>>           err = hmm_range_fault(&hmm_range);
>>>>>           mmap_read_unlock(mm);
>>>>>
>>>>>           if (err == -EBUSY) {
>>>>>               if (time_after(jiffies, timeout))
>>>>>                   break;
>>>>>
>>>>>               hmm_range.notifier_seq =
>>>>>                   mmu_interval_read_begin(notifier);
>>>>>               continue;
>>>>>           }
>>>>>           break;
>>>>>       }
>>>>> ...
>>>>>       for (i = 0, j = 0; i < npages; ++j) {
>>>>> ...
>>>>>           dma_map_page(...)
>>>>> ...
>>>>>       grab_notifier_lock();
>>>>>       if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq))
>>>>>           goto retry;
>>>>>       restart_queues();
>>>>>       drop_notifier_lock();
>>>>> ...
>>>>>
>>>>> Now hmm_range.notifier_seq indicates if your DMA addresses are still valid or not after you grabbed the notifier lock.
>>>>>
>>>>> The problem is that hmm_range works only on a single range/sequence combination, so when you do multiple calls to hmm_range_fault() for scattered VAs it can easily be that one call invalidates the ranges of another call.
>>>>>
>>>>> So as long as you only have a few hundred hmm_ranges for your userptrs that kind of works, but it doesn't scale up into the thousands of different VA addresses you get for scattered handling.
>>>>>
>>>>> That's why hmm_range_fault needs to be modified to handle an array of VA addresses instead of just an A..B range.
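
(Purely to illustrate the shape of such an interface; this is
hypothetical, nothing like it exists in the kernel today.)

    /* Hypothetical interface, for illustration only. */
    struct hmm_range_array {
        struct mmu_interval_notifier *notifier;
        unsigned long notifier_seq;      /* one sequence for the whole set  */
        const unsigned long *vas;        /* scattered, page-aligned VAs     */
        unsigned long npages;
        unsigned long *hmm_pfns;         /* one result slot per VA          */
        unsigned long default_flags;
    };

    int hmm_range_array_fault(struct hmm_range_array *range);

A single walk over [min(vas), max(vas)] could then fault and collect all
entries under one notifier_seq, which is what removes the
cross-invalidation window between per-range calls.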
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>
>>>>>>
>>>>>> On 2026/2/9 20:59, Christian König wrote:
>>>>>>> On 2/9/26 13:52, Honglei Huang wrote:
>>>>>>>> DRM GPU SVM does use hmm_range_fault(), see drm_gpusvm_get_pages()
>>>>>>>
>>>>>>> I'm not sure what you are talking about; drm_gpusvm_get_pages() only supports a single range as well, not scatter/gather of VA addresses.
>>>>>>>
>>>>>>> As far as I can see that doesn't help in the slightest.
>>>>>>>
>>>>>>>> My implementation follows the same pattern. The detailed comparison
>>>>>>>> of the invalidation path was provided in the second half of my previous mail.
>>>>>>>
>>>>>>> Yeah, and as I said that is not very valuable because it doesn't solve the sequence problem.
>>>>>>>
>>>>>>> As far as I can see the approach you are trying here is a clear NAK from my side.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>> On 2026/2/9 18:16, Christian König wrote:
>>>>>>>>> On 2/9/26 07:14, Honglei Huang wrote:
>>>>>>>>>>
>>>>>>>>>> I've reworked the implementation in v4. The fix is actually inspired
>>>>>>>>>> by the DRM GPU SVM framework (drivers/gpu/drm/drm_gpusvm.c).
>>>>>>>>>>
>>>>>>>>>> DRM GPU SVM uses wide notifiers (recommended 512M or larger) to track
>>>>>>>>>> multiple user virtual address ranges under a single mmu_interval_notifier,
>>>>>>>>>> and these ranges can be non-contiguous, which is essentially the same
>>>>>>>>>> problem that batch userptr needs to solve: one BO backed by multiple
>>>>>>>>>> non-contiguous CPU VA ranges sharing one notifier.
>>>>>>>>>
>>>>>>>>> That still doesn't solve the sequencing problem.
>>>>>>>>>
>>>>>>>>> As far as I can see you can't use hmm_range_fault with this approach or it would just not be very valuable.
>>>>>>>>>
>>>>>>>>> So how should that work with your patch set?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The wide notifier is created in drm_gpusvm_notifier_alloc:
>>>>>>>>>>        notifier->itree.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
>>>>>>>>>>        notifier->itree.last = ALIGN(fault_addr + 1, gpusvm->notifier_size) - 1;
>>>>>>>>>> The Xe driver passes
>>>>>>>>>>        xe_modparam.svm_notifier_size * SZ_1M in xe_svm_init
>>>>>>>>>> as the notifier_size, so one notifier can cover many MB of VA space
>>>>>>>>>> containing multiple non-contiguous ranges.
>>>>>>>>>>
>>>>>>>>>> And DRM GPU SVM solves the per-range validity problem with flag-based
>>>>>>>>>> validation instead of seq-based validation in:
>>>>>>>>>>        - drm_gpusvm_pages_valid() checks
>>>>>>>>>>            flags.has_dma_mapping
>>>>>>>>>>          not notifier_seq. The comment explicitly states:
>>>>>>>>>>            "This is akin to a notifier seqno check in the HMM documentation
>>>>>>>>>>             but due to wider notifiers (i.e., notifiers which span multiple
>>>>>>>>>>             ranges) this function is required for finer grained checking"
>>>>>>>>>>        - __drm_gpusvm_unmap_pages() clears
>>>>>>>>>>            flags.has_dma_mapping = false  under notifier_lock
>>>>>>>>>>        - drm_gpusvm_get_pages() sets
>>>>>>>>>>            flags.has_dma_mapping = true  under notifier_lock
>>>>>>>>>> I adopted the same approach.
>>>>>>>>>>
>>>>>>>>>> DRM GPU SVM:
>>>>>>>>>>        drm_gpusvm_notifier_invalidate()
>>>>>>>>>>          down_write(&gpusvm->notifier_lock);
>>>>>>>>>>          mmu_interval_set_seq(mni, cur_seq);
>>>>>>>>>>          gpusvm->ops->invalidate()
>>>>>>>>>>            -> xe_svm_invalidate()
>>>>>>>>>>               drm_gpusvm_for_each_range()
>>>>>>>>>>                 -> __drm_gpusvm_unmap_pages()
>>>>>>>>>>                    WRITE_ONCE(flags.has_dma_mapping = false);  // clear flag
>>>>>>>>>>          up_write(&gpusvm->notifier_lock);
>>>>>>>>>>
>>>>>>>>>> KFD batch userptr:
>>>>>>>>>>        amdgpu_amdkfd_evict_userptr_batch()
>>>>>>>>>>          mutex_lock(&process_info->notifier_lock);
>>>>>>>>>>          mmu_interval_set_seq(mni, cur_seq);
>>>>>>>>>>          discard_invalid_ranges()
>>>>>>>>>>            interval_tree_iter_first/next()
>>>>>>>>>>              range_info->valid = false;          // clear flag
>>>>>>>>>>          mutex_unlock(&process_info->notifier_lock);
>>>>>>>>>>
>>>>>>>>>> Both implementations:
>>>>>>>>>>        - Acquire notifier_lock FIRST, before any flag changes
>>>>>>>>>>        - Call mmu_interval_set_seq() under the lock
>>>>>>>>>>        - Use interval tree to find affected ranges within the wide notifier
>>>>>>>>>>        - Mark per-range flag as invalid/valid under the lock
>>>>>>>>>>
>>>>>>>>>> The page fault path and final validation path also follow the same
>>>>>>>>>> pattern as DRM GPU SVM: fault outside the lock, set/check per-range
>>>>>>>>>> flag under the lock.
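
(A condensed sketch of that fault/validate pattern; identifiers follow
the thread, this is not the literal code.)

    /* Sketch: fault outside the notifier lock, then check/commit under it. */
    retry:
        seq = mmu_interval_read_begin(&mem->notifier);

        mmap_read_lock(mm);
        ret = hmm_range_fault(&range);          /* fault outside notifier_lock */
        mmap_read_unlock(mm);
        if (ret)
            goto out;

        mutex_lock(&process_info->notifier_lock);
        if (mmu_interval_read_retry(&mem->notifier, seq)) {
            /* an invalidation raced with the fault */
            mutex_unlock(&process_info->notifier_lock);
            goto retry;
        }
        range_info->valid = true;               /* checked again at map time */
        mutex_unlock(&process_info->notifier_lock);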
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Honglei
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2026/2/6 21:56, Christian König wrote:
>>>>>>>>>>> On 2/6/26 07:25, Honglei Huang wrote:
>>>>>>>>>>>> From: Honglei Huang <honghuan@....com>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> This is v3 of the patch series to support allocating multiple non-contiguous
>>>>>>>>>>>> CPU virtual address ranges that map to a single contiguous GPU virtual address.
>>>>>>>>>>>>
>>>>>>>>>>>> v3:
>>>>>>>>>>>> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>>>>>>>>>>>>          - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>>>>>>>>>>>
>>>>>>>>>>> That is most likely not the best approach, but Felix or Philip need to comment here since I don't know such IOCTLs well either.
>>>>>>>>>>>
>>>>>>>>>>>>          - When flag is set, mmap_offset field points to range array
>>>>>>>>>>>>          - Minimal API surface change
>>>>>>>>>>>
>>>>>>>>>>> Why a range of VA space for each entry?
>>>>>>>>>>>
>>>>>>>>>>>> 2. Improved MMU notifier handling:
>>>>>>>>>>>>          - Single mmu_interval_notifier covering the VA span [va_min, va_max]
>>>>>>>>>>>>          - Interval tree for efficient lookup of affected ranges during invalidation
>>>>>>>>>>>>          - Avoids per-range notifier overhead mentioned in v2 review
>>>>>>>>>>>
>>>>>>>>>>> That won't work unless you also modify hmm_range_fault() to take multiple VA addresses (or ranges) at the same time.
>>>>>>>>>>>
>>>>>>>>>>> The problem is that we must rely on hmm_range.notifier_seq to detect changes to the page tables in question, but that in turn works only if you have one hmm_range structure and not multiple.
>>>>>>>>>>>
>>>>>>>>>>> What might work is doing an XOR or CRC over all the hmm_range.notifier_seq values you have, but that is a bit flaky.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 3. Better code organization: Split into 8 focused patches for easier review
>>>>>>>>>>>>
>>>>>>>>>>>> v2:
>>>>>>>>>>>>          - Each CPU VA range gets its own mmu_interval_notifier for invalidation
>>>>>>>>>>>>          - All ranges validated together and mapped to contiguous GPU VA
>>>>>>>>>>>>          - Single kgd_mem object with array of user_range_info structures
>>>>>>>>>>>>          - Unified eviction/restore path for all ranges in a batch
>>>>>>>>>>>>
>>>>>>>>>>>> Current Implementation Approach
>>>>>>>>>>>> ===============================
>>>>>>>>>>>>
>>>>>>>>>>>> This series implements a practical solution within existing kernel constraints:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Single MMU notifier for VA span: Register one notifier covering the
>>>>>>>>>>>>          entire range from lowest to highest address in the batch
>>>>>>>>>>>>
>>>>>>>>>>>> 2. Interval tree filtering: Use interval tree to efficiently identify
>>>>>>>>>>>>          which specific ranges are affected during invalidation callbacks,
>>>>>>>>>>>>          avoiding unnecessary processing for unrelated address changes
>>>>>>>>>>>>
>>>>>>>>>>>> 3. Unified eviction/restore: All ranges in a batch share eviction and
>>>>>>>>>>>>          restore paths, maintaining consistency with existing userptr handling
>>>>>>>>>>>>
>>>>>>>>>>>> Patch Series Overview
>>>>>>>>>>>> =====================
>>>>>>>>>>>>
>>>>>>>>>>>> Patch 1/8: Add userptr batch allocation UAPI structures
>>>>>>>>>>>>           - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
>>>>>>>>>>>>           - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures
>>>>>>>>>>>>
>>>>>>>>>>>> Patch 2/8: Add user_range_info infrastructure to kgd_mem
>>>>>>>>>>>>           - user_range_info structure for per-range tracking
>>>>>>>>>>>>           - Fields for batch allocation in kgd_mem
>>>>>>>>>>>>
>>>>>>>>>>>> Patch 3/8: Implement interval tree for userptr ranges
>>>>>>>>>>>>           - Interval tree for efficient range lookup during invalidation
>>>>>>>>>>>>           - mark_invalid_ranges() function
>>>>>>>>>>>>
>>>>>>>>>>>> Patch 4/8: Add batch MMU notifier support
>>>>>>>>>>>>           - Single notifier for entire VA span
>>>>>>>>>>>>           - Invalidation callback using interval tree filtering
>>>>>>>>>>>>
>>>>>>>>>>>> Patch 5/8: Implement batch userptr page management
>>>>>>>>>>>>           - get_user_pages_batch() and set_user_pages_batch()
>>>>>>>>>>>>           - Per-range page array management
>>>>>>>>>>>>
>>>>>>>>>>>> Patch 6/8: Add batch allocation function and export API
>>>>>>>>>>>>           - init_user_pages_batch() main initialization
>>>>>>>>>>>>           - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
>>>>>>>>>>>>
>>>>>>>>>>>> Patch 7/8: Unify userptr cleanup and update paths
>>>>>>>>>>>>           - Shared eviction/restore handling for batch allocations
>>>>>>>>>>>>           - Integration with existing userptr validation flows
>>>>>>>>>>>>
>>>>>>>>>>>> Patch 8/8: Wire up batch allocation in ioctl handler
>>>>>>>>>>>>           - Input validation and range array parsing
>>>>>>>>>>>>           - Integration with existing alloc_memory_of_gpu path
>>>>>>>>>>>>
>>>>>>>>>>>> Testing
>>>>>>>>>>>> =======
>>>>>>>>>>>>
>>>>>>>>>>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>>>>>>>>>>> - Various allocation sizes (4KB to 1G+ per range)
>>>>>>>>>>>> - Memory pressure scenarios and eviction/restore cycles
>>>>>>>>>>>> - OpenCL CTS and HIP catch tests in KVM guest environment
>>>>>>>>>>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>>>>>>>>>>>> - Small LLM inference (3B-7B models)
>>>>>>>>>>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>>>>>>>>>>> - Performance improvement: 2x-2.4x faster than userspace approach
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you for your review and feedback.
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Honglei Huang
>>>>>>>>>>>>
>>>>>>>>>>>> Honglei Huang (8):
>>>>>>>>>>>>         drm/amdkfd: Add userptr batch allocation UAPI structures
>>>>>>>>>>>>         drm/amdkfd: Add user_range_info infrastructure to kgd_mem
>>>>>>>>>>>>         drm/amdkfd: Implement interval tree for userptr ranges
>>>>>>>>>>>>         drm/amdkfd: Add batch MMU notifier support
>>>>>>>>>>>>         drm/amdkfd: Implement batch userptr page management
>>>>>>>>>>>>         drm/amdkfd: Add batch allocation function and export API
>>>>>>>>>>>>         drm/amdkfd: Unify userptr cleanup and update paths
>>>>>>>>>>>>         drm/amdkfd: Wire up batch allocation in ioctl handler
>>>>>>>>>>>>
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |  23 +
>>>>>>>>>>>>        .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 539 +++++++++++++++++-
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 128 ++++-
>>>>>>>>>>>>        include/uapi/linux/kfd_ioctl.h                |  31 +-
>>>>>>>>>>>>        4 files changed, 697 insertions(+), 24 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 

