Message-ID: <8ba8e4f2-89f2-4968-a291-e36e6fc8ab9b@amd.com>
Date: Mon, 9 Feb 2026 14:14:47 +0800
From: Honglei Huang <honghuan@....com>
To: Felix.Kuehling@....com, Christian König
<christian.koenig@....com>, alexander.deucher@....com, Philip.Yang@....com,
Ray.Huang@....com
Cc: dmitry.osipenko@...labora.com, Xinhui.Pan@....com, airlied@...il.com,
daniel@...ll.ch, amd-gfx@...ts.freedesktop.org,
dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, akpm@...ux-foundation.org
Subject: Re: [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support
I've reworked the implementation in v4. The fix is inspired by the
DRM GPU SVM framework (drivers/gpu/drm/drm_gpusvm.c).

DRM GPU SVM uses wide notifiers (512M or larger is recommended) to track
multiple user virtual address ranges under a single mmu_interval_notifier.
These ranges can be non-contiguous, which is essentially the same
problem that batch userptr needs to solve: one BO backed by multiple
non-contiguous CPU VA ranges sharing one notifier.

The wide notifier is created in drm_gpusvm_notifier_alloc():

  notifier->itree.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
  notifier->itree.last = ALIGN(fault_addr + 1, gpusvm->notifier_size) - 1;

The Xe driver passes xe_modparam.svm_notifier_size * SZ_1M as the
notifier_size in xe_svm_init(), so one notifier can cover many MB of
VA space containing multiple non-contiguous ranges.
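
To make the alignment arithmetic concrete, here is a minimal userspace
sketch of the bounds computation. ALIGN_DOWN_POW2/ALIGN_UP_POW2 are
stand-ins for the kernel's ALIGN_DOWN()/ALIGN() macros, and struct
notifier_span and notifier_span_for() are hypothetical names used only
for illustration; they are not part of drm_gpusvm.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel macros; 'a' must be a power of two. */
#define ALIGN_DOWN_POW2(x, a) ((x) & ~((uint64_t)(a) - 1))
#define ALIGN_UP_POW2(x, a)   (((x) + ((uint64_t)(a) - 1)) & ~((uint64_t)(a) - 1))

/* Hypothetical struct mirroring notifier->itree.start / .last. */
struct notifier_span {
	uint64_t start;	/* first byte covered by the notifier */
	uint64_t last;	/* last byte covered (inclusive) */
};

/* Compute the wide-notifier interval that contains fault_addr. */
static struct notifier_span notifier_span_for(uint64_t fault_addr,
					       uint64_t notifier_size)
{
	struct notifier_span s;

	/* The mask-based alignment only works for power-of-two sizes. */
	assert(notifier_size && (notifier_size & (notifier_size - 1)) == 0);

	s.start = ALIGN_DOWN_POW2(fault_addr, notifier_size);
	s.last = ALIGN_UP_POW2(fault_addr + 1, notifier_size) - 1;
	return s;
}
```

With a 512M notifier size, any fault address inside the same 512M-aligned
window yields the same [start, last] interval, which is why several
unrelated ranges end up sharing one notifier.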

DRM GPU SVM solves the per-range validity problem with flag-based
validation instead of seq-based validation:

  - drm_gpusvm_pages_valid() checks flags.has_dma_mapping, not
    notifier_seq. The comment explicitly states:
      "This is akin to a notifier seqno check in the HMM documentation
       but due to wider notifiers (i.e., notifiers which span multiple
       ranges) this function is required for finer grained checking"
  - __drm_gpusvm_unmap_pages() clears
    flags.has_dma_mapping = false under notifier_lock
  - drm_gpusvm_get_pages() sets
    flags.has_dma_mapping = true under notifier_lock

I adopted the same approach.

DRM GPU SVM:

  drm_gpusvm_notifier_invalidate()
    down_write(&gpusvm->notifier_lock);
    mmu_interval_set_seq(mni, cur_seq);
    gpusvm->ops->invalidate()
      -> xe_svm_invalidate()
        drm_gpusvm_for_each_range()
          -> __drm_gpusvm_unmap_pages()
            WRITE_ONCE(flags.has_dma_mapping = false); // clear flag
    up_write(&gpusvm->notifier_lock);

KFD batch userptr:

  amdgpu_amdkfd_evict_userptr_batch()
    mutex_lock(&process_info->notifier_lock);
    mmu_interval_set_seq(mni, cur_seq);
    discard_invalid_ranges()
      interval_tree_iter_first/next()
        range_info->valid = false; // clear flag
    mutex_unlock(&process_info->notifier_lock);

Both implementations:
- Acquire notifier_lock FIRST, before any flag changes
- Call mmu_interval_set_seq() under the lock
- Use an interval tree to find affected ranges within the wide notifier
- Mark the per-range flag as invalid/valid under the lock
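
The shared invalidation pattern above can be modelled in userspace
roughly as follows. This is only a sketch under stated assumptions:
struct user_range, struct batch_notifier and batch_invalidate() are
hypothetical names, a plain bool stands in for notifier_lock, and a
linear scan stands in for the interval tree lookup. It is not the code
from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-range tracking entry. */
struct user_range {
	uint64_t start;	/* first byte of the CPU VA range */
	uint64_t last;	/* last byte of the range (inclusive) */
	bool valid;	/* per-range flag, only changed under the lock */
};

/* Hypothetical model of one wide notifier covering many ranges. */
struct batch_notifier {
	bool locked;		/* stand-in for notifier_lock */
	unsigned long seq;	/* stand-in for mmu_interval_set_seq() state */
	struct user_range *ranges;
	size_t nr_ranges;
};

/* Mark every range overlapping [start, last] invalid; return how many. */
static size_t batch_invalidate(struct batch_notifier *n, uint64_t start,
			       uint64_t last, unsigned long cur_seq)
{
	size_t i, hit = 0;

	assert(!n->locked);
	n->locked = true;	/* 1. take the lock first */
	n->seq = cur_seq;	/* 2. publish the new sequence under it */

	/* 3. find affected ranges (linear scan instead of an interval tree) */
	for (i = 0; i < n->nr_ranges; i++) {
		struct user_range *r = &n->ranges[i];

		if (r->start <= last && start <= r->last) {
			r->valid = false;	/* 4. clear the per-range flag */
			hit++;
		}
	}

	n->locked = false;
	return hit;
}
```

Ranges that do not overlap the invalidated interval keep their flag,
even though they share the notifier (and therefore the sequence number)
with the ranges that were hit.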
The page fault path and final validation path also follow the same
pattern as DRM GPU SVM: fault outside the lock, set/check per-range
flag under the lock.
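
A minimal userspace sketch of the "fault outside the lock, set the flag
under the lock" step is shown below. All names (struct range_state,
populate_range(), racing_fault()) are hypothetical. The sequence
comparison before setting the flag follows the retry pattern from the
HMM documentation and is an assumption here; the description above only
says that the flag is set and checked under the lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state for one tracked range. */
struct range_state {
	unsigned long notifier_seq;	/* bumped by every invalidation */
	bool valid;			/* per-range flag */
	bool locked;			/* stand-in for notifier_lock */
};

typedef void (*fault_fn)(struct range_state *rs, void *ctx);

/*
 * Fault pages in without the lock, then set the flag under the lock.
 * Returns the number of attempts used, or -1 if max_tries was exceeded.
 */
static int populate_range(struct range_state *rs, fault_fn fault,
			  void *ctx, int max_tries)
{
	int tries;

	for (tries = 1; tries <= max_tries; tries++) {
		/* Sample the sequence, in the role of mmu_interval_read_begin(). */
		unsigned long seq = rs->notifier_seq;

		fault(rs, ctx);	/* slow path, lock not held */

		rs->locked = true;
		if (seq == rs->notifier_seq) {
			rs->valid = true;	/* no invalidation raced with us */
			rs->locked = false;
			return tries;
		}
		rs->locked = false;	/* raced with an invalidation, retry */
	}
	return -1;
}

/* Test helper: simulates an invalidation racing with the first *ctx faults. */
static void racing_fault(struct range_state *rs, void *ctx)
{
	int *races_left = ctx;

	if (*races_left > 0) {
		(*races_left)--;
		rs->notifier_seq++;
		rs->valid = false;
	}
}
```

The final validation before command submission then only needs to take
the lock and test the valid flag of each range in the batch.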
Regards,
Honglei
On 2026/2/6 21:56, Christian König wrote:
> On 2/6/26 07:25, Honglei Huang wrote:
>> From: Honglei Huang <honghuan@....com>
>>
>> Hi all,
>>
>> This is v3 of the patch series to support allocating multiple non-contiguous
>> CPU virtual address ranges that map to a single contiguous GPU virtual address.
>>
>> v3:
>> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>> - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>
> That is most likely not the best approach, but Felix or Philip need to comment here since I don't know such IOCTLs well either.
>
>> - When flag is set, mmap_offset field points to range array
>> - Minimal API surface change
>
> Why range of VA space for each entry?
>
>> 2. Improved MMU notifier handling:
>> - Single mmu_interval_notifier covering the VA span [va_min, va_max]
>> - Interval tree for efficient lookup of affected ranges during invalidation
>> - Avoids per-range notifier overhead mentioned in v2 review
>
> That won't work unless you also modify hmm_range_fault() to take multiple VA addresses (or ranges) at the same time.
>
> The problem is that we must rely on hmm_range.notifier_seq to detect changes to the page tables in question, but that in turn works only if you have one hmm_range structure and not multiple.
>
> What might work is doing an XOR or CRC over all hmm_range.notifier_seq you have, but that is a bit flaky.
>
> Regards,
> Christian.
>
>>
>> 3. Better code organization: Split into 8 focused patches for easier review
>>
>> v2:
>> - Each CPU VA range gets its own mmu_interval_notifier for invalidation
>> - All ranges validated together and mapped to contiguous GPU VA
>> - Single kgd_mem object with array of user_range_info structures
>> - Unified eviction/restore path for all ranges in a batch
>>
>> Current Implementation Approach
>> ===============================
>>
>> This series implements a practical solution within existing kernel constraints:
>>
>> 1. Single MMU notifier for VA span: Register one notifier covering the
>> entire range from lowest to highest address in the batch
>>
>> 2. Interval tree filtering: Use interval tree to efficiently identify
>> which specific ranges are affected during invalidation callbacks,
>> avoiding unnecessary processing for unrelated address changes
>>
>> 3. Unified eviction/restore: All ranges in a batch share eviction and
>> restore paths, maintaining consistency with existing userptr handling
>>
>> Patch Series Overview
>> =====================
>>
>> Patch 1/8: Add userptr batch allocation UAPI structures
>> - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
>> - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures
>>
>> Patch 2/8: Add user_range_info infrastructure to kgd_mem
>> - user_range_info structure for per-range tracking
>> - Fields for batch allocation in kgd_mem
>>
>> Patch 3/8: Implement interval tree for userptr ranges
>> - Interval tree for efficient range lookup during invalidation
>> - mark_invalid_ranges() function
>>
>> Patch 4/8: Add batch MMU notifier support
>> - Single notifier for entire VA span
>> - Invalidation callback using interval tree filtering
>>
>> Patch 5/8: Implement batch userptr page management
>> - get_user_pages_batch() and set_user_pages_batch()
>> - Per-range page array management
>>
>> Patch 6/8: Add batch allocation function and export API
>> - init_user_pages_batch() main initialization
>> - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
>>
>> Patch 7/8: Unify userptr cleanup and update paths
>> - Shared eviction/restore handling for batch allocations
>> - Integration with existing userptr validation flows
>>
>> Patch 8/8: Wire up batch allocation in ioctl handler
>> - Input validation and range array parsing
>> - Integration with existing alloc_memory_of_gpu path
>>
>> Testing
>> =======
>>
>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>> - Various allocation sizes (4KB to 1G+ per range)
>> - Memory pressure scenarios and eviction/restore cycles
>> - OpenCL CTS and HIP catch tests in KVM guest environment
>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>> - Small LLM inference (3B-7B models)
>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>> - Performance improvement: 2x-2.4x faster than userspace approach
>>
>> Thank you for your review and feedback.
>>
>> Best regards,
>> Honglei Huang
>>
>> Honglei Huang (8):
>> drm/amdkfd: Add userptr batch allocation UAPI structures
>> drm/amdkfd: Add user_range_info infrastructure to kgd_mem
>> drm/amdkfd: Implement interval tree for userptr ranges
>> drm/amdkfd: Add batch MMU notifier support
>> drm/amdkfd: Implement batch userptr page management
>> drm/amdkfd: Add batch allocation function and export API
>> drm/amdkfd: Unify userptr cleanup and update paths
>> drm/amdkfd: Wire up batch allocation in ioctl handler
>>
>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 23 +
>> .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 539 +++++++++++++++++-
>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 128 ++++-
>> include/uapi/linux/kfd_ioctl.h | 31 +-
>> 4 files changed, 697 insertions(+), 24 deletions(-)
>>
>