Message-ID: <7b0bf2d5-700a-4cc7-b410-a9b2e2083b5d@amd.com>
Date: Thu, 6 Feb 2025 18:53:55 +0800
From: "Huang, Honglei1" <Honglei1.Huang@....com>
To: Demi Marie Obenour <demi@...isiblethingslab.com>
Cc: Demi Marie Obenour <demiobenour@...il.com>, Huang Rui
<ray.huang@....com>, Stefano Stabellini <stefano.stabellini@....com>,
virtualization@...ts.linux-foundation.org, linux-kernel@...r.kernel.org,
David Airlie <airlied@...hat.com>, dri-devel@...ts.freedesktop.org,
Dmitry Osipenko <dmitry.osipenko@...labora.com>,
Gerd Hoffmann <kraxel@...hat.com>,
Gurchetan Singh <gurchetansingh@...omium.org>, Chia-I Wu
<olvaffe@...il.com>, Akihiko Odaki <akihiko.odaki@...nix.com>,
Lingshan Zhu <Lingshan.Zhu@....com>,
Xen developer discussion <xen-devel@...ts.xenproject.org>,
Kernel KVM virtualization development <kvm@...r.kernel.org>,
Xenia Ragiadakou <burzalodowa@...il.com>,
Marek Marczykowski-Górecki <marmarek@...isiblethingslab.com>
Subject: Re: [RFC PATCH 3/3] drm/virtio: implement blob userptr resource
object
On 2025/1/31 8:33, Demi Marie Obenour wrote:
> On Wed, Jan 29, 2025 at 03:54:59PM -0500, Demi Marie Obenour wrote:
>> On 1/8/25 12:05 PM, Simona Vetter wrote:
>>> On Fri, Dec 27, 2024 at 10:24:29AM +0800, Huang, Honglei1 wrote:
>>>>
>>>> On 2024/12/22 9:59, Demi Marie Obenour wrote:
>>>>> On 12/20/24 10:35 AM, Simona Vetter wrote:
>>>>>> On Fri, Dec 20, 2024 at 06:04:09PM +0800, Honglei Huang wrote:
>>>>>>> From: Honglei Huang <Honglei1.Huang@....com>
>>>>>>>
>>>>>>> A virtio-gpu userptr is based on an HMM notifier.
>>>>>>> It is used to let the host access guest userspace
>>>>>>> memory and to notice changes to that memory.
>>>>>>> This patch series is in a very early state:
>>>>>>> userspace pages are currently pinned to ensure that
>>>>>>> host device memory operations are correct.
>>>>>>> The free and unmap operations for userspace can be
>>>>>>> handled by the MMU notifier; this is a simple and
>>>>>>> basic SVM feature for this patch series.
>>>>>>> The physical PFN update operation is split into
>>>>>>> two ops here. Evicted memory is not used anymore
>>>>>>> but is remapped into the host again to achieve the
>>>>>>> same effect as hmm_range_fault.
>>>>>>
>>>>>> So in my opinion there are two ways to implement userptr that make sense:
>>>>>>
>>>>>> - pinned userptr with pin_user_pages(FOLL_LONGTERM). There is no MMU
>>>>>>   notifier.
>>>>>>
>>>>>> - unpinned userptr where you rely entirely on the MMU notifier and do
>>>>>>   not hold any page references or page pins at all, for full SVM
>>>>>>   integration. This should ideally use hmm_range_fault, since that is
>>>>>>   the version that never grabs any page reference pins.
>>>>>>
>>>>>> All the in-between variants are imo really bad hacks, whether they hold a
>>>>>> page reference or a temporary page pin (which seems to be what you're
>>>>>> doing here). In much older kernels there was some justification for them,
>>>>>> because strange stuff happened over fork(), but with FOLL_LONGTERM this is
>>>>>> now all sorted out. So there's really only fully pinned or true SVM
>>>>>> left as clean design choices imo.
>>>>>>
>>>>>> With that background, why does pin_user_pages(FOLL_LONGTERM) not work for
>>>>>> you?
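For concreteness, the fully pinned variant described above would look
roughly like the minimal sketch below. The struct and helper names are
hypothetical, not the actual patch code, and error handling is trimmed:

  /* Minimal sketch: pin the whole userptr range with FOLL_LONGTERM so
   * no MMU notifier is needed for correctness of device access. */
  #include <linux/mm.h>
  #include <linux/slab.h>

  struct virtio_gpu_userptr {            /* hypothetical */
          unsigned long start;           /* page-aligned user VA */
          unsigned long npages;
          struct page **pages;
  };

  static int virtio_gpu_userptr_pin(struct virtio_gpu_userptr *uptr)
  {
          int pinned;

          uptr->pages = kvmalloc_array(uptr->npages, sizeof(*uptr->pages),
                                       GFP_KERNEL);
          if (!uptr->pages)
                  return -ENOMEM;

          pinned = pin_user_pages_fast(uptr->start, uptr->npages,
                                       FOLL_WRITE | FOLL_LONGTERM,
                                       uptr->pages);
          if (pinned != uptr->npages) {
                  /* Partial pin: release what we got and bail out. */
                  if (pinned > 0)
                          unpin_user_pages(uptr->pages, pinned);
                  kvfree(uptr->pages);
                  return -EFAULT;
          }
          return 0;
  }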
>>>>>
>>>>> +1 on using FOLL_LONGTERM. Fully dynamic memory management has a huge cost
>>>>> in complexity that pinning everything avoids. Furthermore, this avoids the
>>>>> host having to take action in response to guest memory reclaim requests.
>>>>> This avoids additional complexity (and thus attack surface) on the host side.
>>>>> Furthermore, since this is for ROCm and not for graphics, I am less concerned
>>>>> about supporting systems that require swappable GPU VRAM.
>>>>
>>>> Hi Sima and Demi,
>>>>
>>>> I totally agree that the FOLL_LONGTERM flag is needed; I will add it
>>>> in the next version.
>>>>
>>>> And for the first, pinned variant of the implementation, I think the
>>>> MMU notifier is still needed, because the userptr feature is generally
>>>> used like this in the UMD: registering a userptr is always explicitly
>>>> invoked by user code, e.g. "registerMemoryToGPU(userptrAddr, ...)",
>>>> but there is no explicit API for releasing/freeing the userptr, at
>>>> least in the hsakmt/KFD stack. The user just calls free(userptrAddr),
>>>> and the kernel driver then releases the userptr from its MMU notifier
>>>> callback. Virtio-GPU has no way to know that the user has freed the
>>>> userptr other than the MMU notifier, and in the UMD there is no way to
>>>> detect that free() was invoked by the user. As far as I can see, the
>>>> only way is to use an MMU notifier in the virtio-GPU driver and free
>>>> the corresponding data on the host via some virtio commands.
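To make the free() case concrete: the guest driver can watch the range
with an interval notifier and tear the host mapping down from the
invalidate callback. A rough sketch, with all names hypothetical
(virtio_gpu_cmd_userptr_invalidate is made up here, it stands in for
whatever virtio command ends up doing this):

  #include <linux/mmu_notifier.h>

  struct virtio_gpu_userptr {
          struct mmu_interval_notifier notifier;
          /* pages, host resource handle, ... */
  };

  /* Hypothetical virtio command telling the host to drop its mapping. */
  void virtio_gpu_cmd_userptr_invalidate(struct virtio_gpu_userptr *uptr);

  static bool virtio_gpu_userptr_invalidate(struct mmu_interval_notifier *mni,
                                            const struct mmu_notifier_range *range,
                                            unsigned long cur_seq)
  {
          struct virtio_gpu_userptr *uptr =
                  container_of(mni, struct virtio_gpu_userptr, notifier);

          mmu_interval_set_seq(mni, cur_seq);
          /* free()/munmap() in the guest ends up here; notify the host
           * before the pages go back to the guest kernel. */
          virtio_gpu_cmd_userptr_invalidate(uptr);
          return true;
  }

  static const struct mmu_interval_notifier_ops virtio_gpu_userptr_notifier_ops = {
          .invalidate = virtio_gpu_userptr_invalidate,
  };

  static int virtio_gpu_userptr_register(struct virtio_gpu_userptr *uptr,
                                         unsigned long start,
                                         unsigned long length)
  {
          return mmu_interval_notifier_insert(&uptr->notifier, current->mm,
                                              start, length,
                                              &virtio_gpu_userptr_notifier_ops);
  }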
>>>>
>>>> And for the second way, using hmm_range_fault, there is a predictable
>>>> issue, at least in the hsakmt/KFD stack: memory may migrate while the
>>>> GPU/device is working. On bare metal, when memory is migrating, the
>>>> KFD driver pauses the device's compute work under mmap_write_lock,
>>>> uses hmm_range_fault to remap the migrated/evicted memory to the GPU,
>>>> and then resumes the device's compute work, which ensures the data
>>>> stays correct. But with the virtio-GPU driver the migration happens in
>>>> the guest kernel, and the evict MMU notifier callback happens in the
>>>> guest. A virtio command can be used to notify the host, but lacking
>>>> mmap_write_lock protection in the host kernel, the host will hold
>>>> invalid data for a short period of time, which may lead to issues.
>>>> And that is hard to fix, as far as I can see.
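For reference, the bare-metal pattern described above is the usual
hmm_range_fault retry loop, where the notifier sequence count is what
keeps the remap from racing with a concurrent invalidation. Roughly, as
a sketch only, reusing the hypothetical struct from the earlier sketch:

  #include <linux/hmm.h>
  #include <linux/mm.h>
  #include <linux/mmu_notifier.h>

  static int virtio_gpu_userptr_remap(struct virtio_gpu_userptr *uptr,
                                      unsigned long start,
                                      unsigned long *pfns,
                                      unsigned long npfns)
  {
          struct hmm_range range = {
                  .notifier      = &uptr->notifier,
                  .start         = start,
                  .end           = start + npfns * PAGE_SIZE,
                  .hmm_pfns      = pfns,
                  .default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
          };
          int ret;

  again:
          range.notifier_seq = mmu_interval_read_begin(range.notifier);
          mmap_read_lock(current->mm);
          ret = hmm_range_fault(&range);
          mmap_read_unlock(current->mm);
          if (ret) {
                  if (ret == -EBUSY)
                          goto again;
                  return ret;
          }
          /* An invalidation may have raced with the fault; if so, the
           * pfns are stale and the walk must be redone. The device
           * pagetable update must be serialized against the invalidate
           * callback by a driver lock (not shown). */
          if (mmu_interval_read_retry(range.notifier, range.notifier_seq))
                  goto again;

          /* pfns[] is now consistent; program the device mapping. */
          return 0;
  }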
>>>>
>>>> I will extract some APIs into helpers as you requested, and I will
>>>> refactor the whole userptr implementation to use callbacks in the
>>>> page-getting path, so that either the pin method or hmm_range_fault
>>>> can be chosen in this patch series (see the sketch below).
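One way to structure that refactor is an ops table on the page-getting
path, roughly like this (all names hypothetical, declarations only):

  struct virtio_gpu_userptr;

  struct userptr_page_ops {
          int  (*get_pages)(struct virtio_gpu_userptr *uptr);
          void (*put_pages)(struct virtio_gpu_userptr *uptr);
  };

  /* One implementation per strategy. */
  int  userptr_pin_get(struct virtio_gpu_userptr *uptr);  /* FOLL_LONGTERM pin */
  void userptr_pin_put(struct virtio_gpu_userptr *uptr);  /* unpin_user_pages() */
  int  userptr_hmm_get(struct virtio_gpu_userptr *uptr);  /* hmm_range_fault loop */
  void userptr_hmm_put(struct virtio_gpu_userptr *uptr);  /* nothing to unpin */

  static const struct userptr_page_ops pin_ops = {
          .get_pages = userptr_pin_get,
          .put_pages = userptr_pin_put,
  };

  static const struct userptr_page_ops hmm_ops = {
          .get_pages = userptr_hmm_get,
          .put_pages = userptr_hmm_put,
  };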
>>>
>>> Ok, so if this is for svm, then you need full blast hmm, or the semantics
>>> are buggy. You cannot fake svm with pin(FOLL_LONGTERM) userptr, this does
>>> not work.
>>>
>>> The other option is that hsakmt/kfd api is completely busted, and that's
>>> kinda not a kernel problem.
>>> -Sima
>>
>> On further thought, I believe the driver needs to migrate the pages to
>> device memory (really a virtio-GPU blob object) *and* take a FOLL_LONGTERM
>> pin on them. The reason is that it isn’t possible to migrate these pages
>> back to "host" memory without unmapping them from the GPU. For the reasons
>> I mention in [1], I believe that temporarily revoking access to virtio-GPU
>> blob objects is not feasible. Instead, the pages must be treated as if
>> they are permanently in device memory until guest userspace unmaps them
>> from the GPU, after which they must be migrated back to host memory.
>
> Discussion on IRC indicates that migration isn't reliable. This is
> because Linux core memory management is largely lock-free for
> performance reasons, so there is no way to prevent temporary elevation
> of a page's reference count. A page with an elevated reference count
> cannot be migrated.
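To illustrate the failure mode: in the migrate_vma API, a page whose
reference count is transiently elevated simply never gets
MIGRATE_PFN_MIGRATE set after setup, and there is nothing to wait on.
A sketch, with device-page allocation elided:

  #include <linux/migrate.h>

  static void try_migrate_range(struct vm_area_struct *vma,
                                unsigned long start, unsigned long end,
                                unsigned long *src, unsigned long *dst)
  {
          struct migrate_vma args = {
                  .vma   = vma,
                  .start = start,
                  .end   = end,
                  .src   = src,
                  .dst   = dst,
                  .flags = MIGRATE_VMA_SELECT_SYSTEM,
          };
          unsigned long i, npages = (end - start) >> PAGE_SHIFT;

          if (migrate_vma_setup(&args))
                  return;

          for (i = 0; i < npages; i++) {
                  /* A raised refcount clears this bit: the page stays
                   * where it is, and retrying is the caller's problem. */
                  if (!(args.src[i] & MIGRATE_PFN_MIGRATE))
                          continue;
                  /* ... allocate a destination page and fill dst[i] ... */
          }

          migrate_vma_pages(&args);
          migrate_vma_finalize(&args);
  }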
>
> The only alternative I can think of is for the hypervisor to perform the
> migration. The hypervisor can revoke the guest's access to the page
> without the guest's consent or involvement. The host can then replace
> the page with one of its own pages, which might be on the CPU or GPU.
> Further migration between the CPU and GPU is controlled by the host
> kernel-mode driver (KMD) and host kernel memory management. The guest
> kernel driver must take a FOLL_LONGTERM pin before telling the host to
> use the pages, but that is all.
>
> On KVM, this should be essentially automatic, as guest memory really is
> just host userspace memory. On Xen, this requires that the backend
> domain can revoke frontend access to _any_ frontend page, or at least
> frontend pages that have been granted to the backend. The backend will
> then need to be able to handle page faults for the frontend pages, and
> replace the frontend pages with its own pages at will. Furthermore,
> revoking pages that the backend has installed into the frontend must
> never fail, because the backend will panic if it does fail.
>
> Sima, is putting guest pages under host kernel control the only option?
> I thought that this could be avoided by leaving the pages on the CPU if
> migration fails, but that won't work because there will be no way to
> migrate them to the GPU later, causing performance problems that would
> be impossible to debug. Is waiting (possibly forever) on migration to
> finish an option? Otherwise, this might mean extra complexity in the
> Xen hypervisor, as I do not believe the primitives needed are currently
> available. Specifically, in addition to the primitives discussed at Xen
> Project Summit 2024, the backend also needs to intercept access to, and
> replace the contents of, arbitrary frontend-controlled pages.
Hi Demi,
I agree that achieving the complete SVM feature in virtio-GPU requires
the hypervisor to be deeply involved and new features to be added. It
needs a solid design. I saw the detailed reply in another thread; it is
very helpful, and I am looking forward to the response from the
Xen/hypervisor experts.
So the current virtio-GPU userptr implementation cannot support the full
SVM feature; it can only let the GPU access userspace memory, so perhaps
it should just be called a userptr feature. I think I will finish this
small part first and then try to complete the whole SVM feature.
Regards,
Honglei