Message-ID: <dfb0452b-1ad2-1ced-dcf8-a213b15a8e00@linux.intel.com>
Date:   Wed, 20 Sep 2023 10:29:48 +0200
From:   Thomas Hellström 
        <thomas.hellstrom@...ux.intel.com>
To:     Christian König <christian.koenig@....com>,
        Danilo Krummrich <dakr@...hat.com>
Cc:     airlied@...il.com, daniel@...ll.ch, matthew.brost@...el.com,
        sarah.walker@...tec.com, donald.robson@...tec.com,
        boris.brezillon@...labora.com, faith.ekstrand@...labora.com,
        dri-devel@...ts.freedesktop.org, nouveau@...ts.freedesktop.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH drm-misc-next v3 6/7] drm/gpuvm: generalize
 dma_resv/extobj handling and GEM validation


On 9/20/23 09:44, Thomas Hellström wrote:
> Hi,
>
> On 9/20/23 07:37, Christian König wrote:
>> Am 19.09.23 um 17:23 schrieb Thomas Hellström:
>>>
>>> On 9/19/23 17:16, Danilo Krummrich wrote:
>>>> On 9/19/23 14:21, Thomas Hellström wrote:
>>>>> Hi Christian
>>>>>
>>>>> On 9/19/23 14:07, Christian König wrote:
>>>>>> Am 13.09.23 um 17:46 schrieb Danilo Krummrich:
>>>>>>> On 9/13/23 17:33, Christian König wrote:
>>>>>>>> Am 13.09.23 um 17:15 schrieb Danilo Krummrich:
>>>>>>>>> On 9/13/23 16:26, Christian König wrote:
>>>>>>>>>> Am 13.09.23 um 14:16 schrieb Danilo Krummrich:
>>>>>>>>>>> As mentioned in a different mail thread, the reply is based 
>>>>>>>>>>> on the assumption
>>>>>>>>>>> that we don't support anything other than GPUVM updates
>>>>>>>>>>> from the IOCTL.
>>>>>>>>>>
>>>>>>>>>> I think that this assumption is incorrect.
>>>>>>>>>
>>>>>>>>> Well, more precisely I should have said "don't support GPUVM
>>>>>>>>> updates from within fence signaling critical sections". And
>>>>>>>>> looking at the code, that doesn't seem to be what you're
>>>>>>>>> doing there.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Vulkan is just one specific use case, but this here should
>>>>>>>>>> probably be able to handle other use cases as well.
>>>>>>>>>>
>>>>>>>>>> Especially with HMM you get the requirement that you need to 
>>>>>>>>>> be able to invalidate GPUVM mappings without grabbing a 
>>>>>>>>>> reservation lock.
>>>>>>>>>
>>>>>>>>> What do you mean by "invalidate GPUVM mappings" in this
>>>>>>>>> context? drm_gpuvm_bo_evict() should only be called from a
>>>>>>>>> ttm_device_funcs::move callback; we should hold the dma-resv
>>>>>>>>> lock there.
>>>>>>>>
>>>>>>>> Well, the question is which dma-resv lock do we hold?
>>>>>>>>
>>>>>>>> In the move callback we only hold the dma-resv lock of the BO 
>>>>>>>> which is moved, but when that is a shared BO then that's not 
>>>>>>>> the same as the one for the VM.
>>>>>>>
>>>>>>> Correct, Thomas' idea was to use the GEM's dma_resv lock to 
>>>>>>> protect drm_gpuvm_bo::evicted
>>>>>>> and then actually move the drm_gpuvm_bo to the VM's evicted list 
>>>>>>> once we grabbed all
>>>>>>> dma-resv locks when locking the VM's BOs using drm_exec. We can 
>>>>>>> remove them from the evicted
>>>>>>> list on validate(). This way we never touch the evicted list 
>>>>>>> without holding at least the VM's
>>>>>>> dma-resv lock.
>>>>>>>
>>>>>>> Do you have any concerns about that?
>>>>>>
>>>>>> Scratching my head a bit how that is supposed to work.
>>>>>>
>>>>>> This implies that you go over all the evicted BOs during 
>>>>>> validation and not just the one mentioned in the CS.
>>>>>>
>>>>>> That might work for Vulkan, but is pretty much a no-go for OpenGL.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> See what the eviction lock in amdgpu is doing for example.
>>>>>>>>>
>>>>>>>>> The eviction_lock seems to protect a VM state "evicting",
>>>>>>>>> i.e. whether any BO associated with the VM is currently being
>>>>>>>>> evicted. At the same time amdgpu protects the evicted list of
>>>>>>>>> the VM with a different lock. So this seems to be entirely
>>>>>>>>> unrelated. Tracking a "currently evicting" state is not part
>>>>>>>>> of the GPUVM implementation currently and hence nothing would
>>>>>>>>> change for amdgpu there.
>>>>>>>>
>>>>>>>> Sorry for the confusion; we use different terminology in amdgpu.
>>>>>>>>
>>>>>>>> The eviction lock and evicted state are for the VM page tables, 
>>>>>>>> e.g. if the whole VM is currently not used and swapped out or 
>>>>>>>> even de-allocated.
>>>>>>>>
>>>>>>>> This is necessary because we have cases where we need to access 
>>>>>>>> the VM data without holding the dma-resv lock of this VM. 
>>>>>>>> Especially figuring out which parts of an address space contain 
>>>>>>>> mappings and which don't.
>>>>>>>
>>>>>>> I think this is fine; this has nothing to do with lists of 
>>>>>>> evicted GEM objects or external GEM
>>>>>>> objects, right? Marking mappings (drm_gpuva) as invalidated 
>>>>>>> (DRM_GPUVA_INVALIDATED) or accessing
>>>>>>> the VA space does not require any dma-resv locks.
>>>>>>
>>>>>> I hope so, but I'm not 100% sure.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> This is a requirement which comes with HMM handling; you won't
>>>>>>>> see this with Vulkan (or OpenGL, VAAPI, etc.).
>>>>>>>>
>>>>>>>>
>>>>>>>> The invalidation lock, on the other hand, is what in this
>>>>>>>> discussion is called the eviction lock. It is needed because
>>>>>>>> of what I wrote above: during the move callback only the
>>>>>>>> dma-resv of the BO which is moved is locked, but not
>>>>>>>> necessarily the dma-resv of the VM.
>>>>>>>
>>>>>>> That's yet another thing, right? This is used to track whether 
>>>>>>> *any* BO that belongs to the VM is
>>>>>>> currently being evicted, correct? As mentioned, as of now this
>>>>>>> is not supported in GPUVM and hence it would be the same
>>>>>>> driver-specific code with the same driver-specific lock.
>>>>>>
>>>>>> That is most likely a show stopper for using this with OpenGL
>>>>>> based workloads as far as I can see. For those you need to be
>>>>>> able to figure out which non-VM BOs have been evicted and which
>>>>>> parts of the VM need updates.
>>>>>
>>>>> We identify those with a bool in the gpuvm_bo, and that bool is 
>>>>> protected by the bo_resv. In essence, the "evicted" list must be 
>>>>> made up to date with all relevant locks held before it is
>>>>> traversed in the next exec.
>>>>
>>>> What I still miss with this idea is: how do we find all the
>>>> drm_gpuvm_bo structures with the evicted bool set to true? When
>>>> doing the drm_exec dance we come across all external ones and can
>>>> add them to the list if needed, but what about the BOs sharing the
>>>> VM's dma-resv?
>>>
>>> Oh, they can be added to the evict list directly (no bool needed)
>>> in the eviction code, like in v3, since for those we indeed hold
>>> the VM's dma_resv, as it's aliased with the object's dma-resv.
>>
>> Yeah, I wanted to note what Danilo seems to be thinking about as
>> well: how do we figure out which non-VM BOs were evicted?
>>
>> We can't walk over the list of all non-VM BOs on every submission, 
>> that's too much overhead for cases with lots of non-VM BOs.
>>
>> And we can't rely on userspace sending all non-VM BOs down to the
>> kernel as a "used" list with each submission.
>>
>> Regards,
>> Christian.
>
> No, that's not needed; see the mechanism below.
>
> 1) We maintain an evicted list. Typically protected by the vm resv.
> 2) Each gpuvm_bo has a bool "evicted". Protected by the bo resv.
>
> a) Evicting a vm bo: The vm resv is held by the eviction code. Just 
> put it on the evicted list.
> b) Evicting a shared/external bo: The bo resv is held by the eviction
> code. Set the "evicted" bool.
> c) Validating the evicted list on exec: Loop through all
> *external/shared* bos. Lock them. After locking, check the "evicted"
> bool; if it's true, put the bo on the evicted list (we hold the VM
> resv at this point) and clear the "evicted" bool. Note that other vms
> will have their own gpuvm_bo which is marked evicted.
>
> I have this coded up in a patch for Xe and it seems to be working 
> properly.
>
> /Thomas
>
Something along the lines of the attached patch.
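
For illustration, here is a minimal sketch of the mechanism quoted above
(items 1-2 and a-c). This is not the attached patch and not the actual
drm_gpuvm API; the struct layout and all identifiers (sketch_gpuvm,
sketch_gpuvm_bo, the extobjs/evicted lists, the helper names) are
hypothetical simplifications, and the drm_exec locking around the exec
path is assumed to happen elsewhere:

/*
 * Illustrative sketch only: hypothetical, simplified structures and
 * helpers, not the real drm_gpuvm implementation. List heads are
 * assumed to be initialized when the gpuvm_bo is created.
 */
#include <linux/list.h>
#include <linux/dma-resv.h>
#include <drm/drm_gem.h>

struct sketch_gpuvm {
	struct dma_resv *resv;		/* the VM's dma-resv */
	struct list_head evicted;	/* protected by @resv */
	struct list_head extobjs;	/* external/shared BOs of this VM */
};

struct sketch_gpuvm_bo {
	struct sketch_gpuvm *vm;
	struct drm_gem_object *obj;
	struct list_head extobj_link;	/* entry in vm->extobjs */
	struct list_head evict_link;	/* entry in vm->evicted */
	bool evicted;			/* protected by obj->resv */
};

/*
 * a) + b): called from the eviction path (e.g. the ttm move callback)
 * with the BO's dma-resv held.
 */
static void sketch_gpuvm_bo_evict(struct sketch_gpuvm_bo *vm_bo)
{
	dma_resv_assert_held(vm_bo->obj->resv);

	if (vm_bo->obj->resv == vm_bo->vm->resv) {
		/* VM-private BO: its resv is aliased with the VM's, so
		 * the evicted list can be updated directly. */
		list_move_tail(&vm_bo->evict_link, &vm_bo->vm->evicted);
	} else {
		/* Shared/external BO: only the BO's resv is held, so
		 * just set the flag; the list is updated at exec time. */
		vm_bo->evicted = true;
	}
}

/*
 * c): called at exec time after the VM resv and all external BO resvs
 * have been locked (e.g. via drm_exec). Moves flagged externals onto
 * the evicted list and clears the flag, so validation only has to
 * walk vm->evicted.
 */
static void sketch_gpuvm_collect_evicted(struct sketch_gpuvm *vm)
{
	struct sketch_gpuvm_bo *vm_bo;

	dma_resv_assert_held(vm->resv);

	list_for_each_entry(vm_bo, &vm->extobjs, extobj_link) {
		dma_resv_assert_held(vm_bo->obj->resv);
		if (vm_bo->evicted) {
			list_move_tail(&vm_bo->evict_link, &vm->evicted);
			vm_bo->evicted = false;
		}
	}
}

In this sketch each VM only walks its own external-object list, under
resv locks the exec has to take anyway; no global list of all non-VM
BOs is traversed on submission, which is how the scheme sidesteps the
overhead concern raised above.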


[Attachment: "0001-drm-gpuvm-Adjustment-for-extobj-eviction.patch" (text/x-patch, 3065 bytes)]
