Message-ID: <55a55ed5-0c67-a26f-df5f-18d3b2be278e@redhat.com>
Date: Thu, 21 Sep 2023 16:38:10 +0200
From: Danilo Krummrich <dakr@...hat.com>
To: Boris Brezillon <boris.brezillon@...labora.com>
Cc: Christian König <christian.koenig@....com>,
airlied@...il.com, daniel@...ll.ch, matthew.brost@...el.com,
thomas.hellstrom@...ux.intel.com, sarah.walker@...tec.com,
donald.robson@...tec.com, faith.ekstrand@...labora.com,
dri-devel@...ts.freedesktop.org, nouveau@...ts.freedesktop.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH drm-misc-next v4 4/8] drm/gpuvm: add common dma-resv per
struct drm_gpuvm
On 9/21/23 16:25, Boris Brezillon wrote:
> On Thu, 21 Sep 2023 15:34:44 +0200
> Danilo Krummrich <dakr@...hat.com> wrote:
>
>> On 9/21/23 09:39, Christian König wrote:
>>> Am 20.09.23 um 16:42 schrieb Danilo Krummrich:
>>>> Provide a common dma-resv for GEM objects not being used outside of this
>>>> GPU-VM. This is used in a subsequent patch to generalize dma-resv,
>>>> external and evicted object handling and GEM validation.
>>>>
>>>> Signed-off-by: Danilo Krummrich <dakr@...hat.com>
>>>> ---
>>>> drivers/gpu/drm/drm_gpuvm.c | 9 +++++++--
>>>> drivers/gpu/drm/nouveau/nouveau_uvmm.c | 2 +-
>>>> include/drm/drm_gpuvm.h | 17 ++++++++++++++++-
>>>> 3 files changed, 24 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/drm_gpuvm.c b/drivers/gpu/drm/drm_gpuvm.c
>>>> index bfea4a8a19ec..cbf4b738a16c 100644
>>>> --- a/drivers/gpu/drm/drm_gpuvm.c
>>>> +++ b/drivers/gpu/drm/drm_gpuvm.c
>>>> @@ -655,6 +655,7 @@ drm_gpuva_range_valid(struct drm_gpuvm *gpuvm,
>>>> /**
>>>> * drm_gpuvm_init() - initialize a &drm_gpuvm
>>>> * @gpuvm: pointer to the &drm_gpuvm to initialize
>>>> + * @drm: the drivers &drm_device
>>>> * @name: the name of the GPU VA space
>>>> * @start_offset: the start offset of the GPU VA space
>>>> * @range: the size of the GPU VA space
>>>> @@ -668,7 +669,7 @@ drm_gpuva_range_valid(struct drm_gpuvm *gpuvm,
>>>> * &name is expected to be managed by the surrounding driver structures.
>>>> */
>>>> void
>>>> -drm_gpuvm_init(struct drm_gpuvm *gpuvm,
>>>> +drm_gpuvm_init(struct drm_gpuvm *gpuvm, struct drm_device *drm,
>>>> const char *name,
>>>> u64 start_offset, u64 range,
>>>> u64 reserve_offset, u64 reserve_range,
>>>> @@ -694,6 +695,8 @@ drm_gpuvm_init(struct drm_gpuvm *gpuvm,
>>>> reserve_range)))
>>>> __drm_gpuva_insert(gpuvm, &gpuvm->kernel_alloc_node);
>>>> }
>>>> +
>>>> + drm_gem_private_object_init(drm, &gpuvm->d_obj, 0);
>>>> }
>>>> EXPORT_SYMBOL_GPL(drm_gpuvm_init);
>>>> @@ -713,7 +716,9 @@ drm_gpuvm_destroy(struct drm_gpuvm *gpuvm)
>>>> __drm_gpuva_remove(&gpuvm->kernel_alloc_node);
>>>> WARN(!RB_EMPTY_ROOT(&gpuvm->rb.tree.rb_root),
>>>> - "GPUVA tree is not empty, potentially leaking memory.");
>>>> + "GPUVA tree is not empty, potentially leaking memory.\n");
>>>> +
>>>> + drm_gem_private_object_fini(&gpuvm->d_obj);
>>>> }
>>>> EXPORT_SYMBOL_GPL(drm_gpuvm_destroy);
>>>> diff --git a/drivers/gpu/drm/nouveau/nouveau_uvmm.c b/drivers/gpu/drm/nouveau/nouveau_uvmm.c
>>>> index 6c86b64273c3..a80ac8767843 100644
>>>> --- a/drivers/gpu/drm/nouveau/nouveau_uvmm.c
>>>> +++ b/drivers/gpu/drm/nouveau/nouveau_uvmm.c
>>>> @@ -1836,7 +1836,7 @@ nouveau_uvmm_init(struct nouveau_uvmm *uvmm, struct nouveau_cli *cli,
>>>> uvmm->kernel_managed_addr = kernel_managed_addr;
>>>> uvmm->kernel_managed_size = kernel_managed_size;
>>>> - drm_gpuvm_init(&uvmm->base, cli->name,
>>>> + drm_gpuvm_init(&uvmm->base, cli->drm->dev, cli->name,
>>>> NOUVEAU_VA_SPACE_START,
>>>> NOUVEAU_VA_SPACE_END,
>>>> kernel_managed_addr, kernel_managed_size,
>>>> diff --git a/include/drm/drm_gpuvm.h b/include/drm/drm_gpuvm.h
>>>> index 0e802676e0a9..6666c07d7c3e 100644
>>>> --- a/include/drm/drm_gpuvm.h
>>>> +++ b/include/drm/drm_gpuvm.h
>>>> @@ -240,14 +240,29 @@ struct drm_gpuvm {
>>>> * @ops: &drm_gpuvm_ops providing the split/merge steps to drivers
>>>> */
>>>> const struct drm_gpuvm_ops *ops;
>>>> +
>>>> + /**
>>>> + * @d_obj: Dummy GEM object; used internally to pass the GPU VMs
>>>> + * dma-resv to &drm_exec. Provides the GPUVM's &dma-resv.
>>>> + */
>>>> + struct drm_gem_object d_obj;
>>>
>>> Yeah, as pointed out in the other mail, that won't work like this.
>>
>> Which one? Seems that I missed it.
>>
>>>
>>> The GPUVM contains GEM objects and therefore should probably have a reference to those objects.
>>>
>>> When those GEM objects now use the dma-resv object embedded inside the GPUVM then they also need a reference to the GPUVM to make sure the dma-resv object won't be freed before they are freed.
>>
>> My assumption here is that GEM objects local to a certain VM never out-live the VM. We never share them with anyone, otherwise they would be external and hence wouldn't carry the VM's dma-resv. The only references I see are from the VM itself (which is fine) and from userspace. The latter isn't a problem as long as all GEM handles are closed before the VM is destroyed on FD close.
>
> But we don't want to rely on userspace doing the right thing (calling
> GEM_CLOSE before releasing the VM), do we?
I assume VMs are typically released in postclose(), and drm_gem_release() is
called before that. But yeah, I guess there are indeed other issues.
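
For reference, the ordering I'm relying on would look roughly like this on
the driver side (a sketch only; my_vm_put(), my_file_priv etc. are made-up
names, and I'm going from memory of drm_file_free() running drm_gem_release()
before postclose()):

static void my_driver_postclose(struct drm_device *dev, struct drm_file *file)
{
	struct my_file_priv *fpriv = file->driver_priv;

	/* DRM core already ran drm_gem_release() at this point, hence all
	 * GEM handles of this file are closed and VM-local BOs should only
	 * be kept alive by the VM's own mappings.
	 */
	my_vm_put(fpriv->vm);
	kfree(fpriv);
}
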
>
> BTW, even though my private BOs have a ref to their exclusive VM, I just
> ran into a bug because drm_gem_shmem_free() acquires the resv lock
> (which is questionable, but that's not the topic :-)) and
> I was calling vm_put(bo->exclusive_vm) before drm_gem_shmem_free(),
> leading to a use-after-free when the gem->resv is acquired. This has
> nothing to do with drm_gpuvm, but it proves that this sort of bug is
> likely to happen if we don't pay attention.
>
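Right, that's a nasty ordering trap. Just to spell out the pattern (rough
sketch, made-up names like my_bo / my_vm_put(), assuming the BO embeds a
drm_gem_shmem_object as its first member; not actual code from this series):

static void my_bo_free(struct drm_gem_object *obj)
{
	struct my_bo *bo = to_my_bo(obj);
	struct my_vm *vm = bo->exclusive_vm;

	/* drm_gem_shmem_free() still locks obj->resv, which for a VM-local
	 * BO points at the VM's dma-resv, and it also frees the BO itself.
	 * Hence the VM reference may only be dropped afterwards, through a
	 * local copy of the pointer.
	 */
	drm_gem_shmem_free(&bo->base);
	my_vm_put(vm);
}

Dropping the VM reference before the drm_gem_shmem_free() call is exactly the
use-after-free you describe.
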
>>
>> Am I missing something? Do we have use cases where this isn't true?
>
> The other case I can think of is GEM being v[un]map-ed (kernel
> mapping) after the VM was released.
>
>>
>>>
>>> This is a circular reference dependency.
>
> FWIW, I solved that by having a vm_destroy() function that kills all the
> mappings in a VM, which in turn releases all the refs the VM had on
> private BOs. Then, it's just a matter of waiting for all private GEMs
> to be destroyed before the final step of the VM destruction, which is
> really just about releasing resources (it's called panthor_vm_release()
> in my case) and runs when the VM refcount drops to zero.
>
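That two-stage teardown sounds reasonable. If I understand it correctly, it
boils down to something like this (hypothetical names, just to make sure I
got the idea right):

/* Explicit teardown: unmap everything, which drops the VM's references on
 * its private BOs. The VM structure itself stays around as long as private
 * BOs still point at its dma-resv.
 */
void my_vm_destroy(struct my_vm *vm)
{
	my_vm_unmap_all(vm);
}

/* Final release, run once the last reference - including the ones held by
 * not-yet-freed private BOs - is gone.
 */
static void my_vm_release(struct kref *kref)
{
	struct my_vm *vm = container_of(kref, struct my_vm, refcount);

	drm_gpuvm_destroy(&vm->base);
	kfree(vm);
}

void my_vm_put(struct my_vm *vm)
{
	if (vm)
		kref_put(&vm->refcount, my_vm_release);
}
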
>>>
>>> The simplest solution I can see is to let the driver provide the GEM object to use. Amdgpu uses the root page directory object for this.
>>
>> Sure, we can do that, if we see cases where VM local GEM objects can out-live the VM.
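If we go that route, I'd picture the interface roughly like this (purely a
hypothetical signature, not what this series does), with the driver passing
in the GEM object whose dma-resv the GPUVM should use, e.g. its root page
table BO:

void
drm_gpuvm_init(struct drm_gpuvm *gpuvm, struct drm_device *drm,
	       struct drm_gem_object *r_obj,
	       const char *name,
	       u64 start_offset, u64 range,
	       u64 reserve_offset, u64 reserve_range,
	       const struct drm_gpuvm_ops *ops);

The GPUVM would then reference r_obj and use r_obj->resv instead of embedding
a dummy GEM object.
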
>>>
>>> Apart from that I strongly think that we shouldn't let the GPUVM code create a driver GEM object. We did that in TTM for the ghost objects and it turned out to be a bad idea.
>
> Would that really solve the circular ref issue? I mean, if you're
> taking the root page dir object as your VM resv, you still have to make
> sure it outlives the private GEMs, which means you either need
> to take a ref on the object, leading to the same circular ref mess, or
> you need to reset the private GEMs' resvs before destroying this root page
> dir GEM (whose lifecycle is likely the same as your VM object which
> embeds the drm_gpuvm instance).
>
> Making it driver-specific just moves the responsibility back to drivers
> (and also allows re-using a real GEM object instead of a dummy one,
> but I'm not sure we care about saving a few hundred bytes at that
> point), which is a good way to not take the blame if the driver does
> something wrong, but also doesn't really help people do the right thing.
>