Message-ID: <d0cbf0f3-121d-ff37-3a0f-eca798d56e47@amd.com>
Date: Thu, 21 Sep 2023 16:21:10 +0200
From: Christian König <christian.koenig@....com>
To: Danilo Krummrich <dakr@...hat.com>, airlied@...il.com,
daniel@...ll.ch, matthew.brost@...el.com,
thomas.hellstrom@...ux.intel.com, sarah.walker@...tec.com,
donald.robson@...tec.com, boris.brezillon@...labora.com,
faith.ekstrand@...labora.com
Cc: dri-devel@...ts.freedesktop.org, nouveau@...ts.freedesktop.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH drm-misc-next v4 4/8] drm/gpuvm: add common dma-resv per
struct drm_gpuvm
Am 21.09.23 um 15:34 schrieb Danilo Krummrich:
> On 9/21/23 09:39, Christian König wrote:
>> Am 20.09.23 um 16:42 schrieb Danilo Krummrich:
>>> Provide a common dma-resv for GEM objects not being used outside of
>>> this
>>> GPU-VM. This is used in a subsequent patch to generalize dma-resv,
>>> external and evicted object handling and GEM validation.
>>>
>>> Signed-off-by: Danilo Krummrich <dakr@...hat.com>
>>> ---
>>> drivers/gpu/drm/drm_gpuvm.c | 9 +++++++--
>>> drivers/gpu/drm/nouveau/nouveau_uvmm.c | 2 +-
>>> include/drm/drm_gpuvm.h | 17 ++++++++++++++++-
>>> 3 files changed, 24 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/drm_gpuvm.c b/drivers/gpu/drm/drm_gpuvm.c
>>> index bfea4a8a19ec..cbf4b738a16c 100644
>>> --- a/drivers/gpu/drm/drm_gpuvm.c
>>> +++ b/drivers/gpu/drm/drm_gpuvm.c
>>> @@ -655,6 +655,7 @@ drm_gpuva_range_valid(struct drm_gpuvm *gpuvm,
>>> /**
>>> * drm_gpuvm_init() - initialize a &drm_gpuvm
>>> * @gpuvm: pointer to the &drm_gpuvm to initialize
>>> + * @drm: the drivers &drm_device
>>> * @name: the name of the GPU VA space
>>> * @start_offset: the start offset of the GPU VA space
>>> * @range: the size of the GPU VA space
>>> @@ -668,7 +669,7 @@ drm_gpuva_range_valid(struct drm_gpuvm *gpuvm,
>>> * &name is expected to be managed by the surrounding driver
>>> structures.
>>> */
>>> void
>>> -drm_gpuvm_init(struct drm_gpuvm *gpuvm,
>>> +drm_gpuvm_init(struct drm_gpuvm *gpuvm, struct drm_device *drm,
>>> const char *name,
>>> u64 start_offset, u64 range,
>>> u64 reserve_offset, u64 reserve_range,
>>> @@ -694,6 +695,8 @@ drm_gpuvm_init(struct drm_gpuvm *gpuvm,
>>> reserve_range)))
>>> __drm_gpuva_insert(gpuvm, &gpuvm->kernel_alloc_node);
>>> }
>>> +
>>> + drm_gem_private_object_init(drm, &gpuvm->d_obj, 0);
>>> }
>>> EXPORT_SYMBOL_GPL(drm_gpuvm_init);
>>> @@ -713,7 +716,9 @@ drm_gpuvm_destroy(struct drm_gpuvm *gpuvm)
>>> __drm_gpuva_remove(&gpuvm->kernel_alloc_node);
>>> WARN(!RB_EMPTY_ROOT(&gpuvm->rb.tree.rb_root),
>>> - "GPUVA tree is not empty, potentially leaking memory.");
>>> + "GPUVA tree is not empty, potentially leaking memory.\n");
>>> +
>>> + drm_gem_private_object_fini(&gpuvm->d_obj);
>>> }
>>> EXPORT_SYMBOL_GPL(drm_gpuvm_destroy);
>>> diff --git a/drivers/gpu/drm/nouveau/nouveau_uvmm.c
>>> b/drivers/gpu/drm/nouveau/nouveau_uvmm.c
>>> index 6c86b64273c3..a80ac8767843 100644
>>> --- a/drivers/gpu/drm/nouveau/nouveau_uvmm.c
>>> +++ b/drivers/gpu/drm/nouveau/nouveau_uvmm.c
>>> @@ -1836,7 +1836,7 @@ nouveau_uvmm_init(struct nouveau_uvmm *uvmm,
>>> struct nouveau_cli *cli,
>>> uvmm->kernel_managed_addr = kernel_managed_addr;
>>> uvmm->kernel_managed_size = kernel_managed_size;
>>> - drm_gpuvm_init(&uvmm->base, cli->name,
>>> + drm_gpuvm_init(&uvmm->base, cli->drm->dev, cli->name,
>>> NOUVEAU_VA_SPACE_START,
>>> NOUVEAU_VA_SPACE_END,
>>> kernel_managed_addr, kernel_managed_size,
>>> diff --git a/include/drm/drm_gpuvm.h b/include/drm/drm_gpuvm.h
>>> index 0e802676e0a9..6666c07d7c3e 100644
>>> --- a/include/drm/drm_gpuvm.h
>>> +++ b/include/drm/drm_gpuvm.h
>>> @@ -240,14 +240,29 @@ struct drm_gpuvm {
>>> * @ops: &drm_gpuvm_ops providing the split/merge steps to
>>> drivers
>>> */
>>> const struct drm_gpuvm_ops *ops;
>>> +
>>> + /**
>>> + * @d_obj: Dummy GEM object; used internally to pass the GPU VMs
>>> + * dma-resv to &drm_exec. Provides the GPUVM's &dma-resv.
>>> + */
>>> + struct drm_gem_object d_obj;
>>
>> Yeah, as pointed out in the other mail, that won't work like this.
>
> Which one? Seems that I missed it.
>
>>
>> The GPUVM contains GEM objects and therefore should probably have a
>> reference to those objects.
>>
>> When those GEM objects now use the dma-resv object embedded inside
>> the GPUVM, they also need a reference to the GPUVM to make sure the
>> dma-resv object isn't freed before they are.
>
> My assumption here is that GEM objects local to a certain VM never
> out-live the VM. We never share them with anyone, otherwise they would
> be external and hence wouldn't carry the VM's dma-resv. The only
> references I see are from the VM itself (which is fine) and from
> userspace. The latter isn't a problem as long as all GEM handles are
> closed before the VM is destroyed on FD close.
>
> Am I missing something? Do we have use cases where this isn't true?

There are multiple use cases where this isn't true. One example is
memory management with TTM or drm_exec. They both grab references to the
objects they lock.

Since this is eviction code, it is perfectly possible that a GEM object
is locked from a different VM than the one currently in use. So a single
GEM object from a VM can live longer than the VM itself.
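
To illustrate the problem with the embedded dummy object (just a sketch,
the names are made up):

#include <drm/drm_gem.h>
#include <drm/drm_gpuvm.h>
#include <linux/slab.h>

struct my_vm {
	struct drm_gpuvm base;		/* embeds the dummy GEM d_obj */
};

struct my_bo {
	struct drm_gem_object gem;
};

/* VM-local BOs are set up to share the VM's dma-resv ... */
static void my_bo_init_local(struct my_vm *vm, struct my_bo *bo)
{
	bo->gem.resv = drm_gpuvm_resv(&vm->base);
}

/* ... but eviction code (TTM/drm_exec) may still hold a reference to
 * bo->gem when the VM goes away:
 */
static void my_vm_fini(struct my_vm *vm)
{
	drm_gpuvm_destroy(&vm->base);	/* dma_resv_fini() on d_obj's resv */
	kfree(vm);
}

/* Any later dma_resv_lock(bo->gem.resv, NULL) from the eviction path now
 * dereferences freed memory. Avoiding that means the BO has to hold a
 * reference to the VM, which brings you right back to the reference
 * cycle.
 */
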
Another potential case is delayed delete, where a GEM object might need
to stay around a bit longer because of hw restrictions. This can be as
simple as waiting for shaders to finish, but there are also hw
workarounds where we need to wait for some grace time before freeing
things.
>
>
>>
>> This is a circular reference dependency.
>>
>> The simplest solution I can see is to let the driver provide the GEM
>> object to use. Amdgpu uses the root page directory object for this.
>
> Sure, we can do that, if we see cases where VM local GEM objects can
> out-live the VM.
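
Roughly what I have in mind, just as a sketch (the r_obj name and the
exact signature are placeholders, not a finished interface):

/* Sketch only, exact signature up for bikeshedding: instead of a
 * drm_device for creating a dummy object, the driver passes in a GEM
 * object it already owns; amdgpu would use the root page directory BO.
 */
void drm_gpuvm_init(struct drm_gpuvm *gpuvm, struct drm_gem_object *r_obj,
		    const char *name,
		    u64 start_offset, u64 range,
		    u64 reserve_offset, u64 reserve_range,
		    const struct drm_gpuvm_ops *ops);

/* With struct drm_gpuvm storing "struct drm_gem_object *r_obj" and
 * holding a reference on it, the helper from this patch becomes:
 */
#define drm_gpuvm_resv(gpuvm__) ((gpuvm__)->r_obj->resv)

That way the lifetime of the object carrying the resv is entirely under
driver control and GPUVM never creates a GEM object behind the driver's
back.
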
>
>>
>> Apart from that I strongly think that we shouldn't let the GPUVM code
>> create a driver GEM object. We did that in TTM for the ghost objects
>> and it turned out to be a bad idea.
>
> You mean let GPUVM create a dummy GEM based on the drm_device from the
> driver? What were the problems that were encountered?

See ttm_buffer_object_transfer(): basically we created a dummy TTM BO to
hang on to the old resources for pipelining eviction work.

While that initially was a good idea because it sped things up quite
massively, it turned out to be a big maintenance burden, because those
dummy BOs ended up in driver-specific functions and the drivers tried to
upcast them to their internal representation. That of course didn't work
and caused very subtle memory corruptions.
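
To make this concrete, the failure pattern was roughly the following
(simplified, the structure names are made up):

#include <drm/ttm/ttm_bo.h>
#include <linux/container_of.h>

struct my_driver_bo {				/* the driver's own BO type */
	struct ttm_buffer_object tbo;
	unsigned long driver_flags;		/* state only the driver knows */
};

static void my_driver_eviction_hook(struct ttm_buffer_object *tbo)
{
	/* Correct for BOs the driver allocated itself ... */
	struct my_driver_bo *dbo = container_of(tbo, struct my_driver_bo, tbo);

	/* ... but the ghost BOs from ttm_buffer_object_transfer() are bare
	 * ttm_buffer_objects allocated by TTM, so for those this reads and
	 * writes memory that was never part of a my_driver_bo.
	 */
	dbo->driver_flags |= 1;
}
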
KASAN was a big help in narrowing those down, but we initially spent
months figuring out why some random code would occasionally go south
when TTM was in use.

I really don't want to repeat that.
Regards,
Christian.
>
>
> - Danilo
>
>>
>> Regards,
>> Christian.
>>
>>> };
>>> -void drm_gpuvm_init(struct drm_gpuvm *gpuvm, const char *name,
>>> +void drm_gpuvm_init(struct drm_gpuvm *gpuvm, struct drm_device *drm,
>>> + const char *name,
>>> u64 start_offset, u64 range,
>>> u64 reserve_offset, u64 reserve_range,
>>> const struct drm_gpuvm_ops *ops);
>>> void drm_gpuvm_destroy(struct drm_gpuvm *gpuvm);
>>> +/**
>>> + * drm_gpuvm_resv() - returns the &drm_gpuvm's &dma_resv
>>> + * @gpuvm__: the &drm_gpuvm
>>> + *
>>> + * Returns: a pointer to the &drm_gpuvm's &dma_resv
>>> + */
>>> +#define drm_gpuvm_resv(gpuvm__) (&(gpuvm__)->d_obj._resv)
>>> +
>>> static inline struct drm_gpuva *
>>> __drm_gpuva_next(struct drm_gpuva *va)
>>> {
>>
>