[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <9b91e06c-50fd-b567-7d2b-a2597d01c4dc@amd.com>
Date: Thu, 28 Apr 2022 11:27:08 -0400
From: Andrey Grodzovsky <andrey.grodzovsky@....com>
To: Hangyu Hua <hbh25y@...il.com>, yuq825@...il.com, airlied@...ux.ie,
daniel@...ll.ch
Cc: dri-devel@...ts.freedesktop.org, lima@...ts.freedesktop.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] gpu: drm: remove redundant dma_fence_put() when
drm_sched_job_add_dependency() fails
On 2022-04-28 04:56, Hangyu Hua wrote:
> On 2022/4/27 22:43, Andrey Grodzovsky wrote:
>>
>> On 2022-04-26 22:31, Hangyu Hua wrote:
>>> On 2022/4/26 22:55, Andrey Grodzovsky wrote:
>>>>
>>>> On 2022-04-25 22:54, Hangyu Hua wrote:
>>>>> On 2022/4/25 23:42, Andrey Grodzovsky wrote:
>>>>>> On 2022-04-25 04:36, Hangyu Hua wrote:
>>>>>>
>>>>>>> When drm_sched_job_add_dependency() fails, dma_fence_put() will
>>>>>>> be called
>>>>>>> internally. Calling it again after
>>>>>>> drm_sched_job_add_dependency() finishes
>>>>>>> may result in a dangling pointer.
>>>>>>>
>>>>>>> Fix this by removing redundant dma_fence_put().
>>>>>>>
>>>>>>> Signed-off-by: Hangyu Hua <hbh25y@...il.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/lima/lima_gem.c | 1 -
>>>>>>> drivers/gpu/drm/scheduler/sched_main.c | 1 -
>>>>>>> 2 files changed, 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/lima/lima_gem.c
>>>>>>> b/drivers/gpu/drm/lima/lima_gem.c
>>>>>>> index 55bb1ec3c4f7..99c8e7f6bb1c 100644
>>>>>>> --- a/drivers/gpu/drm/lima/lima_gem.c
>>>>>>> +++ b/drivers/gpu/drm/lima/lima_gem.c
>>>>>>> @@ -291,7 +291,6 @@ static int lima_gem_add_deps(struct drm_file
>>>>>>> *file, struct lima_submit *submit)
>>>>>>> err =
>>>>>>> drm_sched_job_add_dependency(&submit->task->base, fence);
>>>>>>> if (err) {
>>>>>>> - dma_fence_put(fence);
>>>>>>> return err;
>>>>>>
>>>>>>
>>>>>> Makes sense here
>>>>>>
>>>>>>
>>>>>>> }
>>>>>>> }
>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> index b81fceb0b8a2..ebab9eca37a8 100644
>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> @@ -708,7 +708,6 @@ int
>>>>>>> drm_sched_job_add_implicit_dependencies(struct drm_sched_job *job,
>>>>>>> dma_fence_get(fence);
>>>>>>> ret = drm_sched_job_add_dependency(job, fence);
>>>>>>> if (ret) {
>>>>>>> - dma_fence_put(fence);
>>>>>>
>>>>>>
>>>>>>
>>>>>> Not sure about this one since if you look at the relevant commits -
>>>>>> 'drm/scheduler: fix drm_sched_job_add_implicit_dependencies' and
>>>>>> 'drm/scheduler: fix drm_sched_job_add_implicit_dependencies harder'
>>>>>> You will see that the dma_fence_put here balances the extra
>>>>>> dma_fence_get
>>>>>> above
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>
>>>>> I don't think so. I checked the call chain and found no additional
>>>>> dma_fence_get(). But dma_fence_get() needs to be called before
>>>>> drm_sched_job_add_dependency() to keep the counter balanced.
>>>>
>>>>
>>>> I don't say there is an additional get, I just say that
>>>> drm_sched_job_add_dependency doesn't grab an extra reference to the
>>>> fences it stores so this needs to be done outside and for that
>>>> drm_sched_job_add_implicit_dependencies->dma_fence_get is called
>>>> and, if this addition fails you just call dma_fence_put to keep the
>>>> counter balanced.
>>>>
>>>
>>> drm_sched_job_add_implicit_dependencies() will call
>>> drm_sched_job_add_dependency(). And drm_sched_job_add_dependency()
>>> already call dma_fence_put() when it fails. Calling dma_fence_put()
>>> twice doesn't make sense.
>>>
>>> dma_fence_get() is in [2]. But dma_fence_put() will be called in [1]
>>> and [3] when xa_alloc() fails.
>>
>>
>> The way I see it, [2] and [3] are mat matching *get* and *put*
>> respectively. [1] *put* is against the original
>> dma_fence_init->kref_init of the fence which always set the refcount
>> at 1.
>> Also in support of this see commit 'drm/scheduler: fix
>> drm_sched_job_add_implicit_dependencies harder' - it says there
>> "drm_sched_job_add_dependency() could drop the last ref" - this last
>> ref is the original refcount set by dma_fence_init->kref
>>
>> Andrey
>
>
> You can see that drm_sched_job_add_dependency() has three return paths
> they are [4], [5] and [1]. [4] and [5] will return 0. [1] will return
> error.
>
> There will be three weird problems if you're right:
>
> 1. [5] path will triger a refcount leak beacause ret is 0 in *if*[6].
Terminology confusion issue - [5] is a 'put' so it cannot cause a leak
by definition, extra unbalanced 'get' will cause a leak because memory
is never released, extra put will just probably cause a warning in
kref_put or maybe double free.
> Otherwise [2] and [5] are matching *get* and *put* in here.
Exactly, they are matching - so until this point all good and no 'leak'
then, no ?
>
> 2. [4] path need a additional dma_fence_get() to adds the fence as a
> job dependency. fence is from obj->resv. Taking msm as an example
> obj->resv is from etnaviv_ioctl_gem_submit()->submit_lookup_objects().
> It is not possible that an object has *refcount == 1* but is
> referenced in two places. So dma_fence_get() called in [2] is for [4].
> By the way, [3] don't execute in this case.
Still don't see the problem - [2] is the additional dma_fence_get() you
need here (just as you say above).
>
> 3. This one is a doubt. You can see in "[PATCH] drm/scheduler: fix
> drm_sched_job_add_implicit_dependencies harder".
> drm_sched_job_add_dependency() could drop the last ref, so we need to do
> the dma_fence_get() first. But the last ref still will drop in [3] if
> drm_sched_job_add_dependency() go path [1]. And there is only a
> *return* between [1] and [3]. Is this necessary? I think Rob Clark
> wants to avoid the last ref being dropped in
> drm_sched_job_add_implicit_dependencies() because fence is still used
> by obj->resv.
In the scenario above - if we go thorough path [1] refcount before [1]
starts is 2 - one from original kref_init and one from [2] and so it's
balanced against 2 puts (one from [1] and one from [3]) so I still don't
see a problem.
I suggest that you give a specific scenario from fence ref-count
perspective that your patch fixes. I might be wrong but unless you give
a specific case where the 'put' in [3] is redundant I just can't see it.
Andrey
>
>
> int drm_sched_job_add_dependency(struct drm_sched_job *job,
> struct dma_fence *fence)
> {
> ...
> xa_for_each(&job->dependencies, index, entry) {
> if (entry->context != fence->context)
> continue;
>
> if (dma_fence_is_later(fence, entry)) {
> dma_fence_put(entry);
> xa_store(&job->dependencies, index, fence,
> GFP_KERNEL); <---- [4]
> } else {
> dma_fence_put(fence); <---- [5]
> }
> return 0;
> }
>
> ret = xa_alloc(&job->dependencies, &id, fence, xa_limit_32b,
> GFP_KERNEL);
> if (ret != 0)
> dma_fence_put(fence); <---- [1]
>
> return ret;
> }
>
>
> int drm_sched_job_add_implicit_dependencies(struct drm_sched_job *job,
> struct drm_gem_object *obj,
> bool write)
> {
> struct dma_resv_iter cursor;
> struct dma_fence *fence;
> int ret;
>
> dma_resv_for_each_fence(&cursor, obj->resv, write, fence) {
> /* Make sure to grab an additional ref on the added
> fence */
> dma_fence_get(fence); <---- [2]
> ret = drm_sched_job_add_dependency(job, fence);
> if (ret) { <---- [6]
> dma_fence_put(fence); <---- [3]
>
> return ret;
> }
> }
> return 0;
> }
>
> Thanks,
> hangyu
>
>>
>>>
>>>
>>> int drm_sched_job_add_dependency(struct drm_sched_job *job,
>>> struct dma_fence *fence)
>>> {
>>> ...
>>> ret = xa_alloc(&job->dependencies, &id, fence, xa_limit_32b,
>>> GFP_KERNEL);
>>> if (ret != 0)
>>> dma_fence_put(fence); <--- [1]
>>>
>>> return ret;
>>> }
>>> EXPORT_SYMBOL(drm_sched_job_add_dependency);
>>>
>>>
>>> int drm_sched_job_add_implicit_dependencies(struct drm_sched_job *job,
>>> struct drm_gem_object *obj,
>>> bool write)
>>> {
>>> struct dma_resv_iter cursor;
>>> struct dma_fence *fence;
>>> int ret;
>>>
>>> dma_resv_for_each_fence(&cursor, obj->resv, write, fence) {
>>> /* Make sure to grab an additional ref on the added fence */
>>> dma_fence_get(fence); <--- [2]
>>> ret = drm_sched_job_add_dependency(job, fence);
>>> if (ret) {
>>> dma_fence_put(fence); <--- [3]
>>> return ret;
>>> }
>>> }
>>> return 0;
>>> }
>>>
>>>
>>>>
>>>>> On the other hand, dma_fence_get() and dma_fence_put() are
>>>>> meaningless here if threre is an extra dma_fence_get() beacause
>>>>> counter will not decrease to 0 during drm_sched_job_add_dependency().
>>>>>
>>>>> I check the call chain as follows:
>>>>>
>>>>> msm_ioctl_gem_submit()
>>>>> -> submit_fence_sync()
>>>>> -> drm_sched_job_add_implicit_dependencies()
>>>>
>>>>
>>>> Can you maybe trace or print one such example of problematic
>>>> refcount that you are trying to fix ? I still don't see where is
>>>> the problem.
>>>>
>>>> Andrey
>>>>
>>>
>>> I also wish I could. System logs can make this easy. But i don't
>>> have a corresponding GPU physical device.
>>> drm_sched_job_add_implicit_dependencies is only used in a few devices.
>>>
>>> Thanks.
>>>>
>>>>>
>>>>> Thanks,
>>>>> Hangyu
>>>>>
>>>>>>
>>>>>>> return ret;
>>>>>>> }
>>>>>>> }
Powered by blists - more mailing lists