Message-ID: <6733693f-64b2-47fa-97ba-4ebba3edef35@intel.com>
Date: Mon, 23 Jun 2025 08:26:46 -0700
From: Daniele Ceraolo Spurio <daniele.ceraolospurio@...el.com>
To: "Nilawar, Badal" <badal.nilawar@...el.com>,
<intel-xe@...ts.freedesktop.org>, <dri-devel@...ts.freedesktop.org>,
<linux-kernel@...r.kernel.org>
CC: <anshuman.gupta@...el.com>, <rodrigo.vivi@...el.com>,
<alexander.usyskin@...el.com>, <gregkh@...uxfoundation.org>, <jgg@...dia.com>
Subject: Re: [PATCH v3 06/10] drm/xe/xe_late_bind_fw: Reload late binding fw
in rpm resume
On 6/18/2025 10:52 PM, Nilawar, Badal wrote:
>
> On 19-06-2025 02:35, Daniele Ceraolo Spurio wrote:
>>
>>
>> On 6/18/2025 12:00 PM, Badal Nilawar wrote:
>>> Reload late binding fw during runtime resume.
>>>
>>> v2: Flush worker during runtime suspend
>>>
>>> Signed-off-by: Badal Nilawar <badal.nilawar@...el.com>
>>> ---
>>> drivers/gpu/drm/xe/xe_late_bind_fw.c | 2 +-
>>> drivers/gpu/drm/xe/xe_late_bind_fw.h | 1 +
>>> drivers/gpu/drm/xe/xe_pm.c | 6 ++++++
>>> 3 files changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_late_bind_fw.c
>>> b/drivers/gpu/drm/xe/xe_late_bind_fw.c
>>> index 54aa08c6bdfd..c0be9611c73b 100644
>>> --- a/drivers/gpu/drm/xe/xe_late_bind_fw.c
>>> +++ b/drivers/gpu/drm/xe/xe_late_bind_fw.c
>>> @@ -58,7 +58,7 @@ static int xe_late_bind_fw_num_fans(struct
>>> xe_late_bind *late_bind)
>>> return 0;
>>> }
>>> -static void xe_late_bind_wait_for_worker_completion(struct
>>> xe_late_bind *late_bind)
>>> +void xe_late_bind_wait_for_worker_completion(struct xe_late_bind
>>> *late_bind)
>>> {
>>> struct xe_device *xe = late_bind_to_xe(late_bind);
>>> struct xe_late_bind_fw *lbfw;
>>> diff --git a/drivers/gpu/drm/xe/xe_late_bind_fw.h
>>> b/drivers/gpu/drm/xe/xe_late_bind_fw.h
>>> index 28d56ed2bfdc..07e437390539 100644
>>> --- a/drivers/gpu/drm/xe/xe_late_bind_fw.h
>>> +++ b/drivers/gpu/drm/xe/xe_late_bind_fw.h
>>> @@ -12,5 +12,6 @@ struct xe_late_bind;
>>> int xe_late_bind_init(struct xe_late_bind *late_bind);
>>> int xe_late_bind_fw_load(struct xe_late_bind *late_bind);
>>> +void xe_late_bind_wait_for_worker_completion(struct xe_late_bind
>>> *late_bind);
>>> #endif
>>> diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
>>> index ff749edc005b..91923fd4af80 100644
>>> --- a/drivers/gpu/drm/xe/xe_pm.c
>>> +++ b/drivers/gpu/drm/xe/xe_pm.c
>>> @@ -20,6 +20,7 @@
>>> #include "xe_gt.h"
>>> #include "xe_guc.h"
>>> #include "xe_irq.h"
>>> +#include "xe_late_bind_fw.h"
>>> #include "xe_pcode.h"
>>> #include "xe_pxp.h"
>>> #include "xe_trace.h"
>>> @@ -460,6 +461,8 @@ int xe_pm_runtime_suspend(struct xe_device *xe)
>>> if (err)
>>> goto out;
>>> + xe_late_bind_wait_for_worker_completion(&xe->late_bind);
>>
>> I think this can deadlock, because you do an rpm_put from within the
>> worker, and if that's the last put it'll end up here and wait for the
>> worker to complete.
>> We could probably just skip this wait, because the worker can handle
>> rpm itself. What we might want to be careful about is to not re-queue
>> it (from xe_late_bind_fw_load below) if it's currently being
>> executed; alternatively, we could just let the fw be loaded twice if
>> we hit that race condition, which shouldn't be an issue apart from
>> doing some redundant work.
>
> In xe_pm_runtime_get/_put, deadlocks are avoided by verifying the
> condition (xe_pm_read_callback_task(xe) == current).
Isn't that for rpm_get/put calls done from within the rpm_suspend/resume
code? That's not the case here: we're not deadlocking on the rpm lock,
we're deadlocking on the worker.
The error flow as I see it would be as follows:

  rpm refcount is 1, owned by thread X
  worker starts
  worker takes rpm [rpm refcount now 2]
  thread X releases rpm [rpm refcount now 1]
  worker releases rpm [rpm refcount now 0]
  rpm_suspend is called from within the worker
  xe_pm_write_callback_task is called
  flush_work is called -> deadlock

I don't see how the callback_task() check can stop the flush_work from
deadlocking here.
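
To make the pattern concrete, here is a generic sketch of the self-flush
deadlock (illustrative only, not the actual xe code: it uses a
synchronous put for simplicity and the struct/function names are made
up):

#include <linux/device.h>
#include <linux/pm_runtime.h>
#include <linux/workqueue.h>

struct demo {
	struct device *dev;
	struct work_struct work;
};

static void demo_work_fn(struct work_struct *work)
{
	struct demo *d = container_of(work, struct demo, work);

	pm_runtime_get_sync(d->dev);
	/* ... load the firmware ... */

	/*
	 * If this drops the last reference, ->runtime_suspend() can run
	 * synchronously in this very context ...
	 */
	pm_runtime_put_sync(d->dev);
}

static int demo_runtime_suspend(struct device *dev)
{
	struct demo *d = dev_get_drvdata(dev);

	/* ... and this then waits on demo_work_fn(), i.e. the caller. */
	flush_work(&d->work);
	return 0;
}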
Also, what happens if the rpm refcount is 0 when the worker starts?
Assuming the deadlock issue above is not there:

  worker starts
  worker takes rpm [rpm refcount now 1]
  rpm_resume is called
  worker is re-queued
  worker releases rpm [rpm refcount now 0]
  worker exits
  worker re-starts -> go back to the beginning

This second issue should be easy to fix by using pm_get_if_in_use from
the worker, so we don't load the late_bind table if we're rpm_suspended,
since we'll do it anyway when someone else resumes the device.
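
Something along these lines in the worker function (a rough sketch, not
a real patch: the worker function name and the work_struct field name
are assumed, and I'm using the xe_pm_runtime_get_if_in_use() helper
here, with pm_runtime_get_if_in_use() being the core-PM equivalent):

static void late_bind_fw_work_fn(struct work_struct *work)
{
	struct xe_late_bind *late_bind =
		container_of(work, struct xe_late_bind, work);
	struct xe_device *xe = late_bind_to_xe(late_bind);

	/*
	 * Don't wake the device just to push the late-binding tables:
	 * if it is runtime-suspended, the resume path reloads them.
	 */
	if (!xe_pm_runtime_get_if_in_use(xe))
		return;

	/* ... send the payload ... */

	xe_pm_runtime_put(xe);
}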
Daniele
>
> Badal
>
>>
>> Daniele
>>
>>> +
>>> /*
>>> * Applying lock for entire list op as xe_ttm_bo_destroy and
>>> xe_bo_move_notify
>>> * also checks and deletes bo entry from user fault list.
>>> @@ -550,6 +553,9 @@ int xe_pm_runtime_resume(struct xe_device *xe)
>>> xe_pxp_pm_resume(xe->pxp);
>>> + if (xe->d3cold.allowed)
>>> + xe_late_bind_fw_load(&xe->late_bind);
>>> +
>>> out:
>>> xe_rpm_lockmap_release(xe);
>>> xe_pm_write_callback_task(xe, NULL);
>>