Message-ID: <a8b972be-7265-492f-9855-cdec94a0e0dc@amd.com>
Date: Fri, 23 Jan 2026 16:30:12 -0600
From: Mario Limonciello <mario.limonciello@....com>
To: Timur Kristóf <timur.kristof@...il.com>,
Hamza Mahfooz <someguy@...ective-light.com>,
dri-devel@...ts.freedesktop.org, Christian König
<christian.koenig@....com>
Cc: Alex Deucher <alexander.deucher@....com>, David Airlie
<airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
Harry Wentland <harry.wentland@....com>, Leo Li <sunpeng.li@....com>,
Rodrigo Siqueira <siqueira@...lia.com>,
Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>,
Sunil Khatri <sunil.khatri@....com>, Ce Sun <cesun102@....com>,
Lijo Lazar <lijo.lazar@....com>, Kenneth Feng <kenneth.feng@....com>,
Ivan Lipski <ivan.lipski@....com>, Alex Hung <alex.hung@....com>,
Tom Chung <chiahsuan.chung@....com>, Melissa Wen <mwen@...lia.com>,
Michel Dänzer <mdaenzer@...hat.com>,
Fangzhi Zuo <Jerry.Zuo@....com>, amd-gfx@...ts.freedesktop.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] drm: introduce page_flip_timeout()
On 1/23/2026 8:44 AM, Timur Kristóf wrote:
> On Friday, January 23, 2026 2:52:44 PM Central European Standard Time
> Christian König wrote:
>> On 1/23/26 01:05, Hamza Mahfooz wrote:
>>> There should be a mechanism for drivers to respond to flip_done
>>> timeouts.
>>
>
When there is a display hang, I think that resetting the GPU IP is
really heavy-handed. I second what Alex said: why not just reset DCN
instead? I would think moving DCN into D3 and back out should be enough
if we are trying to use something to recover.
> I am adding Harry and Mario to this email as they are more familiar with this.
>
>> I can only see two reasons why you could run into a timeout:
>>
>> 1. A dma_fence never signals.
>> How that should be handled is already well documented and doesn't
>> require any of this.
>
> Page flip timeouts have nothing to do with fence timeouts.
> A page flip timeout can occur even when all fences of all job submissions
> complete correctly and on time.
>
>>
>> 2. A coding error in the vblank or page flip handler leading to waiting
>> forever. In that case calling back into the driver doesn't help either.
>
> At the moment, a page flip timeout will leave the whole system in a hung state,
> and the driver does not even attempt to recover it in any way; it just stops
> doing anything. That is unacceptable, and I'm pretty surprised that it was
> left like that for so long.
>
> Note that we have approximately a hundred bug reports open on the drm/amd bug
> tracker about "random" page flip timeouts. It affects a lot of users.
Yeah, I would much rather see a message in the log that this
happened, followed by a recovery, than a hang.
>
>>
>> So as far as I can see the whole approach doesn't make any sense at all.
>
> Actually this approach was proposed as a solution at XDC 2025 in Harry's
> presentation, "DRM calls driver callback to attempt recovery", see page 9 in
> this slide deck:
>
> https://indico.freedesktop.org/event/10/contributions/431/attachments/267/355/2025%20XDC%20Hackfest%20Update%20v1.2.pdf
>
> If you disagree with Harry, please make a counter-proposal.
Hamza, since you seem to have a "workload" that can run overnight and
that this series recovers from, can you try what Alex suggested and do a
dc_suspend() and dc_resume() on failure?
Make sure you log a message so you know it worked.
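To make the suggested experiment concrete, here is a minimal userspace C model of such a recovery path: log first, then cycle the display engine through a low-power state and back. Everything in it (struct dcn_model, the D0/D3 enum, dcn_model_suspend(), dcn_model_resume(), dcn_recover_from_flip_timeout()) is hypothetical; the two helpers only stand in for the dc_suspend()/dc_resume() calls mentioned above, whose real signatures and behavior in the amdgpu display code are not modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical power states, loosely mirroring D0 (on) and D3 (off). */
enum dcn_power_state { DCN_D0, DCN_D3 };

/* Hypothetical stand-in for the display engine's state. */
struct dcn_model {
	enum dcn_power_state power;
	bool hung;           /* true while the flip engine is stuck */
	unsigned int resets; /* number of recovery attempts so far */
};

/* Stand-in for dc_suspend(): power the display block down. */
static void dcn_model_suspend(struct dcn_model *dcn)
{
	dcn->power = DCN_D3;
}

/* Stand-in for dc_resume(): power the display block back up.
 * Assumption of this model: a full power cycle clears the hang. */
static void dcn_model_resume(struct dcn_model *dcn)
{
	dcn->power = DCN_D0;
	dcn->hung = false;
}

/* Recovery path: log first so the attempt is visible, then cycle power.
 * Returns true if the engine ends up powered on and no longer hung. */
static bool dcn_recover_from_flip_timeout(struct dcn_model *dcn)
{
	assert(dcn != NULL);
	fprintf(stderr, "flip_done timed out, resetting display engine (attempt %u)\n",
		dcn->resets + 1);
	dcn_model_suspend(dcn);
	dcn_model_resume(dcn);
	dcn->resets++;
	return dcn->power == DCN_D0 && !dcn->hung;
}
```

Calling dcn_recover_from_flip_timeout() on a model with hung set to true prints one log line, leaves the engine in D0 with hung cleared, and sets resets to 1. Whether a real D3 round trip actually clears a real hang is what the overnight run would show; the model simply assumes it.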
>
> Thanks,
> Timur
>
>
>
>>
>>> As it stands, it is possible for the display
>>> to stall indefinitely, necessitating a hard reset. So, introduce
>>> a new crtc callback that is called by
>>> drm_atomic_helper_wait_for_flip_done() to give drivers a shot
>>> at recovering from page flip timeouts.
>>>
>>> Signed-off-by: Hamza Mahfooz <someguy@...ective-light.com>
>>> ---
>>>
>>> drivers/gpu/drm/drm_atomic_helper.c | 6 +++++-
>>> include/drm/drm_crtc.h | 9 +++++++++
>>> 2 files changed, 14 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
>>> index 5840e9cc6f66..3a144c324b19 100644
>>> --- a/drivers/gpu/drm/drm_atomic_helper.c
>>> +++ b/drivers/gpu/drm/drm_atomic_helper.c
>>> @@ -1881,9 +1881,13 @@ void drm_atomic_helper_wait_for_flip_done(struct drm_device *dev,
>>>  			continue;
>>>  
>>>  		ret = wait_for_completion_timeout(&commit->flip_done, 10 * HZ);
>>> -		if (ret == 0)
>>> +		if (!ret) {
>>>  			drm_err(dev, "[CRTC:%d:%s] flip_done timed out\n",
>>>  				crtc->base.id, crtc->name);
>>> +
>>> +			if (crtc->funcs->page_flip_timeout)
>>> +				crtc->funcs->page_flip_timeout(crtc);
>>> +		}
>>>  	}
>>>  
>>>  	if (state->fake_commit)
>>> diff --git a/include/drm/drm_crtc.h b/include/drm/drm_crtc.h
>>> index 66278ffeebd6..45dc5a76e915 100644
>>> --- a/include/drm/drm_crtc.h
>>> +++ b/include/drm/drm_crtc.h
>>> @@ -609,6 +609,15 @@ struct drm_crtc_funcs {
>>>  				 uint32_t flags, uint32_t target,
>>>  				 struct drm_modeset_acquire_ctx *ctx);
>>>  
>>> +	/**
>>> +	 * @page_flip_timeout:
>>> +	 *
>>> +	 * This optional hook is called if &drm_crtc_commit.flip_done times out,
>>> +	 * and can be used by drivers to attempt to recover from a page flip
>>> +	 * timeout.
>>> +	 */
>>> +	void (*page_flip_timeout)(struct drm_crtc *crtc);
>>> +
>>>  	/**
>>>  	 * @set_property:
>>>  	 *
>
>
>
>
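The control flow the patch adds to drm_atomic_helper_wait_for_flip_done() can be modeled in userspace C as below. The *_model types and functions, example_page_flip_timeout(), and the HZ value of 100 are hypothetical stand-ins, not DRM code. The one real detail carried over is that wait_for_completion_timeout() returns 0 on timeout and a positive value on completion, which is why the patch can test !ret.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Assumption: HZ is a kernel config option; 100 is only an example value. */
#define MODEL_HZ 100

struct crtc_model;

/* Hypothetical counterpart of struct drm_crtc_funcs, reduced to the new
 * optional hook; page_flip_timeout may be NULL. */
struct crtc_funcs_model {
	void (*page_flip_timeout)(struct crtc_model *crtc);
};

/* Hypothetical counterpart of struct drm_crtc plus its pending commit. */
struct crtc_model {
	int id;
	const char *name;
	const struct crtc_funcs_model *funcs;
	bool flip_done;          /* whether the pending flip has completed */
	unsigned int recoveries; /* bumped by example_page_flip_timeout() */
};

/* Stand-in for wait_for_completion_timeout(): returns 0 on timeout and a
 * positive value on completion. Unlike the kernel, it never sleeps; it
 * reports the full timeout (or 1 if that is 0) as the time left. */
static unsigned long wait_for_flip_model(const struct crtc_model *crtc,
					 unsigned long timeout)
{
	if (!crtc->flip_done)
		return 0;
	return timeout ? timeout : 1;
}

/* Mirrors the patched loop body: log the timeout, then give the driver a
 * chance to recover if it implements the hook. */
static void wait_for_flip_done_model(struct crtc_model *crtc)
{
	unsigned long ret;

	assert(crtc != NULL && crtc->funcs != NULL);
	ret = wait_for_flip_model(crtc, 10 * MODEL_HZ);
	if (!ret) {
		fprintf(stderr, "[CRTC:%d:%s] flip_done timed out\n",
			crtc->id, crtc->name);

		if (crtc->funcs->page_flip_timeout)
			crtc->funcs->page_flip_timeout(crtc);
	}
}

/* Hypothetical driver hook: a real one would try to reset the display
 * hardware; this one only counts how often it was called. */
static void example_page_flip_timeout(struct crtc_model *crtc)
{
	crtc->recoveries++;
}
```

If the flip completed, nothing is logged and the hook is not called. If it timed out and the hook is NULL, only the error line is printed, matching the helper's behavior before the patch, so drivers that do not implement page_flip_timeout see no change.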