[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <fd3f2b16-41e7-4a9b-999d-e137e5abb7df@amd.com>
Date: Mon, 26 Jan 2026 11:14:50 +0100
From: Christian König <christian.koenig@....com>
To: Timur Kristóf <timur.kristof@...il.com>,
Hamza Mahfooz <someguy@...ective-light.com>, dri-devel@...ts.freedesktop.org
Cc: Alex Deucher <alexander.deucher@....com>, David Airlie
<airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
Harry Wentland <harry.wentland@....com>, Leo Li <sunpeng.li@....com>,
Rodrigo Siqueira <siqueira@...lia.com>,
Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>,
Sunil Khatri <sunil.khatri@....com>, Ce Sun <cesun102@....com>,
Lijo Lazar <lijo.lazar@....com>, Kenneth Feng <kenneth.feng@....com>,
Ivan Lipski <ivan.lipski@....com>, Alex Hung <alex.hung@....com>,
Tom Chung <chiahsuan.chung@....com>, Melissa Wen <mwen@...lia.com>,
Michel Dänzer <mdaenzer@...hat.com>,
Fangzhi Zuo <Jerry.Zuo@....com>, amd-gfx@...ts.freedesktop.org,
linux-kernel@...r.kernel.org, Mario Limonciello <mario.limonciello@....com>
Subject: Re: [PATCH 1/2] drm: introduce page_flip_timeout()
On 1/23/26 15:44, Timur Kristóf wrote:
> On Friday, January 23, 2026 2:52:44 PM Central European Standard Time
> Christian König wrote:
>> On 1/23/26 01:05, Hamza Mahfooz wrote:
>>> There should be a mechanism for drivers to respond to flip_done
>>> time outs.
>>
>
> I am adding Harry and Mario to this email as they are more familiar with this.
>
>> I can only see two reasons why you could run into a timeout:
>>
>> 1. A dma_fence never signals.
>> How that should be handled is already well documented and doesn't
> require
>> any of this.
>
> Page flip timeouts have nothing to do with fence timeouts.
A page flip can be associated with a dma_fence as well. This is used for example on Android to immediately re-use the BO not scanned out any more.
That dma_fence has a well defined behavior, especially error handling and that it needs to signal in a finite amount of time even when driver doesn't confirm that the flip has completed.
The purpose of of the page flip timeout is now to guarantee that this defined behavior is actually fulfilled even when the driver does something bad.
> A page flip timeout can occur even when all fences of all job submissions
> complete correctly and on time.
But only when you have a major coding error in the driver. In other words a page flip can only timeout if there is no vblank or hblank signaling any more.
And that in turn means that somebody turned the CRTC off or hot plugged or we missed an (re-occuring) interrupt....
>> 2. A coding error in the vblank or page flip handler leading to waiting>> forever. In that case calling back into the driver doesn't help either.
>
> At the moment, a page flip timeout will leave the whole system in a hung state
> and the driver does not even attempt to recover it in any way, it just stops
> doing anything
Yeah and that is expected behavior. See there is no way the driver can recover from this on its own.
When this condition happens you can't take locks or allocate memory, otherwise you have the full potential to run into a very low level deadlock.
I'm not an expert for the atomic mode setting stuff but as far as I can see the only reasonable thing you can do is to set an error code on the page flip and return to userspace to signal that a flip has completed but with an error.
> which is unacceptable and I'm pretty surprised that it was
> left like that for so long.
> Note that we have approximately a hundred bug reports open on the drm/amd bug
> tracker about "random" page flip timeouts. It affects a lot of users.
Wow that is a boomer, I completely agree that sounds like a serious problem.
>> So as far as I can see the whole approach doesn't make any sense at all.
>
> Actually this approach was proposed as a solution at XDC 2025 in Harry's
> presentation, "DRM calls driver callback to attempt recovery", see page 9 in
> this slide deck:
>
> https://indico.freedesktop.org/event/10/contributions/431/attachments/
> 267/355/2025%20XDC%20Hackfest%20Update%20v1.2.pdf
>
> If you disagree with Harry, please make a counter-proposal.
Well I must have missed that detail otherwise I would have objected.
But looking at the slide Harry actually pointed out what immediately came to my mind as well, e.g. that the Compositor needs to issue a full modeset to re-program the CRTC.
We need to get the atomic mode setting experts on this and some suggesting on how to fix the error handling.
Regards,
Christian.
>
> Thanks,
> Timur
>
>
>
>>
>>> Since, as it stands it is possible for the display
>>> to stall indefinitely, necessitating a hard reset. So, introduce
>>> a new crtc callback that is called by
>>> drm_atomic_helper_wait_for_flip_done() to give drivers a shot
>>> at recovering from page flip timeouts.
>>>
>>> Signed-off-by: Hamza Mahfooz <someguy@...ective-light.com>
>>> ---
>>>
>>> drivers/gpu/drm/drm_atomic_helper.c | 6 +++++-
>>> include/drm/drm_crtc.h | 9 +++++++++
>>> 2 files changed, 14 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/drm_atomic_helper.c
>>> b/drivers/gpu/drm/drm_atomic_helper.c index 5840e9cc6f66..3a144c324b19
>>> 100644
>>> --- a/drivers/gpu/drm/drm_atomic_helper.c
>>> +++ b/drivers/gpu/drm/drm_atomic_helper.c
>>> @@ -1881,9 +1881,13 @@ void drm_atomic_helper_wait_for_flip_done(struct
>>> drm_device *dev,>
>>> continue;
>>>
>>> ret = wait_for_completion_timeout(&commit->flip_done, 10
> * HZ);
>>>
>>> - if (ret == 0)
>>> + if (!ret) {
>>>
>>> drm_err(dev, "[CRTC:%d:%s] flip_done timed
> out\n",
>>>
>>> crtc->base.id, crtc->name);
>>>
>>> +
>>> + if (crtc->funcs->page_flip_timeout)
>>> + crtc->funcs-
>> page_flip_timeout(crtc);
>>> + }
>>>
>>> }
>>>
>>> if (state->fake_commit)
>>>
>>> diff --git a/include/drm/drm_crtc.h b/include/drm/drm_crtc.h
>>> index 66278ffeebd6..45dc5a76e915 100644
>>> --- a/include/drm/drm_crtc.h
>>> +++ b/include/drm/drm_crtc.h
>>> @@ -609,6 +609,15 @@ struct drm_crtc_funcs {
>>>
>>> uint32_t flags, uint32_t target,
>>> struct drm_modeset_acquire_ctx
> *ctx);
>>>
>>> + /**
>>> + * @page_flip_timeout:
>>> + *
>>> + * This optional hook is called if &drm_crtc_commit.flip_done times
> out,
>>> + * and can be used by drivers to attempt to recover from a page
> flip
>>> + * timeout.
>>> + */
>>> + void (*page_flip_timeout)(struct drm_crtc *crtc);
>>> +
>>>
>>> /**
>>>
>>> * @set_property:
>>> *
>
>
>
>
Powered by blists - more mailing lists