linux-kernel - Re: [PATCH 1/2] drm: introduce page_flip


Message-ID: <2f9bc706-02d6-4dec-a56c-53abc5d43f46@amd.com>
Date: Thu, 29 Jan 2026 13:59:00 +0100
From: Christian König <christian.koenig@....com>
To: Timur Kristóf <timur.kristof@...il.com>,
 Alex Deucher <alexdeucher@...il.com>,
 Hamza Mahfooz <someguy@...ective-light.com>,
 Michel Dänzer <michel.daenzer@...lbox.org>
Cc: Mario Limonciello <mario.limonciello@....com>,
 dri-devel@...ts.freedesktop.org, Alex Deucher <alexander.deucher@....com>,
 David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
 Harry Wentland <harry.wentland@....com>, Leo Li <sunpeng.li@....com>,
 Rodrigo Siqueira <siqueira@...lia.com>,
 Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
 Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>,
 Sunil Khatri <sunil.khatri@....com>, Ce Sun <cesun102@....com>,
 Lijo Lazar <lijo.lazar@....com>, Kenneth Feng <kenneth.feng@....com>,
 Ivan Lipski <ivan.lipski@....com>, Alex Hung <alex.hung@....com>,
 Tom Chung <chiahsuan.chung@....com>, Melissa Wen <mwen@...lia.com>,
 Fangzhi Zuo <Jerry.Zuo@....com>, amd-gfx@...ts.freedesktop.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] drm: introduce page_flip_timeout()

On 1/29/26 13:06, Timur Kristóf wrote:
> On Thursday, January 29, 2026 12:38:30 PM Central European Standard Time 
> Christian König wrote:
>>>
>>> However, just like we do with ring timeouts, we also need to be prepared
>>> for the situation where a page flip timeout happens and we should try to
>>> recover from it. And if it isn't recoverable, fall back to GPU reset.
>>
>> No, that is clearly a bad idea.
> 
> I don't see why it's "clearly" a bad idea. It's not clear to me at all, please 
> clarify it for me.

GPU resets are necessary because we allow userspace to submit Turing-complete programs, which can mess up the HW state. We then need a reset to get back into a known working state (like the classic reset signal in electronics).

But in this case, when you see a frozen picture on the screen, that means the CRTC is still working, e.g. power is there, clocks are running, hblank and vblank are happening ... this doesn't look like a HW failure at all.

After the input from Michel I'm pretty sure that what we have here is just messed-up SW state, i.e. the DC/DM code has no fallback handling and not only misses the HW event but also blocks all further page flip requests from userspace, which would otherwise resolve the issue.

>> CRTCs are fixed-function devices; that a GPU
>> reset helps here is just pure coincidence.
> 
> Currently, the driver doesn't handle page flip timeouts at all, which means 
> that if it happens, there is 0% chance of recovering from it.

Yeah, and I completely agree that this is the absolute worst thing we can do.

> If the GPU reset improves that chance to non-zero, it's already an 
> improvement, and already more than what AMD did to address this problem for 
> the past few years. I just find it incredibly disrespectful towards the 
> community that AMD proposes a solution that they neglect to implement, then 
> when somebody from the community steps up to implement it, it's rejected.

Well, I've heard about this problem just a few days ago.

>> What we can certainly do is to improve the error handling, e.g. that the
>> system doesn't sit there forever after a page flip timeout.
> 
> Sure.
> 
>>
>> Let's maybe try a completely different approach. We force a page flip timeout,
>> and see if the system can handle that or not.
>>
>> E.g. every 300th page flip we just fail to signal and see if things still work
>> after the timeout.
> 
> How do you propose to do that?

I need to dig a bit into the DAL/DC code and see how the signaling path actually goes.

Going to give that a try tomorrow.
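
As a rough illustration of the fault-injection idea quoted above: this is a standalone userspace sketch, not DC/DM code. The struct, the function names and the "every Nth flip" counter are all hypothetical and only model the behaviour described in this thread: a dropped completion event either blocks every later flip, or is cleared by a timeout fallback so new flips are accepted again.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-CRTC flip state; not a real DRM or DC struct. */
struct flip_sim {
	unsigned int flips_submitted; /* flips accepted so far */
	unsigned int inject_every;    /* drop the completion of every Nth flip */
	unsigned int timeouts;        /* dropped completions that were recovered */
	bool flip_pending;            /* a flip still waits for its completion */
};

/* Accept a new flip unless an earlier one is still pending. */
static bool sim_submit_flip(struct flip_sim *s)
{
	if (s->flip_pending)
		return false;
	s->flips_submitted++;
	s->flip_pending = true;
	return true;
}

/* Model the HW event; returns true if the completion was signaled. */
static bool sim_hw_event(struct flip_sim *s)
{
	assert(s->inject_every > 0);
	if (!s->flip_pending)
		return false;
	if (s->flips_submitted % s->inject_every == 0)
		return false; /* injected fault: completion never arrives */
	s->flip_pending = false;
	return true;
}

/* Timeout fallback: give up on the stuck flip so new ones are accepted. */
static void sim_flip_timeout(struct flip_sim *s)
{
	if (!s->flip_pending)
		return;
	s->flip_pending = false;
	s->timeouts++;
}

/* Try n flips; returns how many were rejected because one was stuck. */
static unsigned int sim_run(struct flip_sim *s, unsigned int n,
			    bool handle_timeout)
{
	unsigned int rejected = 0;

	for (unsigned int i = 0; i < n; i++) {
		if (!sim_submit_flip(s)) {
			rejected++;
			continue;
		}
		if (!sim_hw_event(s) && handle_timeout)
			sim_flip_timeout(s);
	}
	return rejected;
}
```

Calling sim_run() with inject_every = 300 and handle_timeout = false leaves the model stuck after the 300th flip, with every later request rejected. With handle_timeout = true each injected failure is counted and later flips go through, which is the property the proposed test is meant to check on real hardware.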

Regards,
Christian.