Message-ID: <30f2480d-016f-417e-9ddf-7805e4943e7b@amd.com>
Date: Thu, 29 Jan 2026 12:38:30 +0100
From: Christian König <christian.koenig@....com>
To: Timur Kristóf <timur.kristof@...il.com>,
 Alex Deucher <alexdeucher@...il.com>,
 Hamza Mahfooz <someguy@...ective-light.com>,
 Michel Dänzer <michel.daenzer@...lbox.org>
Cc: Mario Limonciello <mario.limonciello@....com>,
 dri-devel@...ts.freedesktop.org, Alex Deucher <alexander.deucher@....com>,
 David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
 Harry Wentland <harry.wentland@....com>, Leo Li <sunpeng.li@....com>,
 Rodrigo Siqueira <siqueira@...lia.com>,
 Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
 Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>,
 Sunil Khatri <sunil.khatri@....com>, Ce Sun <cesun102@....com>,
 Lijo Lazar <lijo.lazar@....com>, Kenneth Feng <kenneth.feng@....com>,
 Ivan Lipski <ivan.lipski@....com>, Alex Hung <alex.hung@....com>,
 Tom Chung <chiahsuan.chung@....com>, Melissa Wen <mwen@...lia.com>,
 Fangzhi Zuo <Jerry.Zuo@....com>, amd-gfx@...ts.freedesktop.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] drm: introduce page_flip_timeout()

On 1/29/26 12:25, Timur Kristóf wrote:
> On Thursday, January 29, 2026 11:06:11 AM Central European Standard Time 
> Michel Dänzer wrote:
>>>>>
>>>>> Christian, why would the CRTC be turned off?
>>>>
>>>> Exactly that's the question we need to answer.
>>>>
>>>> But from what you describe the CRTC keeps on, just doesn't send any more
>>>> vblank events.
>>>
>>> The vblank interrupt source getting accidentally disabled might be one
>>> possible cause though.
>> Another possibility is that test-only commits with the
>> DRM_MODE_ATOMIC_TEST_ONLY flag (which can happen in parallel while the
>> kernel is processing a "real" commit) accidentally have side effects on the
>> current kernel state, resulting in the "real" commit failing to do
>> something it should. There have been bugs like that in the amdgpu DM code
>> before.
>>
>>
>> Anyway, this is all speculation. Somebody just needs to dig in and get to
>> the bottom of why the commits aren't getting completed.
> 
> Yes, I agree.
> 
> However, just like we do with ring timeouts, we also need to be prepared for 
> the situation where a page flip timeout happens and we should try to recover 
> from it. And if it isn't recoverable, fall back to GPU reset.

No, that is clearly a bad idea. CRTCs are fixed-function devices; that a GPU reset helps here is pure coincidence.

What we can certainly do is improve the error handling, e.g. so that the system doesn't sit there forever after a page flip timeout.

> I strongly suspect that there are many different issues depending on the 
> hardware generation and display configuration. There isn't going to be a silver 
> bullet to fix all of them, and in case it cannot be fixed, I think a GPU reset 
> is the right thing to do - it's drastic, but still better than letting the 
> machine just freeze irrecoverably.
> 
> One example of such a bug was fixed by 6cbe6e072c5d where DC was trying to use 
> an interrupt that didn't exist on some hardware. This type of bug would be 
> impossible for userspace to solve in any way, but a GPU reset would have 
> helped to recover the machine into a usable state.
> 
> Another example would be Strix Halo with adaptive sync and/or tearing updates 
> enabled, which 100% reproduces a page flip timeout for me. I haven't had time 
> to investigate that one just yet.

Let's maybe try a completely different approach: we force a page flip timeout and see whether the system can handle it.

E.g. on every 300th page flip we just fail to signal and see if things still work after the timeout.
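
As a rough illustration of that kind of fault injection, here is a minimal userspace sketch in C. It is not driver code: struct flip_fault_injector, flip_fault_init(), flip_fault_should_signal() and FLIP_FAULT_INTERVAL are hypothetical names made up for this example, and where such a check would sit in a real flip-completion path is left open.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical interval, taken from the "every 300th page flip" suggestion. */
#define FLIP_FAULT_INTERVAL 300u

/* Hypothetical per-CRTC bookkeeping; not an existing DRM structure. */
struct flip_fault_injector {
	uint32_t interval;	/* withhold every interval-th completion; 0 disables */
	uint32_t count;		/* flips seen since the last withheld completion */
	uint64_t dropped;	/* total completions deliberately withheld */
};

static void flip_fault_init(struct flip_fault_injector *fi, uint32_t interval)
{
	fi->interval = interval;
	fi->count = 0;
	fi->dropped = 0;
}

/*
 * Call once per page flip. Returns true if the completion should be
 * signaled as usual, false if it should be withheld so that the waiter
 * runs into its timeout.
 */
static bool flip_fault_should_signal(struct flip_fault_injector *fi)
{
	if (fi->interval == 0)
		return true;

	if (++fi->count < fi->interval)
		return true;

	/* This is the interval-th flip: restart the count and withhold it. */
	fi->count = 0;
	fi->dropped++;
	return false;
}
```

With the interval set to FLIP_FAULT_INTERVAL, flips 1 through 299 are signaled, flip 300 is withheld, and counting starts over. The dropped counter could then be compared against the number of timeouts actually observed, to check that each injected failure was noticed and handled.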

Regards,
Christian.

> Timur
> 
> 

