Message-ID: <CAFZQkGxFhbVAf-S98r_27NKtezdUiNtaA=cd7ATsVcX5iRManw@mail.gmail.com>
Date: Thu, 29 Jan 2026 22:39:10 +0100
From: Xaver Hugl <xaver.hugl@....org>
To: Christian König <christian.koenig@....com>
Cc: Michel Dänzer <michel.daenzer@...lbox.org>, 
	Timur Kristóf <timur.kristof@...il.com>, 
	Hamza Mahfooz <someguy@...ective-light.com>, dri-devel@...ts.freedesktop.org, 
	Alex Deucher <alexander.deucher@....com>, David Airlie <airlied@...il.com>, 
	Simona Vetter <simona@...ll.ch>, Harry Wentland <harry.wentland@....com>, Leo Li <sunpeng.li@....com>, 
	Rodrigo Siqueira <siqueira@...lia.com>, Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>, 
	Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>, 
	Sunil Khatri <sunil.khatri@....com>, Ce Sun <cesun102@....com>, Lijo Lazar <lijo.lazar@....com>, 
	Kenneth Feng <kenneth.feng@....com>, Ivan Lipski <ivan.lipski@....com>, 
	Alex Hung <alex.hung@....com>, Tom Chung <chiahsuan.chung@....com>, 
	Melissa Wen <mwen@...lia.com>, Michel Dänzer <mdaenzer@...hat.com>, 
	Fangzhi Zuo <Jerry.Zuo@....com>, amd-gfx@...ts.freedesktop.org, 
	linux-kernel@...r.kernel.org, Mario Limonciello <mario.limonciello@....com>
Subject: Re: [PATCH 1/2] drm: introduce page_flip_timeout()

> Then second even if the kernel can do it I'm not sure if it should do it.
>
> I mean userspace asked for a quick page flip and not some expensive CRTC/PLL reprogramming. Stuff like that usually takes some time and by then the frame which should be displayed by the page flip might already be stale and it would be better to tell userspace that we couldn't display it on time and wait for a new frame to be generated.

I would personally prefer a new "pageflip failed" event, which the
compositor can react to appropriately.
For compositors not opting into that new API, the kernel automatically
fixing things would be nice though. Even pretending the pageflip
completed and then failing the next one with EINVAL would be enough to
trigger a modeset in the case of KWin.
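
For illustration only, the fallback described above could be sketched roughly
like this. This is not KWin's actual code: `present_frame` and the `commit`
callback are hypothetical names, and the callback merely stands in for an
atomic commit that returns 0 or an errno value.

```python
import errno
from typing import Callable

# Hypothetical commit callback: takes allow_modeset and returns 0 on success
# or a positive errno value on failure. It stands in for an atomic commit and
# is not a real libdrm binding.
Commit = Callable[[bool], int]


def present_frame(commit: Commit) -> str:
    """Try a plain page flip first; fall back to a full modeset on EINVAL."""
    err = commit(False)  # page flip only, no modeset allowed
    if err == 0:
        return "flipped"
    if err == errno.EINVAL:
        # The flip was rejected outright, so retry with a modeset allowed.
        err = commit(True)
        return "modeset" if err == 0 else "failed"
    if err == errno.EBUSY:
        # An earlier flip is still considered pending; try again later.
        return "retry-later"
    return "failed"
```

With a fallback of this shape, a kernel that fails the next flip with EINVAL
pushes the compositor onto the modeset path without any new API.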

> And third, there must be a root cause of the page flip not completing.
>
> My educated guess is that we have some atomic property change or even turning the CRTC off in parallel with the page flip. I mean HW rarely turns off its reoccurring vblank interrupt on its own.
>
> Returning an error to userspace might actually help identify the root cause.

I know of two things that frequently trigger pageflip timeouts.

On dedicated GPUs, most of them seem to happen when a game causes a GPU reset.
In some cases, it seemed like the pageflip did complete, but the
kernel never sent the pageflip event to userspace. It also rejected
new atomic commits from the compositor with EBUSY - but a new instance
of the compositor could still do atomic commits just fine.
In other cases, triggering another GPU reset was necessary to recover,
and in yet other cases it was just broken beyond repair.
Presumably, all of them are caused by bugs in the GPU reset sequence.
As another piece of information on that, KWin only does atomic commits
once the fences of the involved buffers are signaled, and it does not
use OUT_FENCE_FD. So fence signaling should not be relevant, at least
not on the KMS uAPI level.

On APUs, the vast majority are caused by PSR. I know many AMD laptop
users who run with "amdgpu.dcdebugmask=0x10" to disable PSR as a
workaround, and have never seen a pageflip timeout since setting that
option. I have even seen a freeze *without* a pageflip timeout while
testing PSR again on my laptop recently, so PSR seems to have even
bigger issues.
Pageflip timeout or not, manually triggering a GPU reset seems to be a
reliable way to recover from it.
IMO that one is bad and widespread enough that PSR should be disabled
by default on the relevant AMD hardware until it no longer causes such
problems.
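
As a small aid for checking whether that workaround is active, the mask can be
tested for the 0x10 bit. This is a sketch under assumptions: the sysfs path is
assumed to be where the module parameter is exposed, and the function and
constant names are made up for this example.

```python
# Bit taken from the "amdgpu.dcdebugmask=0x10" workaround mentioned above.
PSR_DISABLE_BIT = 0x10


def psr_disabled(mask_text: str) -> bool:
    """Return True if the PSR-disable bit is set in a dcdebugmask value."""
    mask = int(mask_text.strip(), 0)  # accepts "16" as well as "0x10"
    return bool(mask & PSR_DISABLE_BIT)


def read_dcdebugmask(
    path: str = "/sys/module/amdgpu/parameters/dcdebugmask",
) -> str:
    # Assumed sysfs location of the module parameter; adjust if it differs.
    with open(path) as f:
        return f.read()
```

On a machine with amdgpu loaded, `psr_disabled(read_dcdebugmask())` would
report whether the workaround is in effect, provided the path assumption holds.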

On the topic of whether or not this is just a thing the driver has to
fix, this isn't as exclusive to amdgpu as it might seem. i915 has some
pageflip timeout issues with apparently still unknown causes
(https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14604), and
the proprietary Nvidia driver had one some time ago that IIRC was
caused by firmware bugs.

Obviously, drivers still need to be fixed, but the bug is bad enough
for the end user that a fallback would be very helpful. If userspace
gets notified about it, we can still direct users to the relevant bug
trackers to get the underlying bugs hopefully fixed either way.

- Xaver
