linux-kernel - Re: [PATCH 00/12] Recover sysfb after DRM probe failure

Message-ID: <97993761-5884-4ada-b345-9fb64819e02a@suse.de>
Date: Thu, 15 Jan 2026 12:02:33 +0100
From: Thomas Zimmermann <tzimmermann@...e.de>
To: Zack Rusin <zack.rusin@...adcom.com>
Cc: dri-devel@...ts.freedesktop.org, Alex Deucher
 <alexander.deucher@....com>, amd-gfx@...ts.freedesktop.org,
 Ard Biesheuvel <ardb@...nel.org>, Ce Sun <cesun102@....com>,
 Chia-I Wu <olvaffe@...il.com>, Christian König
 <christian.koenig@....com>, Danilo Krummrich <dakr@...nel.org>,
 Dave Airlie <airlied@...hat.com>, Deepak Rawat <drawat.floss@...il.com>,
 Dmitry Osipenko <dmitry.osipenko@...labora.com>,
 Gerd Hoffmann <kraxel@...hat.com>,
 Gurchetan Singh <gurchetansingh@...omium.org>,
 Hans de Goede <hansg@...nel.org>, Hawking Zhang <Hawking.Zhang@....com>,
 Helge Deller <deller@....de>, intel-gfx@...ts.freedesktop.org,
 intel-xe@...ts.freedesktop.org, Jani Nikula <jani.nikula@...ux.intel.com>,
 Javier Martinez Canillas <javierm@...hat.com>,
 Jocelyn Falempe <jfalempe@...hat.com>,
 Joonas Lahtinen <joonas.lahtinen@...ux.intel.com>,
 Lijo Lazar <lijo.lazar@....com>, linux-efi@...r.kernel.org,
 linux-fbdev@...r.kernel.org, linux-hyperv@...r.kernel.org,
 linux-kernel@...r.kernel.org, Lucas De Marchi <lucas.demarchi@...el.com>,
 Lyude Paul <lyude@...hat.com>,
 Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
 "Mario Limonciello (AMD)" <superm1@...nel.org>,
 Mario Limonciello <mario.limonciello@....com>,
 Maxime Ripard <mripard@...nel.org>, nouveau@...ts.freedesktop.org,
 Rodrigo Vivi <rodrigo.vivi@...el.com>, Simona Vetter <simona@...ll.ch>,
 spice-devel@...ts.freedesktop.org,
 Thomas Hellström <thomas.hellstrom@...ux.intel.com>,
 Timur Kristóf <timur.kristof@...il.com>,
 Tvrtko Ursulin <tursulin@...ulin.net>, virtualization@...ts.linux.dev,
 Vitaly Prosyak <vitaly.prosyak@....com>
Subject: Re: [PATCH 00/12] Recover sysfb after DRM probe failure

Hi,

apologies for the delay. I wanted to reply and then forgot about it.

Am 10.01.26 um 05:52 schrieb Zack Rusin:
> On Fri, Jan 9, 2026 at 5:34 AM Thomas Zimmermann <tzimmermann@...e.de> wrote:
>> Hi
>>
>> Am 29.12.25 um 22:58 schrieb Zack Rusin:
>>> Almost a rite of passage for every DRM developer and most Linux users
>>> is upgrading your DRM driver/updating boot flags/changing some config
>>> and having DRM driver fail at probe resulting in a blank screen.
>>>
>>> Currently there's no way to recover from DRM driver probe failure. PCI
>>> DRM driver explicitly throw out the existing sysfb to get exclusive
>>> access to PCI resources so if the probe fails the system is left without
>>> a functioning display driver.
>>>
>>> Add code to sysfb to recover the system framebuffer when a DRM driver's
>>> probe fails. This means that a DRM driver that fails to load reloads the
>>> system framebuffer driver.
>>>
>>> This works best with simpledrm. Without it Xorg won't recover because
>>> it still tries to load the vendor specific driver which ends up usually
>>> not working at all. With simpledrm the system recovers really nicely
>>> ending up with a working console and not a blank screen.
>>>
>>> There's a caveat in that some hardware might require some special magic
>>> register write to recover EFI display. I'd appreciate it a lot if
>>> maintainers could introduce a temporary failure in their drivers
>>> probe to validate that the sysfb recovers and they get a working console.
>>> The easiest way to double check it is by adding:
>>>    /* XXX: Temporary failure to test sysfb restore - REMOVE BEFORE COMMIT */
>>>    dev_info(&pdev->dev, "Testing sysfb restore: forcing probe failure\n");
>>>    ret = -EINVAL;
>>>    goto out_error;
>>> or such right after the devm_aperture_remove_conflicting_pci_devices .
>> Recovering the display like that is guess work and will at best work
>> with simple discrete devices where the framebuffer is always located in
>> a confined graphics aperture.
>>
>> But the problem you're trying to solve is a real one.
>>
>> What we'd want to do instead is to take the initial hardware state into
>> account when we do the initial mode-setting operation.
>>
>> The first step is to move each driver's remove_conflicting_devices call
>> to the latest possible location in the probe function. We usually do it
>> first, because that's easy. But on most hardware, it could happen much
>> later.
> Well, some drivers (vbox, vmwgfx, bochs and cirrus-qemu) do it because
> they request pci regions which is going to fail otherwise. Because
> grabbing the pci resources is in general the very first thing that
> those drivers need to do to setup anything, we
> remove_conflicting_devices first or at least very early.

To my knowledge, requesting resources is more about correctness than a 
hard requirement to use an I/O or memory range. Has this changed?


>
> I also don't think it's possible or even desirable by some drivers to
> reuse the initial state, good example here is vmwgfx where by default
> some people will setup their vm's with e.g. 8mb ram, when the vmwgfx
> loads we allow scanning out from system memory, so you can set your vm
> up with 8mb of vram but still use 4k resolutions when the driver
> loads, this way the suspend size of the vm is very predictable (tiny
> vram plus whatever ram was setup) while still allowing a lot of
> flexibility.

If there's no initial state to switch from, the first modeset can fail 
while leaving the display unusable. There's no way around that. Going 
back to the old state is not an option unless the driver has been 
written to support this.

The case of vmwgfx is special, but does not affect the overall problem. 
For vmwgfx, it would be best to import that initial state and support a 
transparent modeset from vram to system memory (and back) at least 
during this initial state.


>
> In general I think however this is planned it's two or three separate series:
> 1) infrastructure to reload the sysfb driver (what this series is)
> 2) making sure that drivers that do want to recover cleanly actually
> clean out all the state on exit properly,
> 3) abstracting at least some of that cleanup in some driver independent way

That's really not going to work. For example, in the current series, you
invoke devm_aperture_remove_conflicting_pci_devices_done() after
drm_mode_reset(), drm_dev_register() and drm_client_setup(). Each of
these calls can modify hardware state. In the case of _register() and
_setup(), the DRM clients can perform a modeset, which destroys the
initial hardware state.

Patch 1 of this series removes the sysfb device/driver entirely. That
should be a no-go as it significantly complicates recovery. For example,
if the native driver failed because of an allocation failure, the sysfb
device/driver is not likely to come back either.

As the very first thing, the series should state which failures it is
going to resolve,

 - failed hardware init,
 - invalid initial modesetting,
 - runtime errors (such as ENOMEM, failed firmware loading),
 - others?

And then specify how a recovery to sysfb could look in each supported
scenario.

In terms of implementation, make any transition between drivers gradual.
The native driver needs to acquire the hardware resource (framebuffer
and I/O apertures) without unloading the sysfb driver. Luckily there's
struct drm_device.unplug, which does that. [1] Flipping this field
disables hardware access for DRM drivers. All sysfb drivers support
this.

To get the sysfb drivers ready, I suggest dedicated helpers for each
driver's aperture. The aperture helpers can use these callbacks to flip
the DRM driver off and on again. For example, efidrm could do this as a
minimum:

    int efidrm_aperture_suspend()
    {
        dev->unplug = true;
        remove_resource(/*framebuffer aperture*/);

        return 0;
    }

    int efidrm_aperture_resume()
    {
        insert_resource(/*framebuffer aperture*/);
        dev->unplug = false;

        return 0;
    }

    struct aperture_funcs efidrm_aperture_funcs = {
        .suspend = efidrm_aperture_suspend,
        .resume = efidrm_aperture_resume,
    };

Pass this struct when efidrm acquires the framebuffer aperture, so that
the aperture helpers can control the behavior of efidrm.

With this, a multi-step takeover from sysfb to native driver can be
tried. It's still a massive effort that requires an audit of each
driver's probing logic. There's no copy-paste pattern AFAICT. I suggest
picking one simple driver first and making a prototype.

Let me also say that I DO like the general idea you're proposing. But if
it was easy, we would likely have done it already.

Best regards
Thomas
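
To make the suspend/resume idea above a bit more concrete, here is a
minimal user-space model of the control flow. It is only a sketch and not
kernel code: struct aperture_funcs follows the proposal in this mail and
does not exist in the tree, and sysfb_model, takeover() and the two
native_probe_*() stand-ins are hypothetical names made up for the
illustration. The two booleans merely mimic the unplug flag and ownership
of the framebuffer aperture.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callbacks, modeled on the proposal in this mail. */
struct aperture_funcs {
	int (*suspend)(void *data);
	int (*resume)(void *data);
};

/* Hypothetical stand-in for a sysfb driver such as efidrm. */
struct sysfb_model {
	bool unplugged;     /* mimics the unplug flag: no hardware access */
	bool owns_aperture; /* mimics the inserted framebuffer resource */
};

static int sysfb_model_suspend(void *data)
{
	struct sysfb_model *fb = data;

	fb->unplugged = true;      /* stop touching the hardware first */
	fb->owns_aperture = false; /* then give up the aperture */
	return 0;
}

static int sysfb_model_resume(void *data)
{
	struct sysfb_model *fb = data;

	fb->owns_aperture = true;  /* re-acquire the aperture first */
	fb->unplugged = false;     /* then allow hardware access again */
	return 0;
}

static const struct aperture_funcs sysfb_model_funcs = {
	.suspend = sysfb_model_suspend,
	.resume = sysfb_model_resume,
};

/* Stand-ins for a native driver's probe that fails or succeeds. */
static int native_probe_fail(void) { return -EINVAL; }
static int native_probe_ok(void) { return 0; }

/*
 * Suspend the sysfb driver, try the native probe, and hand the
 * aperture back to sysfb if the probe fails.
 */
static int takeover(const struct aperture_funcs *funcs, void *data,
		    int (*native_probe)(void))
{
	int ret;

	assert(funcs != NULL && funcs->suspend != NULL && funcs->resume != NULL);

	ret = funcs->suspend(data);
	if (ret)
		return ret;

	ret = native_probe();
	if (ret) {
		funcs->resume(data); /* sysfb was never unloaded */
		return ret;
	}

	return 0;
}
```

Calling takeover() with native_probe_fail leaves the model back in its
initial state (aperture owned, not unplugged), while native_probe_ok
leaves it suspended. The model assumes the native probe has not yet
modified hardware state when it fails, which is exactly the part that
needs the per-driver audit.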
>
> z

-- 
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)