linux-kernel - Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <0f4776c0eac3c004d36677377525662d75752ebd.camel@linux.ibm.com>
Date: Mon, 20 Oct 2025 10:59:48 +0200
From: Niklas Schnelle <schnelle@...ux.ibm.com>
To: Lukas Wunner <lukas@...ner.de>
Cc: Farhan Ali <alifm@...ux.ibm.com>, Benjamin Block <bblock@...ux.ibm.com>,
        linux-s390@...r.kernel.org, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org,
        alex.williamson@...hat.com, helgaas@...nel.org, clg@...hat.com,
        mjrosato@...ux.ibm.com
Subject: Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space

On Sun, 2025-10-19 at 16:34 +0200, Lukas Wunner wrote:
> On Tue, Oct 14, 2025 at 02:07:57PM +0200, Niklas Schnelle wrote:
> > On Sun, 2025-10-12 at 08:34 +0200, Lukas Wunner wrote:
> > > If you do want to stick with your alternative approach,
> > > maybe doing the error handling in the ->mmio_enabled() phase
> > > instead of ->error_detected() would make more sense.
> > > In that phase you're allowed to access the device,
> > > you can also attempt a local reset and return
> > > PCI_ERS_RESULT_RECOVERED on success.
> > > 
> > > You'd have to return PCI_ERS_RESULT_CAN_RECOVER though
> > > from the ->error_detected() callback in order to progress
> > > to the ->mmio_enabled() step.
> > 
> > The problem with using ->mmio_enabled() is two fold. For one we
> > sometimes have to do a reset instead of clearing the error state, for
> > example if the device was not only put in the error state but also
> > disabled, or if the guest driver wants it,
> 
> Well in that case you could reset the device in the ->mmio_enabled() step
> from the guest using the vfio reset ioctl.
> 
> > Second and more
> > importantly this would break the guests assumption that the device will
> > be in the error state with MMIO and DMA blocked when it gets an error
> > event. On the other hand, that's exactly the state it is in if we
> > report the error in the ->error_detected() callback
> 
> At the risk of continuously talking past each other:
> 
> How about this, the host notifies the guest of the error in the
> ->error_detected() callback.  The guest notifies the driver and
> collects the result (whether a reset is requested or not), then
> returns PCI_ERS_RESULT_CAN_RECOVER to the host.
> 
> The host re-enables I/O to the device, invokes the ->mmio_detected()
> callback.  The guest then resets the device based on the result it
> collected earlier or invokes the driver's ->mmio_enabled() callback.
> 
> If the driver returns PCI_ERS_RESULT_NEED_RESET from the
> ->mmio_enabled() callback, you can likewise reset the device from
> the guest using the ioctl method.
> 
> My concern is that by insisting that you handle device recovery
> completely in the ->error_detected() phase, you're not complying
> with the protocol specified in Documentation/PCI/pci-error-recovery.rst
> and as a result, you have to amend the reset code in the PCI core
> because it assumes that all arches adheres to the protocol.
> In my view, that suggests that the approach needs to be reworked
> to comply with the protocol.  Then the workarounds for performing
> a reset while I/O is blocked become unnecessary.
> 
> Thanks,
> 
> Lukas

Yeah I think we're talking past each other a bit. In my mind we're
really not doing the recovery in ->error_detected() at all. Within that
callback we only do the notify, and then do nothing in the rest of
recovery. Only after will the guest do recovery though I do see your
point that leaving the device in the error state kind of means that
recovery is still ongoing even if we're not in the recovery handler
anymore. But then any driver could also just return
PCI_ERS_RESULT_RECOVERED in error_detected() and land us in the same
situation. And at least on s390 there is also the case that the device
actually enters the error state before we get the error event so we
could already race into a situation where the guest does a reset and
the device is already in the error state but we haven't seen the event
yet. And this seems inherent to hardware blocking I/O on error which by
it's nature can happen at any time.

But let's put that aside, say we want to implement your model where we
do check with the guest and its device driver. How would that work,
somehow error_detected() would have to wait for the guest to proceed
into recovery and since the guest could just not do that we'd have to
have some kind of timeout. Also we can't allow the guest to choose
PCI_ERS_RESULT_RECOVERED because otherwise we'd again be in the
situation where recovery is completed without unblocking I/O. And if we
want to stick to the architecture QEMU/KVM will have to kind of have a
mode where after being informed of ongoing recovery for a device they
intercept attempts to reset / firmware calls for reset and turn that
into the correct return. And somehow also deal with the timeout because
e.g. old Linux guests won't do recovery but there is also no
architected way for a guest to say that it does recovery.

To me this seems worse than accepting that even with PCI error
recovery,  resets can encounter PCI devices with blocked I/O which is
already true now if e.g. user-space happens to drive a reset racing
with hardware I/O blocking.

Thanks,
Niklas