linux-kernel - Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space


Message-ID: <bb59edee909ceb09527cedec10896d45126f0027.camel@linux.ibm.com>
Date: Tue, 14 Oct 2025 14:07:57 +0200
From: Niklas Schnelle <schnelle@...ux.ibm.com>
To: Lukas Wunner <lukas@...ner.de>
Cc: Farhan Ali <alifm@...ux.ibm.com>, Benjamin Block <bblock@...ux.ibm.com>,
        linux-s390@...r.kernel.org, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org,
        alex.williamson@...hat.com, helgaas@...nel.org, clg@...hat.com,
        mjrosato@...ux.ibm.com
Subject: Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space

On Sun, 2025-10-12 at 08:34 +0200, Lukas Wunner wrote:
> On Thu, Oct 09, 2025 at 11:12:03AM +0200, Niklas Schnelle wrote:
> > On Wed, 2025-10-08 at 20:14 +0200, Lukas Wunner wrote:
> > > And yet you're touching the device by trying to reset it.
> > > 
> > > The code you're introducing in patch [01/10] only becomes necessary
> > > because you're not following the above-quoted protocol.  If you
> > > follow the protocol, patch [01/10] becomes unnecessary.
> > 
> > I agree with your point above error_detected() should not touch the
> > device. My understanding of Farhan's series though is that it follows
> > that rule. As I understand it error_detected() is only used to inject
> > the s390 specific PCI error event into the VM using the information
> > stored in patch 7. As before vfio-pci returns
> > PCI_ERS_RESULT_CAN_RECOVER from error_detected() but then with patch 7
> > the pass-through case is detected and this gets turned into
> > PCI_ERS_RESULT_RECOVERED and the rest of the s390 recovery code gets
> > skipped. And yeah, writing it down I'm not super happy with this part,
> > maybe it would be better to have an explicit
> > PCI_ERS_RESULT_LEAVE_AS_IS.
> 
> Thanks, that's the high-level overview I was looking for.
> 
> It would be good to include something like this at least
> in the cover letter or additionally in the commit messages
> so that it's easier for reviewers to connect the dots.
> 
> I was expecting paravirtualized error handling, i.e. the
> VM is aware it's virtualized and vfio essentially proxies
> the pci_ers_result return value of the driver (e.g. nvme)
> back to the host, thereby allowing the host to drive error
> recovery normally.  I'm not sure if there are technical
> reasons preventing such an approach.

It does sound technically feasible, but sticking to the already
architected error reporting and recovery has clear advantages. For one,
it will work with existing Linux versions without guest changes, and it
also has precedent: it has been working in the z/VM hypervisor for
years. I agree that there is some level of mismatch with Linux's
recovery support, but I don't think that outweighs having clean
virtualization support where the host and guest use the same interface.

> 
> If you do want to stick with your alternative approach,
> maybe doing the error handling in the ->mmio_enabled() phase
> instead of ->error_detected() would make more sense.
> In that phase you're allowed to access the device,
> you can also attempt a local reset and return
> PCI_ERS_RESULT_RECOVERED on success.
> 
> You'd have to return PCI_ERS_RESULT_CAN_RECOVER though
> from the ->error_detected() callback in order to progress
> to the ->mmio_enabled() step.
> 
> Does that make sense?
> 
> Thanks,
> 
> Lukas

The problem with using ->mmio_enabled() is twofold. First, we
sometimes have to do a reset instead of clearing the error state, for
example if the device was not only put in the error state but also
disabled, or if the guest driver wants it, so we would also have to use
->slot_reset() and could end up with two resets. Second, and more
importantly, this would break the guest's assumption that the device will
be in the error state with MMIO and DMA blocked when it gets an error
event. On the other hand, that's exactly the state it is in if we
report the error in the ->error_detected() callback and then skip the
rest of recovery so it can be done in the guest, likely with the exact
same Linux code. I'd assume this should be similar if QEMU/KVM wanted
to virtualize AER+DPC, except that there MMIO remains accessible?
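
To make the ordering concern concrete, here is a rough userspace sketch
of how the phases chain together. It is a toy model under my own
assumptions, not the kernel's recovery code: the toy_* names, the trace
counters and the example callbacks are all made up for illustration, and
only the phase names (error_detected, mmio_enabled, slot_reset) come
from the discussion above.

```c
/* Toy stand-ins for pci_ers_result_t values; not the kernel definitions. */
enum toy_result {
    TOY_CAN_RECOVER,
    TOY_NEED_RESET,
    TOY_RECOVERED,
    TOY_DISCONNECT
};

/* Counts how often each later phase ran. */
struct toy_trace {
    int mmio_enabled_calls;
    int slot_reset_calls;
};

/* Hypothetical driver callbacks, each reporting what it wants next. */
struct toy_driver {
    enum toy_result (*error_detected)(void);
    enum toy_result (*mmio_enabled)(void);
    enum toy_result (*slot_reset)(void);
};

/* Simplified sequencing: each phase's result decides whether the next runs. */
static enum toy_result toy_recover(const struct toy_driver *drv,
                                   struct toy_trace *trace)
{
    enum toy_result status = drv->error_detected();

    if (status == TOY_CAN_RECOVER) {
        trace->mmio_enabled_calls++;
        status = drv->mmio_enabled();
    }
    if (status == TOY_NEED_RESET) {
        trace->slot_reset_calls++;
        status = drv->slot_reset();
    }
    return status;
}

/* Example callbacks (illustrative only). */
static enum toy_result detect_can_recover(void) { return TOY_CAN_RECOVER; }
static enum toy_result detect_recovered(void)   { return TOY_RECOVERED; }
static enum toy_result mmio_needs_reset(void)   { return TOY_NEED_RESET; }
static enum toy_result reset_ok(void)           { return TOY_RECOVERED; }
```

With detect_can_recover and mmio_needs_reset, the model runs both the
mmio_enabled and the slot_reset phase; if a local reset had already been
attempted in mmio_enabled, that is where a second reset could come from.
With detect_recovered, neither later phase runs.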

Here's an idea. Could it be an option to detect the pass-through in the
vfio-pci driver's ->error_detected() callback, possibly with feedback
from QEMU (@Alex?), and then return PCI_ERS_RESULT_RECOVERED from there,
skipping the rest of recovery?

The skipping would be in line with the section of the documentation
below, i.e. "no further intervention":

  - PCI_ERS_RESULT_RECOVERED
      Driver returns this if it thinks the device is usable despite
      the error and does not need further intervention.

It's just that in this case the device really remains with MMIO and DMA
blocked. It is usable only in the sense that the vfio-pci + guest VM
combination knows how to handle a device in that state by running the
recovery in the guest.
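
A minimal sketch of what I have in mind, written as plain userspace C so
it can be compiled on its own. Everything here is hypothetical: the
sketch_* names, the passed_through flag and the event counter are
placeholders I made up, not existing vfio-pci fields or the kernel's
pci_ers_result_t definitions.

```c
#include <stdbool.h>

/* Placeholder result codes mirroring the two values discussed above. */
enum sketch_ers_result {
    SKETCH_CAN_RECOVER,
    SKETCH_RECOVERED
};

/* Hypothetical per-device state as seen by the callback. */
struct sketch_vfio_dev {
    bool passed_through;    /* assumed: device is assigned to a guest */
    bool mmio_dma_blocked;  /* error state set by the platform */
    int  events_forwarded;  /* error events injected into the guest */
};

/*
 * Sketch of the proposed decision: for a passed-through device, forward
 * the error event and report RECOVERED so the host skips the remaining
 * recovery steps. The device itself is not touched, so it stays in the
 * blocked state the guest expects.
 */
static enum sketch_ers_result
sketch_error_detected(struct sketch_vfio_dev *dev)
{
    if (dev->passed_through) {
        dev->events_forwarded++;
        return SKETCH_RECOVERED;
    }
    /* Host-owned device: keep the existing behaviour. */
    return SKETCH_CAN_RECOVER;
}
```

Note that the callback never changes mmio_dma_blocked; in this model,
clearing the error state is left entirely to the guest.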

Thanks,
Niklas