Message-ID: <21ef5524-738a-43d5-bc9a-87f907a8aa70@linux.ibm.com>
Date: Wed, 8 Oct 2025 10:56:35 -0700
From: Farhan Ali <alifm@...ux.ibm.com>
To: Lukas Wunner <lukas@...ner.de>
Cc: Benjamin Block <bblock@...ux.ibm.com>, linux-s390@...r.kernel.org,
        kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-pci@...r.kernel.org, alex.williamson@...hat.com,
        helgaas@...nel.org, clg@...hat.com, schnelle@...ux.ibm.com,
        mjrosato@...ux.ibm.com
Subject: Re: [PATCH v4 01/10] PCI: Avoid saving error values for config space


On 10/8/2025 6:34 AM, Lukas Wunner wrote:
> On Mon, Oct 06, 2025 at 02:35:49PM -0700, Farhan Ali wrote:
>> On 10/6/2025 12:26 PM, Lukas Wunner wrote:
>>> On Mon, Oct 06, 2025 at 10:54:51AM -0700, Farhan Ali wrote:
>>>> On 10/4/2025 7:54 AM, Lukas Wunner wrote:
>>>>> I believe this also makes patch [01/10] in your series unnecessary.
>>>> I tested your patches + patches 2-10 of this series. It unfortunately didn't
>>>> completely help with the s390x use case. We still need the check in
>>>> pci_save_state() from this patch to make sure we are not saving error
>>>> values, which can be written back to the device in pci_restore_state().
>>> What's the caller of pci_save_state() that needs this?
>>>
>>> Can you move the check for PCI_POSSIBLE_ERROR() to the caller?
>>> I think plenty of other callers don't need this, so it adds
>>> extra overhead for them and down the road it'll be difficult
>>> to untangle which caller needs it and which doesn't.
>> The caller would be pci_dev_save_and_disable(). Are you suggesting moving
>> the PCI_POSSIBLE_ERROR() prior to calling pci_save_state()?
> I'm not sure yet.  Let's back up a little:  I'm missing an
> architectural description how you're intending to do error
> recovery in the VM.  If I understand correctly, you're
> informing the VM of the error via the ->error_detected() callback.
>
> You're saying you need to check for accessibility of the device
> prior to resetting it from the VM, does that mean you're attempting
> a reset from the ->error_detected() callback?
>
> According to Documentation/PCI/pci-error-recovery.rst, the device
> isn't supposed to be considered accessible in ->error_detected().
> The first callback which allows access is ->mmio_enabled().
>
> I also don't quite understand why the VM needs to perform a reset.
> Why can't you just let the VM tell the host that a reset is needed
> (PCI_ERS_RESULT_NEED_RESET) and then the host resets the device on
> behalf of the VM?

The ->error_detected() callback is used to inform userspace of an error.
In the case of a VM, with QEMU as the userspace, once notified of an
error QEMU will inject an error into the guest in an s390x
architecture-specific way [1] (I probably should have linked the QEMU
series in the cover letter). Once notified of the error, the VM's device
driver will drive the recovery action. The recovery action requires a
reset of the device, and on s390x PCI devices are reset using
architecture-specific instructions (zpci_device_hot_reset()). QEMU will
intercept any low-level recovery instructions from the VM and then
perform a reset of the device on behalf of the VM [2]. QEMU performs the
reset by invoking the VFIO_DEVICE_RESET ioctl, which attempts to reset
the device using pci_try_reset_function().
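
For reference, the userspace side of that last step can be sketched
roughly as below. This is only an illustration, not QEMU's actual code:
the helper name vfio_reset_device() is made up, and device_fd is assumed
to be a VFIO device file descriptor that the caller has already obtained
(for example via VFIO_GROUP_GET_DEVICE_FD).

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Hypothetical helper: ask VFIO to reset the device behind device_fd.
 * VFIO_DEVICE_RESET takes no argument. Returns 0 on success or a
 * negative errno value on failure.
 */
static int vfio_reset_device(int device_fd)
{
	if (ioctl(device_fd, VFIO_DEVICE_RESET) < 0)
		return -errno;

	return 0;
}
```

A caller would typically treat a negative return value as "reset not
performed" and report the failure back to the guest.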

Once a device is in an error state, MMIO to the device is blocked, so
any PCI reads of the Config Space will return -1. Since
pci_try_reset_function() will try to save the state of the device's
Config Space with pci_dev_save_and_disable(), it will end up saving -1
as the state. Later, when we try to restore the state after a reset, we
end up corrupting device registers, which can send the device into an
error state again. I was trying to avoid this with the patch.
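
To make the failure mode concrete, here is a small userspace model of
it. This is a sketch under stated assumptions and not the code from the
patch: struct fake_dev, cfg_read(), save_state_checked() and
restore_state() are all hypothetical stand-ins, and possible_error()
only mirrors the idea behind the kernel's PCI_POSSIBLE_ERROR() (an
all-ones read). The model checks dword 0 (Vendor/Device ID), since an
all-ones Vendor ID is never valid for a present device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CFG_DWORDS 16	/* model only the first 64 bytes of Config Space */

/* Hypothetical stand-in for a PCI device and its saved state. */
struct fake_dev {
	uint32_t regs[CFG_DWORDS];	/* "hardware" registers */
	uint32_t saved[CFG_DWORDS];	/* last saved copy */
	bool blocked;			/* device is in the error state */
	bool state_saved;
};

/* Same idea as PCI_POSSIBLE_ERROR(): an all-ones read may be an error. */
static bool possible_error(uint32_t val)
{
	return val == UINT32_MAX;
}

/* Reads from a blocked device return all ones (-1). */
static uint32_t cfg_read(const struct fake_dev *dev, size_t idx)
{
	return dev->blocked ? UINT32_MAX : dev->regs[idx];
}

/* Save Config Space, but keep the old copy if the device is unreadable. */
static bool save_state_checked(struct fake_dev *dev)
{
	if (possible_error(cfg_read(dev, 0)))
		return false;

	for (size_t i = 0; i < CFG_DWORDS; i++)
		dev->saved[i] = cfg_read(dev, i);
	dev->state_saved = true;
	return true;
}

/* Write the saved copy back, as would happen after a reset. */
static void restore_state(struct fake_dev *dev)
{
	if (!dev->state_saved)
		return;
	memcpy(dev->regs, dev->saved, sizeof(dev->regs));
}

/* Walk through the scenario described above (register values made up). */
static void demo(void)
{
	struct fake_dev dev = { .regs = { 0x12341014, 0x00100006 } };

	assert(save_state_checked(&dev));	/* healthy device: saved */

	dev.blocked = true;			/* device enters error state */
	assert(!save_state_checked(&dev));	/* -1 is not saved */
	assert(dev.saved[1] == 0x00100006);	/* earlier copy is intact */

	dev.blocked = false;			/* reset cleared the error */
	dev.regs[1] = 0;			/* and wiped the register */
	restore_state(&dev);
	assert(dev.regs[1] == 0x00100006);	/* good values restored */
}
```

Without the check in save_state_checked(), the second save would fill
saved[] with 0xffffffff and restore_state() would write those values
back to the device.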

Hopefully, this answers your questions.

[1] QEMU series 
https://lore.kernel.org/all/20250925174852.1302-1-alifm@linux.ibm.com/

[2] v1 patch series discussion on some nuances of the reset mechanism
https://lore.kernel.org/all/20250814145743.204ca19a.alex.williamson@redhat.com/

>
> Furthermore, I'm thinking that you should be using pci_channel_offline()
> to detect accessibility of the device, rather than reading from
> Config Space and checking for PCI_POSSIBLE_ERROR().
>
>>> The state saved on device addition is just the initial state and
>>> it is fine if later on it gets updated (which is a nicer term than
>>> "overwritten").  E.g. when portdrv.c instantiates port services
>>> and drivers are bound to them, various registers in Config Space
>>> are changed, hence pcie_portdrv_probe() calls pci_save_state()
>>> again.
>>>
>>> However we can discuss whether pci_save_state() is still needed
>>> in pci_dev_save_and_disable().
>> The commit 8dd7f8036c12 ("PCI: add support for function level reset")
>> introduced the logic of saving/restoring the device state after an FLR. My
>> assumption is it was done to save the most recent state of the device (as
>> the state could be updated by drivers). So I think it would still make sense
>> to save the device state in pci_dev_save_and_disable() if the Config Space
>> is still accessible?
> Yes, right now we can't assume that drivers call pci_save_state()
> in their probe hook if they modified Config Space.  They may rely
> on the state being saved prior to reset or a D3hot/D3cold transition.
> So we need to keep the pci_dev_save_and_disable() call for now.
>
> Generally the expectation is that Config Space is accessible when
> performing a reset with pci_try_reset_function().  Since that's
> apparently not guaranteed for your use case, I'm wondering if you
> might be using the function in a context it's not supposed to be used.

I am open to suggestions on how we can do this.

Thanks

Farhan

