linux-kernel - Re: [PATCH v6 0/4] PCI: Add support for resetting the Root Ports in a platform specific way

Message-ID: <aLmcO8ukT-CDZMuT@ryzen>
Date: Thu, 4 Sep 2025 16:03:39 +0200
From: Niklas Cassel <cassel@...nel.org>
To: Manivannan Sadhasivam <mani@...nel.org>
Cc: manivannan.sadhasivam@....qualcomm.com,
	Bjorn Helgaas <bhelgaas@...gle.com>,
	Mahesh J Salgaonkar <mahesh@...ux.ibm.com>,
	Oliver O'Halloran <oohall@...il.com>, Will Deacon <will@...nel.org>,
	Lorenzo Pieralisi <lpieralisi@...nel.org>,
	Krzysztof Wilczyński <kwilczynski@...nel.org>,
	Rob Herring <robh@...nel.org>, Heiko Stuebner <heiko@...ech.de>,
	Philipp Zabel <p.zabel@...gutronix.de>, linux-pci@...r.kernel.org,
	linux-kernel@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org,
	linux-arm-kernel@...ts.infradead.org, linux-arm-msm@...r.kernel.org,
	linux-rockchip@...ts.infradead.org,
	Wilfred Mallawa <wilfred.mallawa@....com>,
	Krishna Chaitanya Chundru <krishna.chundru@....qualcomm.com>,
	Lukas Wunner <lukas@...ner.de>
Subject: Re: [PATCH v6 0/4] PCI: Add support for resetting the Root Ports in
 a platform specific way

Hello Mani,

On Fri, Aug 29, 2025 at 09:44:08PM +0530, Manivannan Sadhasivam wrote:
> On Fri, Aug 15, 2025 at 11:07:42AM GMT, Niklas Cassel wrote:

(snip)

> > > > > ## On EP side:
> > > > > # echo 0 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start && \
> > > > >   sleep 0.1 && echo 1 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start
> > > > > 
> > > > > Basically all tests timeout
> > > > > # FAILED: 1 / 16 tests passed.
> > > > > 
> > > > > Which is the same as before this patch series.
> > > 
> > > This is kind of expected since the pci_endpoint_test driver doesn't have the AER
> > > err_handlers defined.
> > 
> > I see.
> > Would be nice if we could add them then, so that we can verify that this
> > series is working as intended.

(snip)

> Ok, thanks for the logs. I guess what is happening here is that we are not
> saving/restoring the config space of the devices under the Root Port if linkdown
> happens. TBH, we cannot do that from the PCI core since once linkdown
> happens, we cannot access any devices underneath the Root Port. But if
> err_handlers are available for drivers for all devices, they could do something
> smart like below:
> 
> diff --git a/drivers/misc/pci_endpoint_test.c b/drivers/misc/pci_endpoint_test.c
> index c4e5e2c977be..9aabf1fe902e 100644
> --- a/drivers/misc/pci_endpoint_test.c
> +++ b/drivers/misc/pci_endpoint_test.c
> @@ -989,6 +989,8 @@ static int pci_endpoint_test_probe(struct pci_dev *pdev,
>  
>         pci_set_drvdata(pdev, test);
>  
> +       pci_save_state(pdev);
> +
>         id = ida_alloc(&pci_endpoint_test_ida, GFP_KERNEL);
>         if (id < 0) {
>                 ret = id;
> @@ -1140,12 +1142,31 @@ static const struct pci_device_id pci_endpoint_test_tbl[] = {
>  };
>  MODULE_DEVICE_TABLE(pci, pci_endpoint_test_tbl);
>  
> +static pci_ers_result_t pci_endpoint_test_error_detected(struct pci_dev *pdev,
> +                                              pci_channel_state_t state)
> +{
> +       return PCI_ERS_RESULT_NEED_RESET;
> +}
> +
> +static pci_ers_result_t pci_endpoint_test_slot_reset(struct pci_dev *pdev)
> +{
> +       pci_restore_state(pdev);
> +
> +       return PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +static const struct pci_error_handlers pci_endpoint_test_err_handler = {
> +       .error_detected = pci_endpoint_test_error_detected,
> +       .slot_reset = pci_endpoint_test_slot_reset,
> +};
> +
>  static struct pci_driver pci_endpoint_test_driver = {
>         .name           = DRV_MODULE_NAME,
>         .id_table       = pci_endpoint_test_tbl,
>         .probe          = pci_endpoint_test_probe,
>         .remove         = pci_endpoint_test_remove,
>         .sriov_configure = pci_sriov_configure_simple,
> +       .err_handler    = &pci_endpoint_test_err_handler,
>  };
>  module_pci_driver(pci_endpoint_test_driver);
> 
> This essentially saves the known-good config space during probe and restores it
> during the slot_reset callback. Ofc, the state would've been overwritten if
> suspend/resume happens in-between, but the point I'm making is that unless all
> device drivers restore their known config space, devices cannot be resumed
> properly post linkdown recovery.
> 
> I can add a patch based on the above diff in next revision if that helps. Right
> now, I do not have access to my endpoint test setup. So can't test anything.

I tested your patch series + your suggested change above, and after a:

## On EP side:
# echo 0 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start && \
  sleep 0.1 && echo 1 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start

Instead of:

# FAILED: 1 / 16 tests passed.

I now get:
# FAILED: 7 / 16 tests passed.

Test cases 1-7 now pass (the test cases related to BARs),
all other test cases still fail:

# /pcitest 
TAP version 13
1..16
# Starting 16 tests from 9 test cases.
#  RUN           pci_ep_bar.BAR0.BAR_TEST ...
#            OK  pci_ep_bar.BAR0.BAR_TEST
ok 1 pci_ep_bar.BAR0.BAR_TEST
#  RUN           pci_ep_bar.BAR1.BAR_TEST ...
#            OK  pci_ep_bar.BAR1.BAR_TEST
ok 2 pci_ep_bar.BAR1.BAR_TEST
#  RUN           pci_ep_bar.BAR2.BAR_TEST ...
#            OK  pci_ep_bar.BAR2.BAR_TEST
ok 3 pci_ep_bar.BAR2.BAR_TEST
#  RUN           pci_ep_bar.BAR3.BAR_TEST ...
#            OK  pci_ep_bar.BAR3.BAR_TEST
ok 4 pci_ep_bar.BAR3.BAR_TEST
#  RUN           pci_ep_bar.BAR4.BAR_TEST ...
#      SKIP      BAR is disabled
#            OK  pci_ep_bar.BAR4.BAR_TEST
ok 5 pci_ep_bar.BAR4.BAR_TEST # SKIP BAR is disabled
#  RUN           pci_ep_bar.BAR5.BAR_TEST ...
#            OK  pci_ep_bar.BAR5.BAR_TEST
ok 6 pci_ep_bar.BAR5.BAR_TEST
#  RUN           pci_ep_basic.CONSECUTIVE_BAR_TEST ...
#            OK  pci_ep_basic.CONSECUTIVE_BAR_TEST
ok 7 pci_ep_basic.CONSECUTIVE_BAR_TEST
#  RUN           pci_ep_basic.LEGACY_IRQ_TEST ...
# pci_endpoint_test.c:106:LEGACY_IRQ_TEST:Expected 0 (0) == ret (-110)
# pci_endpoint_test.c:106:LEGACY_IRQ_TEST:Test failed for Legacy IRQ
# LEGACY_IRQ_TEST: Test failed
#          FAIL  pci_ep_basic.LEGACY_IRQ_TEST
not ok 8 pci_ep_basic.LEGACY_IRQ_TEST
#  RUN           pci_ep_basic.MSI_TEST ...
# pci_endpoint_test.c:118:MSI_TEST:Expected 0 (0) == ret (-110)
# pci_endpoint_test.c:118:MSI_TEST:Test failed for MSI1
# pci_endpoint_test.c:118:MSI_TEST:Expected 0 (0) == ret (-110)
# pci_endpoint_test.c:118:MSI_TEST:Test failed for MSI2
# pci_endpoint_test.c:118:MSI_TEST:Expected 0 (0) == ret (-110)
# pci_endpoint_test.c:118:MSI_TEST:Test failed for MSI3
...


I think I know the reason: you save the state before the IRQs have been allocated.

Perhaps we need to save the state after enabling IRQs?

I tried this patch on top of your patch:
--- a/drivers/misc/pci_endpoint_test.c
+++ b/drivers/misc/pci_endpoint_test.c
@@ -851,6 +851,8 @@ static int pci_endpoint_test_set_irq(struct pci_endpoint_test *test,
                return ret;
        }
 
+       pci_save_state(pdev);
+
        return 0;
 }


But still:
# FAILED: 7 / 16 tests passed.

So... apparently that did not help...

I tried with the following change as well (on top of my patch above):

+static pci_ers_result_t pci_endpoint_test_slot_reset(struct pci_dev *pdev)
+{
+       struct pci_endpoint_test *test = pci_get_drvdata(pdev);
+       int irq_type = test->irq_type;
+
+       pci_restore_state(pdev);
+
+       if (irq_type != PCITEST_IRQ_TYPE_UNDEFINED) {
+               pci_endpoint_test_clear_irq(test);
+               pci_endpoint_test_set_irq(test, irq_type);
+       }
+
+       return PCI_ERS_RESULT_RECOVERED;
+}

But still only:
# FAILED: 7 / 16 tests passed.

Do you have any suggestions?


Kind regards,
Niklas