[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aLmcO8ukT-CDZMuT@ryzen>
Date: Thu, 4 Sep 2025 16:03:39 +0200
From: Niklas Cassel <cassel@...nel.org>
To: Manivannan Sadhasivam <mani@...nel.org>
Cc: manivannan.sadhasivam@....qualcomm.com,
Bjorn Helgaas <bhelgaas@...gle.com>,
Mahesh J Salgaonkar <mahesh@...ux.ibm.com>,
Oliver O'Halloran <oohall@...il.com>, Will Deacon <will@...nel.org>,
Lorenzo Pieralisi <lpieralisi@...nel.org>,
Krzysztof WilczyĆski <kwilczynski@...nel.org>,
Rob Herring <robh@...nel.org>, Heiko Stuebner <heiko@...ech.de>,
Philipp Zabel <p.zabel@...gutronix.de>, linux-pci@...r.kernel.org,
linux-kernel@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org,
linux-arm-kernel@...ts.infradead.org, linux-arm-msm@...r.kernel.org,
linux-rockchip@...ts.infradead.org,
Wilfred Mallawa <wilfred.mallawa@....com>,
Krishna Chaitanya Chundru <krishna.chundru@....qualcomm.com>,
Lukas Wunner <lukas@...ner.de>
Subject: Re: [PATCH v6 0/4] PCI: Add support for resetting the Root Ports in
a platform specific way
Hello Mani,
On Fri, Aug 29, 2025 at 09:44:08PM +0530, Manivannan Sadhasivam wrote:
> On Fri, Aug 15, 2025 at 11:07:42AM GMT, Niklas Cassel wrote:
(snip)
> > > > > ## On EP side:
> > > > > # echo 0 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start && \
> > > > > sleep 0.1 && echo 1 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start
> > > > >
> > > > > Basically all tests timeout
> > > > > # FAILED: 1 / 16 tests passed.
> > > > >
> > > > > Which is the same as before this patch series.
> > >
> > > This is kind of expected since the pci_endpoint_test driver doesn't have the AER
> > > err_handlers defined.
> >
> > I see.
> > Would be nice if we could add them then, so that we can verify that this
> > series is working as intended.
(snip)
> Ok, thanks for the logs. I guess what is happening here is that we are not
> saving/restoring the config space of the devices under the Root Port if linkdown
> is happens. TBH, we cannot do that from the PCI core since once linkdown
> happens, we cannot access any devices underneath the Root Port. But if
> err_handlers are available for drivers for all devices, they could do something
> smart like below:
>
> diff --git a/drivers/misc/pci_endpoint_test.c b/drivers/misc/pci_endpoint_test.c
> index c4e5e2c977be..9aabf1fe902e 100644
> --- a/drivers/misc/pci_endpoint_test.c
> +++ b/drivers/misc/pci_endpoint_test.c
> @@ -989,6 +989,8 @@ static int pci_endpoint_test_probe(struct pci_dev *pdev,
>
> pci_set_drvdata(pdev, test);
>
> + pci_save_state(pdev);
> +
> id = ida_alloc(&pci_endpoint_test_ida, GFP_KERNEL);
> if (id < 0) {
> ret = id;
> @@ -1140,12 +1142,31 @@ static const struct pci_device_id pci_endpoint_test_tbl[] = {
> };
> MODULE_DEVICE_TABLE(pci, pci_endpoint_test_tbl);
>
> +static pci_ers_result_t pci_endpoint_test_error_detected(struct pci_dev *pdev,
> + pci_channel_state_t state)
> +{
> + return PCI_ERS_RESULT_NEED_RESET;
> +}
> +
> +static pci_ers_result_t pci_endpoint_test_slot_reset(struct pci_dev *pdev)
> +{
> + pci_restore_state(pdev);
> +
> + return PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +static const struct pci_error_handlers pci_endpoint_test_err_handler = {
> + .error_detected = pci_endpoint_test_error_detected,
> + .slot_reset = pci_endpoint_test_slot_reset,
> +};
> +
> static struct pci_driver pci_endpoint_test_driver = {
> .name = DRV_MODULE_NAME,
> .id_table = pci_endpoint_test_tbl,
> .probe = pci_endpoint_test_probe,
> .remove = pci_endpoint_test_remove,
> .sriov_configure = pci_sriov_configure_simple,
> + .err_handler = &pci_endpoint_test_err_handler,
> };
> module_pci_driver(pci_endpoint_test_driver);
>
> This essentially saves the good known config space during probe and restores it
> during the slot_reset callback. Ofc, the state would've been overwritten if
> suspend/resume happens in-between, but the point I'm making is that unless all
> device drivers restore their known config space, devices cannot be resumed
> properly post linkdown recovery.
>
> I can add a patch based on the above diff in next revision if that helps. Right
> now, I do not have access to my endpoint test setup. So can't test anything.
I tested your patch series + your suggested change above, and after a:
## On EP side:
# echo 0 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start && \
sleep 0.1 && echo 1 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start
Instead of:
# FAILED: 1 / 16 tests passed.
I now get:
# FAILED: 7 / 16 tests passed.
Test cases 1-7 now passes (the test cases related to BARs),
all other test cases still fail:
# /pcitest
TAP version 13
1..16
# Starting 16 tests from 9 test cases.
# RUN pci_ep_bar.BAR0.BAR_TEST ...
# OK pci_ep_bar.BAR0.BAR_TEST
ok 1 pci_ep_bar.BAR0.BAR_TEST
# RUN pci_ep_bar.BAR1.BAR_TEST ...
# OK pci_ep_bar.BAR1.BAR_TEST
ok 2 pci_ep_bar.BAR1.BAR_TEST
# RUN pci_ep_bar.BAR2.BAR_TEST ...
# OK pci_ep_bar.BAR2.BAR_TEST
ok 3 pci_ep_bar.BAR2.BAR_TEST
# RUN pci_ep_bar.BAR3.BAR_TEST ...
# OK pci_ep_bar.BAR3.BAR_TEST
ok 4 pci_ep_bar.BAR3.BAR_TEST
# RUN pci_ep_bar.BAR4.BAR_TEST ...
# SKIP BAR is disabled
# OK pci_ep_bar.BAR4.BAR_TEST
ok 5 pci_ep_bar.BAR4.BAR_TEST # SKIP BAR is disabled
# RUN pci_ep_bar.BAR5.BAR_TEST ...
# OK pci_ep_bar.BAR5.BAR_TEST
ok 6 pci_ep_bar.BAR5.BAR_TEST
# RUN pci_ep_basic.CONSECUTIVE_BAR_TEST ...
# OK pci_ep_basic.CONSECUTIVE_BAR_TEST
ok 7 pci_ep_basic.CONSECUTIVE_BAR_TEST
# RUN pci_ep_basic.LEGACY_IRQ_TEST ...
# pci_endpoint_test.c:106:LEGACY_IRQ_TEST:Expected 0 (0) == ret (-110)
# pci_endpoint_test.c:106:LEGACY_IRQ_TEST:Test failed for Legacy IRQ
# LEGACY_IRQ_TEST: Test failed
# FAIL pci_ep_basic.LEGACY_IRQ_TEST
not ok 8 pci_ep_basic.LEGACY_IRQ_TEST
# RUN pci_ep_basic.MSI_TEST ...
# pci_endpoint_test.c:118:MSI_TEST:Expected 0 (0) == ret (-110)
# pci_endpoint_test.c:118:MSI_TEST:Test failed for MSI1
# pci_endpoint_test.c:118:MSI_TEST:Expected 0 (0) == ret (-110)
# pci_endpoint_test.c:118:MSI_TEST:Test failed for MSI2
# pci_endpoint_test.c:118:MSI_TEST:Expected 0 (0) == ret (-110)
# pci_endpoint_test.c:118:MSI_TEST:Test failed for MSI3
...
I think I know the reason.. you save the state before the IRQs have been allocated.
Perhaps we need to save the state after enabling IRQs?
I tried this patch on top of your patch:
--- a/drivers/misc/pci_endpoint_test.c
+++ b/drivers/misc/pci_endpoint_test.c
@@ -851,6 +851,8 @@ static int pci_endpoint_test_set_irq(struct pci_endpoint_test *test,
return ret;
}
+ pci_save_state(pdev);
+
return 0;
}
But still:
# FAILED: 7 / 16 tests passed.
So... apparently that did not help...
I tried with the following change as well (on top of my patch above):
+static pci_ers_result_t pci_endpoint_test_slot_reset(struct pci_dev *pdev)
+{
+ struct pci_endpoint_test *test = pci_get_drvdata(pdev);
+ int irq_type = test->irq_type;
+
+ pci_restore_state(pdev);
+
+ if (irq_type != PCITEST_IRQ_TYPE_UNDEFINED) {
+ pci_endpoint_test_clear_irq(test);
+ pci_endpoint_test_set_irq(test, irq_type);
+ }
+
+ return PCI_ERS_RESULT_RECOVERED;
+}
But still only:
# FAILED: 7 / 16 tests passed.
Do you have any suggestions?
Kind regards,
Niklas
Powered by blists - more mailing lists