[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aLFmSFe5iyYDrIjt@wunner.de>
Date: Fri, 29 Aug 2025 10:35:20 +0200
From: Lukas Wunner <lukas@...ner.de>
To: Brian Norris <briannorris@...omium.org>
Cc: manivannan.sadhasivam@....qualcomm.com,
Bjorn Helgaas <bhelgaas@...gle.com>,
Mahesh J Salgaonkar <mahesh@...ux.ibm.com>,
Oliver O'Halloran <oohall@...il.com>, Will Deacon <will@...nel.org>,
Lorenzo Pieralisi <lpieralisi@...nel.org>,
Krzysztof Wilczynski <kwilczynski@...nel.org>,
Manivannan Sadhasivam <mani@...nel.org>,
Rob Herring <robh@...nel.org>, Heiko Stuebner <heiko@...ech.de>,
Philipp Zabel <p.zabel@...gutronix.de>, linux-pci@...r.kernel.org,
linux-kernel@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org,
linux-arm-kernel@...ts.infradead.org, linux-arm-msm@...r.kernel.org,
linux-rockchip@...ts.infradead.org,
Niklas Cassel <cassel@...nel.org>,
Wilfred Mallawa <wilfred.mallawa@....com>,
Krishna Chaitanya Chundru <krishna.chundru@....qualcomm.com>
Subject: Re: [PATCH v6 2/4] PCI: host-common: Add link down handling for Root
Ports
On Thu, Aug 28, 2025 at 01:25:12PM -0700, Brian Norris wrote:
> On the flip side: it's not clear
> PCI_ERS_RESULT_NEED_RESET+pci_channel_io_normal works as documented
> either. An endpoint might think it's requesting a slot reset, but
> pcie_do_recovery() will ignore that and skip reset_subordinates()
> (pci_host_reset_root_port()).
>
> All in all, the docs sound like endpoints _should_ have control over
> whether we exercise a full port/slot reset for all types of errors. But
> in practice, we do not actually give it that control. i.e., your commit
> message is correct, and the docs are not.
>
> I have half a mind to suggest the appended change, so the behavior
> matches (some of) the docs a little better [1].
A change similar to the one you're proposing is already queued on the
pci/aer topic branch for v6.18:
https://git.kernel.org/pci/pci/c/d0a2dee7d458
Here's the corresponding cover letter:
https://lore.kernel.org/r/cover.1755008151.git.lukas@wunner.de
There was a discussion why I didn't take the exact same approach you're
proposing, but only a similar one:
https://lore.kernel.org/r/aJ2uE6v46Zib30Jh@wunner.de
https://lore.kernel.org/r/aKHWf3L0NCl_CET5@wunner.de
> Specifically, I'm trying to see what's supposed to happen with
> PCI_ERS_RESULT_CAN_RECOVER. I see that for pci_channel_io_frozen, almost
> all endpoint drivers return PCI_ERS_RESULT_NEED_RESET, but if drivers
> actually return PCI_ERS_RESULT_CAN_RECOVER, it's unclear what should
> happen.
>
> Today, we don't actually respect it; pcie_do_recovery() just calls
> reset_subordinates() (pci_host_reset_root_port()) unconditionally. The
> only thing that return code affects is whether we call
> report_mmio_enabled() vs report_slot_reset() afterward. This seems odd.
In the series queued on pci/aer, I've only allowed drivers to opt in
to a reset on Non-Fatal Errors. I didn't dare also letting them opt
out of a reset on Fatal Errors.
These changes of behavior are always risky, so it seemed prudent to not
introduce too many changes at once. There was no urgent need to also
change behavior for Fatal Errors for the use case at hand (the xe graphics
driver). I went through all drivers with pci_error_handlers to avoid
breaking any of them. It's very tedious work, takes weeks. It would
be necessary to do that again when changing behavior for Fatal Errors.
pcieaer-howto.rst justifies the unconditional reset on Fatal Errors by
saying that the link is unreliable and that a reset is thus required.
On the other hand, pci-error-recovery.rst (which is a few months older
than pcieaer-howto.rst) says in section "STEP 3: Link Reset":
"This is a PCIe specific step and is done whenever a fatal error has been
detected"
I'm wondering if the authors of pcieaer-howto.rst took that at face value
and thought they'd *have* to reset the link on Fatal Errors.
Looking through the Fatal Errors in PCIe r7.0 sec 6.2.7, I think a reset
is justified for some of them, but optional for others. Which leads me
to believe that the AER driver should actually enforce a reset only for
certain Fatal Errors, not all of them. So this seems like something
worth revisiting in the future.
> All in all, the docs sound like endpoints _should_ have control over
> whether we exercise a full port/slot reset for all types of errors. But
> in practice, we do not actually give it that control. i.e., your commit
> message is correct, and the docs are not.
Indeed the documentation is no longer in sync with the code. I've just
submitted a series to rectify that and cc'ed you:
https://lore.kernel.org/r/cover.1756451884.git.lukas@wunner.de
Thanks,
Lukas
Powered by blists - more mailing lists