[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <0d217128-b08c-4c56-ba7c-ed7b7ddfb75a@nvidia.com>
Date: Tue, 7 Nov 2023 08:44:53 +0530
From: Vidya Sagar <vidyas@...dia.com>
To: Bjorn Helgaas <helgaas@...nel.org>
Cc: Lorenzo Pieralisi <lpieralisi@...nel.org>,
Vikram Sethi <vsethi@...dia.com>,
Thierry Reding <treding@...dia.com>,
Jonathan Hunter <jonathanh@...dia.com>,
Krishna Thota <kthota@...dia.com>,
"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: Question: Clearing error bits in the root port post enumeration
On 11/3/2023 11:50 PM, Bjorn Helgaas wrote:
> External email: Use caution opening links or attachments
>
>
> On Tue, Oct 31, 2023 at 12:26:31PM +0000, Vidya Sagar wrote:
>> Hi folks,
>>
>> I would like to know your comments on the following scenario where
>> we are observing the root port logging errors because of the
>> enumeration flow being followed.
>>
>> DUT information:
>> - Has a root port and an endpoint connected to it
>> - Uses ECAM mechanism to access the configuration space
>> - Booted through ACPI flow
>> - Has a Firmware-First approach for handling the errors
>> - System is configured to treat Unsupported Requests as
>> AdvisoryNon-Fatal errors
>>
>> As we all know, when a configuration read request comes in for a
>> device number that is not implemented, a UR would be returned as per
>> the PCIe spec.
>>
>> As part of the enumeration flow on DUT, when the kernel reads offset
>> 0x0 of B:D:F=0:0:0, the root port responds with its valid Vendor-ID
>> and Device-ID values. But, when B:D:F=0:1:0 is probed, since there
>> is no device present there, the root port responds with an
>> Unsupported Request and simultaneously logs the same in the Device
>> Status register (i.e. bit-3). Because of it, there is a UR logged
>> in the Device Status register of the RP by the time enumeration is
>> complete.
>>
>> In the case of AER capability natively owned by the kernel, the AER
>> driver's init call would clear all such pending bits.
>>
>> Since we are going with the Firmware-First approach, and the system
>> is configured to treat Unsupported Requests as AdvisoryNon-Fatal
>> errors, only a correctable error interrupt can be raised to the
>> Firmware which takes care of clearing the corresponding status
>> registers. The firmware can't know about the UnsupReq bit being set
>> as the interrupt it received is for a correctable error hence it
>> clears only bits related to correctable error.
>>
>> All these events leave a freshly booted system with the following
>> bits set.
>>
>> Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR- (MAbort)
>> DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend- (UnsupReq)
>> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol- (UnsupReq)
>>
>> Since the reason for UR is well understood at this point, I would
>> like to weigh in on the idea of clearing the aforementioned bits in
>> the root port once the enumeration is done particularly to cater to
>> the configurations where Firmware-First approach is in place.
>> Please let me know your comments on this approach.
>
> I think Secondary status (PCI_SEC_STATUS) is always owned by the OS
> and is not affected by _OSC negotiation, right? Linux does basically
> nothing with that today, but I think it *could* clear the "Received
> Master Abort" bit.
Yes. PCI_SEC_STATUS is always owned by the OS and _OSC negotiation
doesn't really affect that.
>
> I'm not very familiar with Advisory Non-Fatal errors. I'm curious
> about the UESta situation: why can't firmware know about UnsupReq
> being set? I assume PCI_ERR_COR_ADV_NFAT is the Correctable Error
> Status bit the firmware *does* see and clear.
Yes, PCI_ERR_COR_ADV_NFAT is indeed cleared by the firmware.
>
> But isn't the whole point of Advisory Non-Fatal errors that an error
> that is logged as an Uncorrectable Error and that normally would be
> signaled with ERR_NONFATAL is signaled with ERR_COR instead? So
> doesn't PCI_ERR_COR_ADV_NFAT being set imply that some
> PCI_ERR_UNCOR_STATUS must be set as well? If so, I would think
> firmware *could* figure that out and clear the PCI_ERR_UNCOR_STATUS
> bit.
So, are you suggesting that let the firmware only clear the
PCI_ERR_UNCOR_STATUS also? if so, then, I can even make the firmware
clear the PCI_SEC_STATUS also thereby leaving the firmware responsible
for clearing all the error bits. Does that sound ok?
>
> Bjorn
Powered by blists - more mailing lists