linux-kernel - Re: Race b/w pciehp and FirmwareFirst DPC->EDR flow

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a73a2156-124b-47f7-b545-6751f6e87d54@nvidia.com>
Date:   Mon, 27 Nov 2023 20:13:50 +0530
From:   Vidya Sagar <vidyas@...dia.com>
To:     Bjorn Helgaas <helgaas@...nel.org>,
        Lorenzo Pieralisi <lpieralisi@...nel.org>,
        Lukas Wunner <lukas@...ner.de>,
        Sathyanarayanan Kuppuswamy 
        <sathyanarayanan.kuppuswamy@...ux.intel.com>,
        "kbusch@...nel.org" <kbusch@...nel.org>
Cc:     Vikram Sethi <vsethi@...dia.com>,
        Krishna Thota <kthota@...dia.com>,
        "linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        sagar.tv@...il.com
Subject: Re: Race b/w pciehp and FirmwareFirst DPC->EDR flow - Reg

Hi Bjorn and others,
Could you please share your thoughts on this?

Thanks,
Vidya Sagar

On 11/10/2023 10:31 PM, Vidya Sagar wrote:
> There was a typo. Corrected it.
> 
> s/If DPC->EDR flow checks DPC also/If DPC->EDR flow checks 'PD Change' also
> 
> On 11/10/2023 6:30 PM, Vidya Sagar wrote:
>> Hi folks,
>> Here are the platform details.
>> - System with a Firmware First approach for both AER and DPC
>> - CPERs are sent to the OS for the AER events
>> - DPC is notified to the OS through EDR mechanism
>> - System doesn't have support for in-band PD and supports only OOB PD 
>> where writing to a private register would set the PD state
>> - System has this design where PD state gets cleared whenever there is 
>> a link down (without even writing to the private register)
>>
>> In this system, if the root port has to experience a DPC because of 
>> any right reasons (say SW trigger of DPC for time being), I'm 
>> observing the following flow which is causing a race.
>> 1. Since DPC brings the link down, there is a DLLSC and an MSI is sent 
>> to the OS hence PCIe HP ISR is invoked.
>> 2. ISR then takes stock of the Slot_Status register to see what all 
>> events caused the MSI generation.
>> 3. It sees both DLLSC and PDC bits getting set.
>> 4. PDC is set because of the aforementioned hardware design where for 
>> every link down, PD state gets cleared and since this is considered as 
>> a change in the PD state, PDC also gets set.
>> 5. PCIe HP ISR flow transitions to the PCIe HP IST (interrupt 
>> thread/bottom half) and waits for the DPC_EDR to complete (because 
>> DLLSC is one of the events)
>> 6. Parallel to the PCIe HP ISR/IST, DPC interrupt is raised to the 
>> firmware and that would then send an EDR event to the OS. Firmware 
>> also sets the PD state to '1' before that, as the endpoint device is 
>> still available.
>> 7. Firmware programming of the private register to set the PD state 
>> raises second interrupt and PCIe HP ISR takes stock of the events and 
>> this time it is only the PDC and not DLLSC. ISR sends a wake to the 
>> IST, but since there is an IST in progress already, nothing much 
>> happens at this point.
>> 8. Once the DPC is completed and link comes back up again, DPC 
>> framework asks the endpoint drivers to handle it by calling the 
>> 'pci_error_handlers' callabacks registered by the endpoint device 
>> drivers.
>> 9. The PCIe HP IST (of the very first MSI) resumes after receiving the 
>> completion from the DPC framework saying that DPC recovery is successful.
>> 10. Since PDC (Presence Detect Change) bit is also set for the first 
>> interrupt, IST attempts to remove the devices (as part of 
>> pciehp_handle_presence_or_link_change())
>>
>> At this point, there is a race between the device driver that is 
>> trying to work with the device (through pci_error_handlers callback) 
>> and the IST that is trying to remove the device.
>> To be fair to pciehp_handle_presence_or_link_change(), after removing 
>> the devices, it checks for the link-up/PD being '1' and scans the 
>> devices again if the device is still available. But unfortunately, IST 
>> is deadlocked (with the device driver) while removing the devices 
>> itself and won't go to the next step.
>>
>> The rationale given in the form of a comment in 
>> pciehp_handle_presence_or_link_change() for removing the devices first 
>> (without checking PD/link-up) and adding them back after checking 
>> link-up/PD is that, since there is a change in PD, the new link-up 
>> need not be with the same old device rather it could be with a 
>> different device. So, it justifies in removing the existing devices 
>> and then scanning for new ones. But this is causing deadlock with the 
>> already initiated DPC recovery process.
>>
>> I see two ways to avoid the deadlock in this scenario.
>> 1. When PCIe HP IST is looking at both DLLSC and PDC, why should 
>> DPC->EDR flow look at only DLLSC and not DPC? If DPC->EDR flow checks 
>> 'PD Change' also (along with DLL) and declares that the DPC recovery 
>> can't happen (as there could be a change of device) hence returning 
>> DPC recovery failure, then, the device driver's pci_error_handlers 
>> callbacks won't be called and there won't be any deadlock.
>> 2. Check for the possibility of a common lock so that PCIe HP IST and 
>> device driver's pci_error_handlers callbacks don't race.
>>
>> Let me know your comments on this.
>>
>> Thanks,
>> Vidya Sagar