Message-ID: <20231127152641.GA5149@wunner.de>
Date: Mon, 27 Nov 2023 16:26:41 +0100
From: Lukas Wunner <lukas@...ner.de>
To: Vidya Sagar <vidyas@...dia.com>
Cc: Bjorn Helgaas <helgaas@...nel.org>,
Lorenzo Pieralisi <lpieralisi@...nel.org>,
Sathyanarayanan Kuppuswamy
<sathyanarayanan.kuppuswamy@...ux.intel.com>,
"kbusch@...nel.org" <kbusch@...nel.org>,
Vikram Sethi <vsethi@...dia.com>,
Krishna Thota <kthota@...dia.com>,
"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
sagar.tv@...il.com
Subject: Re: Race b/w pciehp and FirmwareFirst DPC->EDR flow - Reg
Hi Vidya,
sorry for the delay, still catching up on e-mails after Plumbers...
On Fri, Nov 10, 2023 at 10:31:55PM +0530, Vidya Sagar wrote:
> > - System doesn't have support for in-band PD and supports only OOB PD
> > where writing to a private register would set the PD state
We already have an inband_presence_disabled flag in struct controller
which is set if the In-Band PD Disable Supported bit in the Slot
Capabilities 2 Register is set. The flag may also be set through the
inband_presence_disabled_dmi_table[].
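To make the two sources of the flag concrete, here is a minimal user-space sketch, not the actual pciehp code. The struct ctrl_model type, the model_init_presence() helper and its dmi_quirk parameter are hypothetical names made up for illustration. Only the bit position (bit 0 of Slot Capabilities 2, PCI_EXP_SLTCAP2_IBPD in pci_regs.h) is taken from the kernel headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of Slot Capabilities 2: In-Band PD Disable Supported
 * (PCI_EXP_SLTCAP2_IBPD in include/uapi/linux/pci_regs.h). */
#define SLTCAP2_IBPD 0x00000001u

/* Hypothetical stand-in for the relevant part of struct controller. */
struct ctrl_model {
	bool inband_presence_disabled;
};

/*
 * Hypothetical helper: the flag is set either because the capability
 * bit is set, or because a quirk entry matched. dmi_quirk models a hit
 * in inband_presence_disabled_dmi_table[].
 */
static void model_init_presence(struct ctrl_model *ctrl,
				uint32_t slot_cap2, bool dmi_quirk)
{
	ctrl->inband_presence_disabled =
		(slot_cap2 & SLTCAP2_IBPD) != 0 || dmi_quirk;
}
```

In this model, a slot whose Slot Capabilities 2 value has bit 0 clear and which has no quirk entry ends up with the flag unset.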
Currently the only place where the flag makes a difference is on
slot bringup: pciehp_check_link_status() doesn't wait for the
Presence Detect Status bit to become set.
I'm wondering if we need to also disregard PDC events if In-Band PD
is disabled. Not sure if the behavior you're seeing is caused by a
quirk of the hardware or is expected if In-Band PD is disabled.
Probably the former. I think a code change would generally only be
acceptable in the latter case, though.
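Purely to illustrate what "disregard PDC events" could mean, here is a user-space sketch under the assumption that the pending events are a bitmask of Slot Status bits. It is not a proposed patch, and model_filter_events() is a hypothetical name. The bit values match PCI_EXP_SLTSTA_PDC and PCI_EXP_SLTSTA_DLLSC in pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>

#define SLTSTA_PDC   0x0008u /* Presence Detect Changed */
#define SLTSTA_DLLSC 0x0100u /* Data Link Layer State Changed */

/*
 * Hypothetical filter: if In-Band PD is disabled, drop PDC from the
 * pending events and leave every other event bit untouched.
 */
static uint16_t model_filter_events(uint16_t events,
				    bool inband_presence_disabled)
{
	if (inband_presence_disabled)
		events &= (uint16_t)~SLTSTA_PDC;
	return events;
}
```

With the flag set, an event mask of PDC plus DLLSC is reduced to DLLSC alone, so a link change would still be handled. With the flag clear, the mask is passed through unchanged.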
> > 10. Since PDC (Presence Detect Change) bit is also set for the first
> > interrupt, IST attempts to remove the devices (as part of
> > pciehp_handle_presence_or_link_change())
> >
> > At this point, there is a race between the device driver that is
> > trying to work with the device (through pci_error_handlers callback)
> > and the IST that is trying to remove the device.
> > To be fair to pciehp_handle_presence_or_link_change(), after removing
> > the devices, it checks for the link-up/PD being '1' and scans the
> > devices again if the device is still available. But unfortunately,
> > IST is deadlocked (with the device driver) while removing the devices
> > itself and won't go to the next step.
Could you provide stacktraces of the two deadlocked tasks?
Right now I don't quite understand why they're deadlocked.
Are you getting hung task messages in dmesg?
They should include stacktraces.
Also, which kernel version are we talking about?
Thanks,
Lukas