linux-kernel - Re: [PATCH] PCI: cadence: Fixed cdns_pcie_host_link

Message-ID: <20241219112051.pjr3a4evtftlpxau@thinkpad>
Date: Thu, 19 Dec 2024 16:50:51 +0530
From: Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org>
To: Hans Zhang <18255117159@....com>
Cc: Siddharth Vadapalli <s-vadapalli@...com>, lpieralisi@...nel.org,
	kw@...ux.com, robh@...nel.org, bhelgaas@...gle.com,
	thomas.richard@...tlin.com, linux-pci@...r.kernel.org,
	linux-kernel@...r.kernel.org, rockswang7@...il.com
Subject: Re: [PATCH] PCI: cadence: Fixed cdns_pcie_host_link_setup return
 value.

On Thu, Dec 19, 2024 at 05:29:01AM -0500, Hans Zhang wrote:
> 
> 
> On 12/19/24 04:49, Manivannan Sadhasivam wrote:
> > On Thu, Dec 19, 2024 at 04:38:11AM -0500, Hans Zhang wrote:
> > > 
> > > On 12/19/24 03:59, Siddharth Vadapalli wrote:
> > > > On Thu, Dec 19, 2024 at 03:49:33AM -0500, Hans Zhang wrote:
> > > > > On 12/19/24 03:33, Siddharth Vadapalli wrote:
> > > > > > On Thu, Dec 19, 2024 at 03:14:52AM -0500, Hans Zhang wrote:
> > > > > > > If the PCIe link never came up, the enumeration process
> > > > > > > should not be run.
> > > > > > The link could come up at a later point in time. Please refer to the
> > > > > > implementation of:
> > > > > > dw_pcie_host_init() in drivers/pci/controller/dwc/pcie-designware-host.c
> > > > > > wherein we have the following:
> > > > > > 	/* Ignore errors, the link may come up later */
> > > > > > 	dw_pcie_wait_for_link(pci);
> > > > > > 
> > > > > > It seems to me that the logic behind ignoring the absence of the link
> > > > > > within cdns_pcie_host_link_setup() instead of erroring out, is similar to
> > > > > > that of dw_pcie_wait_for_link().
> > > > > > 
> > > > > > Regards,
> > > > > > Siddharth.
> > > > > > 
> > > > > > 
> > > > > > > If a PCIe port is not connected to a device, the PCIe link does not
> > > > > > > come up. The current code returns success whether the device is connected
> > > > > > > or not. Cadence IP's ECAM requires an LTSSM at L0 to access the RC's
> > > > > > > config space registers. Otherwise the enumeration process will hang.
> > > > The ">" symbols seem to be manually added in your reply and are also
> > > > incorrect. If you have added them manually, please don't add them at the
> > > > start of the sentences corresponding to your reply.
> > > > 
> > > > The issue you are facing seems to be specific to the Cadence IP or the way
> > > > in which the IP has been integrated into the device that you are using.
> > > > On TI SoCs which have the Cadence PCIe Controller, absence of PCIe devices
> > > > doesn't result in a hang. Enumeration should proceed irrespective of the
> > > > presence of PCIe devices and should indicate their absence when they aren't
> > > > connected.
> > > > 
> > > > While I am not denying the issue being seen, the fix should probably be
> > > > done elsewhere.
> > > > 
> > > > Regards,
> > > > Siddharth.
> > > We are the SOC design company and we have confirmed with the designer and
> > > Cadence. For the Cadence's IP we are using, ECAM must be L0 at LTSSM to be
> > > used. Cadence will fix this in the next RTL version.
> > > 
> > 
> > I don't understand what you mean by 'ECAM must be L0 at LTSSM'. If you do not
> > connect the device, LTSSM would still be in 'detect' state until the device is
> > connected. Is that different on your SoC?
> > 
> > > If the cdns_pcie_host_link_setup return value is not modified, the following
> > > is the log of the enumeration process without connected devices. There is a
> > > hang of more than 300 seconds. So I don't think it makes sense to run the
> > > enumeration process without connected devices, and it will affect the boot
> > > time.
> > > 
> > 
> > We don't know your driver, so cannot comment on the issue without understanding
> > the problem, sorry.
> > 
> > - Mani
> > 
> > > [ 2.681770] xxx pcie: xxx_pcie_probe starting!
> > > [ 2.689537] xxx pcie: host bridge /soc@...cie@xxx ranges:
> > > [ 2.698601] xxx pcie: IO 0x0060100000..0x00601fffff -> 0x0060100000
> > > [ 2.708625] xxx pcie: MEM 0x0060200000..0x007fffffff -> 0x0060200000
> > > [ 2.718649] xxx pcie: MEM 0x1800000000..0x1bffffffff -> 0x1800000000
> > > [ 2.744441] xxx pcie: ioremap rcsu, paddr:[mem 0x0a000000-0x0a00ffff],
> > > vaddr:ffff800089390000
> > > [ 2.756230] xxx pcie: ioremap msg, paddr:[mem 0x60000000-0x600fffff],
> > > vaddr:ffff800089800000
> > > [ 2.769692] xxx pcie: ECAM at [mem 0x2c000000-0x2fffffff] for [bus c0-ff]
> > > [ 2.780139] xxx.pcie_phy: pcie_phy_common_init end
> > > [ 2.788900] xxx pcie: waiting PHY is ready! retries = 2
> > > [ 3.905292] xxx pcie: Link fail, retries 10 times
> > > [ 3.915054] xxx pcie: ret=-110, rc->quirk_retrain_flag = 0
> > > [ 3.923848] xxx pcie: ret=-110, rc->quirk_retrain_flag = 0
> > > [ 3.932669] xxx pcie: PCI host bridge to bus 0000:c0
> > > [ 3.940847] pci_bus 0000:c0: root bus resource [bus c0-ff]
> > > [ 3.948322] pci_bus 0000:c0: root bus resource [io 0x0000-0xfffff] (bus
> > > address [0x60100000-0x601fffff])
> > > [ 3.959922] pci_bus 0000:c0: root bus resource [mem 0x60200000-0x7fffffff]
> > > [ 3.968799] pci_bus 0000:c0: root bus resource [mem
> > > 0x1800000000-0x1bffffffff pref]
> > > [ 339.667761] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > [ 339.677449] rcu: 5-...0: (20 ticks this GP) idle=4d94/1/0x4000000000000000
> > > softirq=20/20 fqs=2623
> > > [ 339.688184] (detected by 2, t=5253 jiffies, g=-1119, q=2 ncpus=12)
> > > [ 339.696193] Sending NMI from CPU 2 to CPUs 5:
> > > [ 349.703670] rcu: rcu_preempt kthread timer wakeup didn't happen for 2509
> > > jiffies! g-1119 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
> > > [ 349.718710] rcu: Possible timer handling issue on cpu=2 timer-softirq=1208
> > > [ 349.727418] rcu: rcu_preempt kthread starved for 2515 jiffies! g-1119 f0x0
> > > RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
> > > [ 349.739642] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM
> > > is now expected behavior.
> > > [ 349.750546] rcu: RCU grace-period kthread stack dump:
> > > [ 349.757319] task:rcu_preempt state:I stack:0 pid:14 ppid:2
> > > flags:0x00000008
> > > [ 349.767439] Call trace:
> > > [ 349.771575] __switch_to+0xdc/0x150
> > > [ 349.776777] __schedule+0x2dc/0x7d0
> > > [ 349.781972] schedule+0x5c/0x100
> > > [ 349.786903] schedule_timeout+0x8c/0x100
> > > [ 349.792538] rcu_gp_fqs_loop+0x140/0x420
> > > [ 349.798176] rcu_gp_kthread+0x134/0x164
> > > [ 349.803725] kthread+0x108/0x10c
> > > [ 349.808657] ret_from_fork+0x10/0x20
> > > [ 349.813942] rcu: Stack dump where RCU GP kthread last ran:
> > > [ 349.821156] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G S xxx-build-generic
> > > #8
> > > [ 349.831887] Hardware name: xxx Reference Board, BIOS xxx
> > > [ 349.843583] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
> > > BTYPE=--)
> > > [ 349.852294] pc : arch_cpu_idle+0x18/0x2c
> > > [ 349.857928] lr : arch_cpu_idle+0x14/0x2c
> > > 
> > > Regards Hans
> > > 
> > 
> 
> I am very sorry; the previous email was flagged as containing HTML, so I
> resent it twice.
> 
> 
> > I don't understand what you mean by 'ECAM must be L0 at LTSSM'. If you do not
> > connect the device, LTSSM would still be in 'detect' state until the device is
> > connected. Is that different on your SoC?
> 
> If a PCIe port is not connected to a device, pci_host_probe still runs and
> performs the enumeration process. During the enumeration process, VID and PID
> are read. If the LTSSM is not in L0, the AXI transaction issued by the CPU
> will not be sent, that is, the AXI slave will hang. This is the problem with
> the Cadence IP we are using.
> 

This sounds similar to the issues we have seen with other IP implementations:

15b23906347c ("PCI: dwc: Add link up check in dw_child_pcie_ops.map_bus()")
9e9ec8d8692a ("PCI: keystone: Add link up check to ks_pcie_other_map_bus()")

If a config space access targets a device that does not exist on the bus,
an SError is triggered and the system hangs.

In that case, you need to skip the enumeration in your own
'struct pci_ops::map_bus' callback. Even though it is not the best solution, we
have to live with it.
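
To illustrate the idea, here is a minimal userspace sketch of a map_bus-style
link-up check, not actual driver code. All names (demo_rc, demo_map_bus,
demo_cfg_read32) are hypothetical, and the signature is simplified from the
real 'struct pci_ops::map_bus'. It assumes that the root port's own registers
are reachable without a link; if your IP needs L0 even for those, the check
would have to come first. In the kernel, a NULL return from map_bus makes the
generic accessors such as pci_generic_config_read() report
PCIBIOS_DEVICE_NOT_FOUND; the mock read below stands in for that by returning
all ones.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a root-complex driver's private state. */
struct demo_rc {
	bool link_up;              /* would come from an LTSSM status register */
	uint8_t root_bus;          /* bus number of the root port itself */
	uint8_t cfg_window[4096];  /* stand-in for the mapped config window */
};

/*
 * Mock of a map_bus-style callback: returns a pointer into config space,
 * or NULL to tell the caller that the access must be skipped.
 */
static void *demo_map_bus(struct demo_rc *rc, uint8_t busnr,
			  unsigned int devfn, int where)
{
	if (where < 0 || (size_t)where + 4 > sizeof(rc->cfg_window))
		return NULL;

	/* Assumption: the root port (devfn 0 on the root bus) needs no link. */
	if (busnr == rc->root_bus)
		return devfn == 0 ? &rc->cfg_window[where] : NULL;

	/*
	 * Downstream access: skip it when the link is down. This check is
	 * racy, since the link can still drop right after it is sampled.
	 */
	if (!rc->link_up)
		return NULL;

	return &rc->cfg_window[where];
}

/* Mock config read: all ones when the access was skipped. */
static uint32_t demo_cfg_read32(struct demo_rc *rc, uint8_t busnr,
				unsigned int devfn, int where)
{
	uint32_t val = 0xffffffffu;
	void *addr = demo_map_bus(rc, busnr, devfn, where);

	if (addr)
		memcpy(&val, addr, sizeof(val));
	return val;
}
```

With the link down, a read of the Vendor ID on a downstream bus returns all
ones instead of touching the hardware, so enumeration sees "no device" and
moves on.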

- Mani

-- 
மணிவண்ணன் சதாசிவம்