Message-ID: <ZDgIZmCgLoC/uieX@lpieralisi>
Date:   Thu, 13 Apr 2023 15:49:26 +0200
From:   Lorenzo Pieralisi <lpieralisi@...nel.org>
To:     Damien Le Moal <damien.lemoal@...nsource.wdc.com>
Cc:     Rick Wertenbroek <rick.wertenbroek@...il.com>,
        alberto.dassatti@...g-vd.ch, xxm@...k-chips.com,
        rick.wertenbroek@...g-vd.ch, Rob Herring <robh+dt@...nel.org>,
        Krzysztof Kozlowski <krzysztof.kozlowski+dt@...aro.org>,
        Heiko Stuebner <heiko@...ech.de>,
        Shawn Lin <shawn.lin@...k-chips.com>,
        Krzysztof Wilczyński <kw@...ux.com>,
        Bjorn Helgaas <bhelgaas@...gle.com>,
        Jani Nikula <jani.nikula@...el.com>,
        Rodrigo Vivi <rodrigo.vivi@...el.com>,
        Mikko Kovanen <mikko.kovanen@...amobile.com>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        devicetree@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
        linux-rockchip@...ts.infradead.org, linux-kernel@...r.kernel.org,
        linux-pci@...r.kernel.org
Subject: Re: [PATCH v2 0/9] PCI: rockchip: Fix RK3399 PCIe endpoint
 controller driver

On Fri, Mar 17, 2023 at 07:09:04AM +0900, Damien Le Moal wrote:
> On 3/17/23 01:34, Rick Wertenbroek wrote:
> >>> By the way, enabling the interrupts to see the error notifications, I do see a
> >>> lot of retry timeout and other recoverable errors. So the issues I am seeing
> >>> could be due to my PCI cable setup that is not ideal (bad signal, ground loops,
> >>> ... ?). Not sure. I do not have a PCI analyzer handy :)
> > 
> > I have enabled the IRQs and messages thanks to your patches but I don't get
> > messages from the IRQs (it seems no IRQs are fired). My PCIe link seems stable.
> > The main issue I face is still that after a random amount of time, the BARs are
> > reset to 0. I don't have a PCIe analyzer so I cannot chase config space TLPs
> > (e.g., host writing the BAR values to the config header), but I don't think that
> > the problem comes from a TLP issued from the host (though it might).
> 
> Hmmm... I am getting lots of IRQs, especially the ones signaling "replay timer
> timed out" and "replay timer rolled over after 4 transmissions of the same TLP"
> but also some "phy error detected on receive side"... Need to try to rework my
> cable setup I guess.
> 
> As for the BARs being reset to 0, I have not checked, but it may be why I see
> things not working after some inactivity. Will check that. We may be seeing the
> same regarding that.
> 
> > I don't think it's a buffer overflow / out-of-bounds access by kernel
> > code for two reasons:
> > 1) The values in the config space around the BARs are coherent and unchanged
> > 2) The BARs are reset to 0 and not a random value
> > 
> > I suspect a hardware reset of those registers issued internally in the
> > PCIe controller,
> > I don't know why (it might be a link related event or power state
> > related event).
> > 
> > I have also experienced very slow behavior with the PCI endpoint test driver,
> > e.g., pcitest -w 1024 -d would take tens of seconds to complete. It seems to
> > come from LCRC errors: when I check the "LCRC Error count register"
> > @0xFD90'0214, I can see it increase drastically between two calls of pcitest
> > (by 6607 (0x19CF), for example).
> > 
> > The "ECC Correctable Error Count Register" @0xFD90'0218 reads 0 though.
> > 
> > I have tried to shorten the cabling by removing one of the PCIe extenders, that
> > didn't change the issues much.
> > 
> > Any ideas as to why I see a large number of TLPs with LCRC errors in them ?
> > Do you experience the same ? What are your values in 0xFD90'0214 when
> > running e.g., pcitest -w 1024 -d (note: you can reset the counter by writing
> > 0xFFFF to it in case it reaches the maximum value of 0xFFFF).
> 
> I have not checked. But I will look at these counters to see what I have there.
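
For anyone wanting to compare counter values as asked above, here is a minimal
sketch of how the check could be scripted on the endpoint side. It is not part
of the patch series. It assumes the counter sits in the low 16 bits of the
register at 0xFD90'0214 and saturates at 0xFFFF (inferred from the reset note
above, not verified against the TRM), and that /dev/mem access to that range is
permitted (root required). The helper names are hypothetical.

```python
import mmap
import os
import struct

LCRC_ERR_COUNT_ADDR = 0xFD900214  # address quoted in this thread
COUNTER_MAX = 0xFFFF              # assumed saturation value


def split_addr(phys_addr, page_size=mmap.PAGESIZE):
    """Split a physical address into (page-aligned base, offset in page)."""
    base = phys_addr & ~(page_size - 1)
    return base, phys_addr - base


def read_reg32(phys_addr):
    """Read one 32-bit little-endian register through /dev/mem."""
    base, offset = split_addr(phys_addr)
    fd = os.open("/dev/mem", os.O_RDONLY | os.O_SYNC)
    try:
        with mmap.mmap(fd, mmap.PAGESIZE, mmap.MAP_SHARED,
                       mmap.PROT_READ, offset=base) as mem:
            return struct.unpack_from("<I", mem, offset)[0]
    finally:
        os.close(fd)


def lcrc_delta(before, after):
    """Errors counted between two reads.

    Returns None when the delta cannot be trusted: the counter hit the
    assumed maximum, or it went backwards (it was reset in between).
    """
    before &= COUNTER_MAX  # assumption: count is in the low 16 bits
    after &= COUNTER_MAX
    if after == COUNTER_MAX or after < before:
        return None
    return after - before


if __name__ == "__main__":
    first = read_reg32(LCRC_ERR_COUNT_ADDR)
    input("Run e.g. 'pcitest -w 1024 -d' on the host, then press Enter: ")
    second = read_reg32(LCRC_ERR_COUNT_ADDR)
    print("LCRC error delta:", lcrc_delta(first, second))
```

A result of None means the counter should be reset and the measurement
repeated over a shorter interval.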

Hi,

checking where we are with this thread and whether there is something to
consider for v6.4, if testing succeeds.

Thanks,
Lorenzo