lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1e8184e9-7e0b-2598-cc5a-e46d6c2f152a@opensource.wdc.com>
Date:   Wed, 15 Mar 2023 09:00:40 +0900
From:   Damien Le Moal <damien.lemoal@...nsource.wdc.com>
To:     Rick Wertenbroek <rick.wertenbroek@...il.com>
Cc:     alberto.dassatti@...g-vd.ch, xxm@...k-chips.com,
        rick.wertenbroek@...g-vd.ch, Rob Herring <robh+dt@...nel.org>,
        Krzysztof Kozlowski <krzysztof.kozlowski+dt@...aro.org>,
        Heiko Stuebner <heiko@...ech.de>,
        Shawn Lin <shawn.lin@...k-chips.com>,
        Lorenzo Pieralisi <lpieralisi@...nel.org>,
        Krzysztof WilczyƄski <kw@...ux.com>,
        Bjorn Helgaas <bhelgaas@...gle.com>,
        Jani Nikula <jani.nikula@...el.com>,
        Rodrigo Vivi <rodrigo.vivi@...el.com>,
        Mikko Kovanen <mikko.kovanen@...amobile.com>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        devicetree@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
        linux-rockchip@...ts.infradead.org, linux-kernel@...r.kernel.org,
        linux-pci@...r.kernel.org
Subject: Re: [PATCH v2 0/9] PCI: rockchip: Fix RK3399 PCIe endpoint controller
 driver

On 3/15/23 07:54, Damien Le Moal wrote:
> On 3/14/23 23:53, Rick Wertenbroek wrote:
>> Hello Damien,
>> I also noticed random issues I suspect to be related to link status or power
>> state, in my case it sometimes happens that the BARs (0-6) in the config
>> space get reset to 0. This is not due to the driver because the driver never
>> ever accesses these registers (@0xfd80'0010 to 0xfd80'0024 TRM
>> 17.6.4.1.5-17.6.4.1.10).
>> I don't think the host rewrites them because lspci shows the BARs as
>> "[virtual]" which means they have been assigned by host but have 0
>> value in the endpoint device (when lspci rereads the PCI config header).
>> See https://github.com/pciutils/pciutils/blob/master/lspci.c#L422
>>
>> So I suspect the controller detects something related to link status or
>> power state and internally (in hardware) resets those registers. It's not
>> the kernel code, it never accesses these regs. The problem occurs
>> very randomly, sometimes in a few seconds, sometimes I cannot see
>> it for a whole day.
>>
>> Is this similar to what you are experiencing ?
> 
> Yes. I sometimes get NMIs after starting the function driver, when my function
> driver starts probing the bar registers after seeing the host changing one
> register. And the link also comes up with 4 lanes or 2 lanes, random.
> 
>> Do you have any idea as to what could make these registers to be reset
>> (I could not find anything in the TRM, also nothing in the driver seems to
>> cause it).
> 
> My thinking is that since we do not have a linkup notifier, the function driver
> starts setting things up without the link established (e.g. when the host is
> still powered down). Once the host start booting and pic link is established,
> things may be reset in the hardware... That is the only thing I can think of.
> 
> And yes, there are definitely something going on with the power states too I
> think: if I let things idle for a few minutes, everything stops working: no
> activity seen on the endpoint over the BARs. I tried enabling the sys and client
> interrupts to see if I can see power state changes, or if clearing the
> interrupts helps (they are masked by default), but no change. And booting the
> host with pci_aspm=off does not help either. Also tried to change all the
> capabilities related to link & power states to "off" (not supported), and no
> change either. So currently, I am out of ideas regarding that one.
> 
> I am trying to make progress on my endpoint driver (nvme function) to be sure it
> is not a bug there that breaks things. I may still have something bad because
> when I enable the BIOS native NVMe driver on the host, either the host does not
> boot, or grub crashes with memory corruptions. Overall, not yet very stable and
> still trying to sort out the root cause of that.

By the way, enabling the interrupts to see the error notifications, I do see a
lot of retry timeout and other recoverable errors. So the issues I am seeing
could be due to my PCI cable setup that is not ideal (bad signal, ground loops,
... ?). Not sure. I do not have a PCI analyzer handy :)

I attached the patches I used to enable the EP interrupts. Enabling debug prints
will tell you what is going on. That may give you some hints on your setup ?

-- 
Damien Le Moal
Western Digital Research

View attachment "0001-arm-rk3399-rockpro64-Add-interrupts-to-PCI-endpoint-.patch" of type "text/x-patch" (1142 bytes)

View attachment "0001-PCI-rockchip-Add-client-and-subsys-interrupts-for-en.patch" of type "text/x-patch" (6301 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ