Message-ID: <CAOSf1CHo66dxmChrx97+tfKSE=JM_NzrgdUF_Y4kFabnu3qotQ@mail.gmail.com>
Date: Wed, 7 Aug 2024 22:29:35 +1000
From: "Oliver O'Halloran" <oohall@...il.com>
To: "Maciej W. Rozycki" <macro@...am.me.uk>
Cc: Matthew W Carlis <mattc@...estorage.com>, Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com>,
linux-pci@...r.kernel.org, mahesh@...ux.ibm.com, edumazet@...gle.com,
sr@...x.de, leon@...nel.org, linux-rdma@...r.kernel.org, helgaas@...nel.org,
kuba@...nel.org, pabeni@...hat.com, Jim Wilson <wilson@...iptree.org>,
linuxppc-dev@...ts.ozlabs.org, npiggin@...il.com, alex.williamson@...hat.com,
Bjorn Helgaas <bhelgaas@...gle.com>, mika.westerberg@...ux.intel.com,
david.abdurachmanov@...il.com, saeedm@...dia.com,
linux-kernel@...r.kernel.org, lukas@...ner.de, netdev@...r.kernel.org,
pali@...nel.org, "David S. Miller" <davem@...emloft.net>
Subject: Re: PCI: Work around PCIe link training failures
On Wed, Aug 7, 2024 at 9:14 PM Maciej W. Rozycki <macro@...am.me.uk> wrote:
>
> On Wed, 7 Aug 2024, Matthew W Carlis wrote:
>
> > > it does seem like this series made ASMedia ASM2824 work better but
> > > caused regressions elsewhere, so maybe we just need to accept that
> > > ASM2824 is slightly broken and doesn't work as well as it should.
> >
> > One of my colleagues challenged me to provide a more concrete example
> > where the change will cause problems. One such configuration would be not
> > implementing the Power Controller Control in the Slot Capabilities Register.
> > Then, Powering off the slot via out-of-band interfaces would result in the
> > kernel forcing the DSP to Gen1 100% of the time as far as I can tell.
> > The aspect of this force to Gen1 that is the most concerning to my team is
> > that it isn't cleaned up even if we replaced the EP with some other EP.
>
> Why does that happen?
>
> For the quirk to trigger, the link has to be down and there has to be the
> LBMS Link Status bit set from link management events as per the PCIe spec
> while the link was previously up, and then both of that while rescanning
> the PCIe device in question, so there's a lot of conditions to meet. Is
> it the case that in your setup there is no device at this point, but one
> gets plugged in later?
My read was that Matt is essentially doing a surprise hot-unplug by
removing power to the card without notifying the OS. I thought the
LBMS bit wouldn't be set in that case, since the link goes down rather
than changing speed, but the spec is a little vague and that appears to
be happening in Matt's testing. It might be worth disabling the
workaround if the port has the Hot-Plug Surprise capability bit set.
It's fairly common for ports on NVMe drive backplanes to have it set
and a lot of people would be unhappy about those being forced to Gen 1
by accident.
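
To make the suggestion concrete, here is a rough, untested sketch of the check I have in mind. It is not the actual quirk code: should_force_gen1() is a made-up helper, and it takes the raw Link Status and Slot Capabilities values as arguments rather than reading config space. The bit values are the ones I believe are in include/uapi/linux/pci_regs.h. It only models the "link down with LBMS set" condition described above, plus the proposed opt-out for surprise-hotplug ports.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values as found in the kernel's include/uapi/linux/pci_regs.h. */
#define PCI_EXP_SLTCAP_HPS   0x00000020 /* Slot Capabilities: Hot-Plug Surprise */
#define PCI_EXP_LNKSTA_DLLLA 0x2000     /* Link Status: Data Link Layer Link Active */
#define PCI_EXP_LNKSTA_LBMS  0x4000     /* Link Status: Link Bandwidth Management Status */

/*
 * Hypothetical helper, for illustration only.
 * Returns true when the Gen1 retrain workaround would be applied:
 * the link is down, LBMS is set, and the port does not advertise
 * surprise hotplug.
 */
static bool should_force_gen1(uint16_t lnksta, uint32_t sltcap)
{
	bool link_down = (lnksta & PCI_EXP_LNKSTA_DLLLA) == 0;
	bool lbms_set = (lnksta & PCI_EXP_LNKSTA_LBMS) != 0;

	/* Proposed opt-out: a surprise removal can look like a training failure. */
	if (sltcap & PCI_EXP_SLTCAP_HPS)
		return false;

	return link_down && lbms_set;
}
```

With this, a port reporting LBMS with the link down would still get the workaround, unless its slot advertises Hot-Plug Surprise, in which case it would be left at its normal target speed.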