Message-ID: <CAAd53p75d=ibfFRCLmYOMvfrn7XbDajby1shKdWQWW=DOrX3uw@mail.gmail.com>
Date: Fri, 23 Jul 2021 15:05:12 +0800
From: Kai-Heng Feng <kai.heng.feng@...onical.com>
To: Christoph Hellwig <hch@...radead.org>
Cc: Bjorn Helgaas <helgaas@...nel.org>, Joerg Roedel <jroedel@...e.de>,
"open list:PCI ENHANCED ERROR HANDLING (EEH) FOR POWERPC"
<linuxppc-dev@...ts.ozlabs.org>,
"open list:PCI SUBSYSTEM" <linux-pci@...r.kernel.org>,
open list <linux-kernel@...r.kernel.org>,
Lalithambika Krishnakumar <lalithambika.krishnakumar@...el.com>,
Alex Williamson <alex.williamson@...hat.com>,
"Oliver O'Halloran" <oohall@...il.com>,
Bjorn Helgaas <bhelgaas@...gle.com>,
Mika Westerberg <mika.westerberg@...ux.intel.com>,
Lu Baolu <baolu.lu@...ux.intel.com>
Subject: Re: [PATCH 1/2] PCI/AER: Disable AER interrupt during suspend
On Fri, Jul 23, 2021 at 1:24 PM Christoph Hellwig <hch@...radead.org> wrote:
>
> On Thu, Jul 22, 2021 at 05:23:51PM -0500, Bjorn Helgaas wrote:
> > Marking both of these as "not applicable" for now because I don't
> > think we really understand what's going on.
> >
> > Apparently a DMA occurs during suspend or resume and triggers an ACS
> > violation. I don't think such a DMA should occur in the first
> > place.
> >
> > Or maybe, since you say the problem happens right after ACS is enabled
> > during resume, we're doing the ACS enable incorrectly? Although I
> > would think we should not be doing DMA at the same time we're enabling
> > ACS, either.
> >
> > If this really is a system firmware issue, both HP and Dell should
> > have the knowledge and equipment to figure out what's going on.
>
> DMA on resume sounds really odd. OTOH the below mentioned case of
> a DMA during suspend seems very likely in some setups. NVMe has the
> concept of a host memory buffer (HMB) that allows the PCIe device
> to use arbitrary host memory for internal purposes. Combine this
> with the "Storage D3" misfeature in modern x86 platforms that force
> a slot into d3cold without consulting the driver first and you'd see
> symptoms like this. Another case would be the NVMe equivalent of the
> AER which could lead to a completion without host activity.
The issue can also be observed on non-HMB NVMe.
>
> We now have quirks in the ACPI layer and NVMe to fully shut down the
> NVMe controllers on these messed up systems with the "Storage D3"
> misfeature which should avoid such "spurious" DMAs at the cost of
> wearing out the device much faster.
Since the issue is on S3, I think the NVMe always fully shuts down.

Kai-Heng