Message-ID: <20200709163455.GA23821@mellanox.com>
Date: Thu, 9 Jul 2020 13:34:55 -0300
From: Jason Gunthorpe <jgg@...lanox.com>
To: Dan Williams <dan.j.williams@...el.com>
Cc: Christoph Hellwig <hch@...radead.org>,
linux-nvdimm <linux-nvdimm@...ts.01.org>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
"Rafael J. Wysocki" <rafael@...nel.org>,
Doug Ledford <dledford@...hat.com>,
Pavel Machek <pavel@....cz>, Len Brown <len.brown@...el.com>,
Linux ACPI <linux-acpi@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2 11/12] PM, libnvdimm: Add 'mem-quiet' state and
callback for firmware activation
On Thu, Jul 09, 2020 at 09:10:06AM -0700, Dan Williams wrote:
> On Thu, Jul 9, 2020 at 8:39 AM Jason Gunthorpe <jgg@...lanox.com> wrote:
> >
> > On Thu, Jul 09, 2020 at 04:00:51PM +0100, Christoph Hellwig wrote:
> > > On Mon, Jul 06, 2020 at 06:59:32PM -0700, Dan Williams wrote:
> > > > The runtime firmware activation capability of Intel NVDIMM devices
> > > > requires memory transactions to be disabled for 100s of microseconds.
> > > > This timeout is large enough to cause in-flight DMA to fail and other
> > > > application detectable timeouts. Arrange for firmware activation to be
> > > > executed while the system is "quiesced", all processes and device-DMA
> > > > frozen.
> > > >
> > > > It is already required that invoking device ->freeze() callbacks is
> > > > sufficient to cease DMA. A device that continues memory writes outside
> > > > of user-direction violates expectations of the PM core to be to
> > > > establish a coherent hibernation image.
> > > >
> > > > That said, RDMA devices are an example of a device that access memory
> > > > outside of user process direction.
> >
> > Are you saying freeze doesn't work for some RDMA drivers? That would
> > be a driver bug, I think.
>
> Right, it's more my hunch than a known bug at this point, but in my
> experience with testing server class hardware when I've reported a
> power management bugs I've sometimes got the incredulous response "who
> suspends / hibernates servers!?". I can drop that comment.
>
> Are there protocol timeouts that might need to be adjusted for a 100s
> of microseconds blip in memory controller response?
Survivability depends a lot on HW support: the HW has to suspend, not
discard, the DMAs that it needs to issue. Most likely things are as you
say, and the HW doesn't support a safe short-time suspend. The usual use of
PM stuff here is to make the machine ready for kexec.
Jason