[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJZ5v0grRdTYaamSnKSF-HyMmCHDEZ4haLo+ziSBxhDg1PbjRQ@mail.gmail.com>
Date: Mon, 16 Dec 2024 20:34:24 +0100
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>, Christoph Hellwig <hch@....de>, Ulf Hansson <ulf.hansson@...aro.org>,
"Rafael J. Wysocki" <rjw@...ysocki.net>, Bjorn Helgaas <helgaas@...nel.org>, kbusch@...nel.org,
axboe@...nel.dk, sagi@...mberg.me, linux-nvme@...ts.infradead.org,
linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org, andersson@...nel.org,
konradybcio@...nel.org, Len Brown <len.brown@...el.com>, linux-pm@...r.kernel.org
Subject: Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
On Mon, Dec 16, 2024 at 6:52 PM Manivannan Sadhasivam
<manivannan.sadhasivam@...aro.org> wrote:
>
> On Mon, Dec 16, 2024 at 06:35:52PM +0100, Rafael J. Wysocki wrote:
> > On Mon, Dec 16, 2024 at 6:11 PM Manivannan Sadhasivam
> > <manivannan.sadhasivam@...aro.org> wrote:
> > >
> > > On Mon, Dec 16, 2024 at 05:24:40PM +0100, Rafael J. Wysocki wrote:
> > > > On Sat, Dec 14, 2024 at 7:30 AM Manivannan Sadhasivam
> > > > <manivannan.sadhasivam@...aro.org> wrote:
> > > > >
> > > > > On Fri, Dec 13, 2024 at 03:35:15PM +0100, Rafael J. Wysocki wrote:
> > > > > > On Thu, Dec 12, 2024 at 4:14 PM Christoph Hellwig <hch@....de> wrote:
> > > > > > >
> > > > > > > On Thu, Dec 12, 2024 at 01:49:15PM +0100, Ulf Hansson wrote:
> > > > > > > > Right. This seems to somewhat work for ACPI types of systems, because
> > > > > > > > ACPI is controlling the low power state for all the devices. Based on
> > > > > > > > the requested system wide low power state, ACPI can then decide to
> > > > > > > > call pm_set_suspend_via_firmware() or not.
> > > > > > > >
> > > > > > > > Still there is a problem with this for ACPI too.
> > > > > > > >
> > > > > > > > How does ACPI know whether it's actually a good idea to keep the NVMe
> > > > > > > > storage powered in s2idle (ACPI calls pm_set_suspend_via_firmware()
> > > > > > > > only for S2R and S2disk!?)? Especially when my laptop only supports
> > > > > > > > s2idle and that's what I will use when I close the lid. In this way,
> > > > > > > > the NMVe storage will certainly contribute to draining the battery,
> > > > > > > > especially when I won't be using my laptop for a couple of days.
> > > > > > > >
> > > > > > > > In my opinion, we need a better approach that is both flexible and
> > > > > > > > that dynamically adjusts based upon the use case.
> > > > > > >
> > > > > > > Agreed. I'd be happy to work with the PM maintainers to do this,
> > > > > > > but I don't really know enough about the PM core to drive it
> > > > > > > (as the reply from Rafael to my mail makes pretty clear :))
> > > > > >
> > > > > > I'm here to help.
> > > > > >
> > > > > > Let me know what exactly you want to achieve and we'll see how to make it work.
> > > > >
> > > > > I'll try to summarize the requirement here since I started this thread:
> > > > >
> > > > > Problem statement
> > > > > =================
> > > > >
> > > > > We need a PM core API that tells the device drivers when it is safe to powerdown
> > > > > the devices. The usecase here is with PCIe based NVMe devices but the problem is
> > > > > applicable to other devices as well.
> > > > >
> > > > > Drivers are relying on couple of options now:
> > > > >
> > > > > 1. If pm_suspend_via_firmware() returns true, then drivers will shutdown the
> > > > > device assuming that the firmware is going to handle the suspend. But this API
> > > > > is currently used only by ACPI. Even there, ACPI relies on S2R being supported
> > > > > by the platform and it sets pm_set_suspend_via_firmware() only when the suspend
> > > > > is S2R. But if the platform doesn't support S2R (current case of most of the
> > > > > Qcom SoCs), then pm_suspend_via_firmware() will return false and NVMe won't be
> > > > > powered down draining the battery.
> > > >
> > > > So my question here would be why is it not powered down always during
> > > > system-wide suspend?
> > > >
> > > > Why exactly is it necessary to distinguish one case from the other
> > > > (assuming that we are talking about system-wide suspend only)?
> > > >
> > >
> > > To support Android like usecase with firmware that only supports
> > > suspend-to-idle (Qcom platforms). This usecase is not applicable right now, but
> > > one can't just rule out the possibility in the near future.
> >
> > This doesn't explain anything to me, sorry.
> >
> > > And the problem is that we need to support both Android and non-Android systems
> > > with same firmware :/
> >
> > So what technically is the problem?
> >
>
> NVMe wear out is the problem.
>
> > > > There are drivers that use pm_suspend_via_firmware() to check whether
> > > > or not something special needs to be done to the device because if
> > > > "false" is returned, the platform firmware is not going to remove
> > > > power from it.
> > > >
> > > > However, you seem to be talking about the opposite, so doing something
> > > > special to the device if "true" is returned. I'm not sure why this is
> > > > necessary.
> > > >
> > >
> > > Because, since 'false' is returned, drivers like NVMe are assuming that the
> > > platform won't remove power on all DT systems and they just keep the devices in
> > > low power state (not powering them down). But why would I want my NVMe in DT
> > > based laptop to be always powered in system suspend?
> >
> > Because it causes the system to use less energy when suspended.
> >
>
> But the NVMe driver would be still operational. Powering it down would cause the
> system to use much less energy.
Yes, that's what I wanted to say, sorry for the confusion.
IIRC, there are two reasons why the NVMe driver does this:
(1) On some platforms the devices handled by it couldn't come back
from D3cold without platform firmware's help.
(2) Putting devices into D3hot alone was not sufficient to reduce
their power sufficiently and it didn't make much difference on top of
the other things done to them for this purpose.
It avoids putting devices into D3cold (or even into D3hot for that
matter) because of the above, not because it is generally required to
do so in suspend-to-idle. So this really is exceptional behavior and
not following a rule of some kind.
> There is also a case where some devices like
> (Laptops made out of Qcom SCX Gen3 SoCs) require all the PCIe devices to be
> powered down in order for the SoC to reach its low power state (CX power
> collapse in Qcom terms). If not, the SoC would continue to draw more power
> leading to battery draining quickly. This platform is supported in upstream and
> we keep the PCIe interconnect voted during suspend as the NVMe driver is
> expecting the device to retain its state during resume. Because of this
> requirement, this platform is not reaching SoC level low power state with
> upstream kernel.
OK, now all of this makes sense and that's why you really want NVMe
devices to end up in some form of PCI D3 in suspend-to-idle.
Would D3hot be sufficient for this platform or does it need to be
D3cold? If the latter, what exactly is the method by which they are
put into D3cold?
> > > > > If the platform is using DT, then there is no entity setting
> > > > > pm_set_suspend_via_firmware().
> > > >
> > > > That's true and so the assumption is that in this case the handling of
> > > > all devices will always be the same regardless of which flavor of
> > > > system suspend is chosen by user space.
> > > >
> > >
> > > Right and that's why the above concern is raised.
> >
> > And it is still unclear to me what the problem with it is.
> >
> > What exactly can go wrong?
> >
> > > > > So NVMe will be kept in low power state all the
> > > > > time (still draining the battery).
> > > >
> > > > So what would be the problem with powering it down unconditionally?
> > > >
> > >
> > > I'm not sure how would you do that (by checking dev_of_node()?). Even so, it
> > > will wear out the NVMe devices if used in Android tablets etc...
> >
> > I understand the wear-out concern.
> >
> > Is there anything else?
> >
>
> No, that's the only concern.
OK
I think we're getting to the bottom of the issue.
First off, there really is no requirement to avoid putting PCI devices
into D3hot or D3cold during suspend-to-idle. On all modern Intel
platforms, PCIe devices are put into D3(hot or cold) during
suspend-to-idle and I don't see why this should be any different on
platforms from any other vendors.
The NVMe driver is an exception and it avoids D3(hot or cold) during
suspend-to-idle because of problems with some hardware or platforms.
It might in principle allow devices to go into D3(hot or cold) during
suspend-to-idle, so long as it knows that this is going to work.
However, there is an additional concern that putting an NVMe device
into D3cold every time during system suspend on Android might cause it
to wear out more quickly.
Is there anything else?
Powered by blists - more mailing lists