[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <2f4rkilknak7xj3lyv7jx2yi7qyuw72pjwy67tkouqcvjmyn7l@scmdo34xnkuj>
Date: Tue, 26 Nov 2024 11:11:38 -0600
From: Bjorn Andersson <andersson@...nel.org>
To: Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org>
Cc: Bjorn Helgaas <helgaas@...nel.org>, kbusch@...nel.org, axboe@...nel.dk,
hch@....de, sagi@...mberg.me, linux-nvme@...ts.infradead.org,
linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org, konradybcio@...nel.org,
"Rafael J. Wysocki" <rjw@...ysocki.net>, Ulf Hansson <ulf.hansson@...aro.org>
Subject: Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by
the user
On Sat, Nov 23, 2024 at 02:31:13PM +0530, Manivannan Sadhasivam wrote:
> + Ulf (also interested in this topic)
>
> On Fri, Nov 22, 2024 at 04:20:50PM -0600, Bjorn Helgaas wrote:
> > [+cc Rafael]
> >
> > On Mon, Nov 18, 2024 at 01:53:44PM +0530, Manivannan Sadhasivam wrote:
> > > PCI core allows users to configure the D3Cold state for each PCI device
> > > through the sysfs attribute '/sys/bus/pci/devices/.../d3cold_allowed'. This
> > > attribute sets the 'pci_dev:d3cold_allowed' flag and could be used by users
> > > to allow/disallow the PCI devices to enter D3Cold during system suspend.
> > >
> > > So make use of this flag in the NVMe driver to shutdown the NVMe device
> > > during system suspend if the user has allowed D3Cold for the device.
> > > Existing checks in the NVMe driver decide whether to shut down the device
> > > (based on platform/device limitations), so use this flag as the last resort
> > > to keep the existing behavior.
> > >
> > > The default behavior of the 'pci_dev:d3cold_allowed' flag is to allow
> > > D3Cold and the users can disallow it through sysfs if they want.
> >
> > What problem does this solve? I guess there must be a case where
> > suspend leaves NVMe in a higher power state than you want?
> >
>
> Yeah, this is the case for all systems that doesn't fit into the existing checks
> in the NVMe suspend path:
>
> 1. ACPI based platforms
> 2. Controller doesn't support NPSS (hardware issue/limitation)
> 3. ASPM not enabled
> 4. Devices/systems setting NVME_QUIRK_SIMPLE_SUSPEND flag
>
> In my case, all the Qualcomm SoCs using Devicetree doesn't fall into the above
> checks. Hence, they were not fully powered down during system suspend and always
> in low power state. This means, I cannot achieve 'CX power collapse', a Qualcomm
> specific SoC powered down state that consumes just enough power to wake up the
> SoC. Since the controller driver keeps the PCI resource vote because of NVMe,
> the firmware in the Qualcomm SoCs cannot put the SoC into above mentioned low
> power state.
>
Per our previous discussions, I think we concluded that we have two
cases:
a) of_machine_is_compatible("qcom,sc8280xp")
b) Everything else
In #a we have to power down NVMe as the link doesn't survive the CX
collapse, is this your #2?. For #b it's primarily a policy decision
about the tradeoff between battery and flash life (and in some cases, as
already seen in the nvme driver, some device-specific policy).
> > What does it mean that this is the "last resort to keep the existing
> > behavior"? All the checks are OR'd together and none have side
> > effects, so the order doesn't really matter. It changes the existing
> > behavior *unless* the user has explicitly cleared d3cold_allowed via
> > sysfs.
> >
>
> Since the checks are ORed, this new check is not going to change the existing
> behavior for systems satisfying above checks i.e., even if the user changes the
> flag to forbid D3Cold, it won't affect them and it *shoudn't*.
>
> > pdev->d3cold_allowed is set by default, so I guess this change means
> > that unless the user clears d3cold_allowed, we let the PCI core decide
> > the suspend state instead of managing it directly in nvme_suspend()?
> >
>
> If 'pdev->d3cold_allowed' is set, then NVMe driver will shutdown the device and
> the PCI controller driver can turn off all bus specific resources. Otherwise,
> NVMe driver will put the device into low power mode and the controller driver
> has to do something similar to retain the device power.
>
> > > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org>
> > > ---
> > > drivers/nvme/host/pci.c | 3 ++-
> > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > > index 4b9fda0b1d9a..a4d4687854bf 100644
> > > --- a/drivers/nvme/host/pci.c
> > > +++ b/drivers/nvme/host/pci.c
> > > @@ -3287,7 +3287,8 @@ static int nvme_suspend(struct device *dev)
> > > */
> > > if (pm_suspend_via_firmware() || !ctrl->npss ||
> > > !pcie_aspm_enabled(pdev) ||
> > > - (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND))
> > > + (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND) ||
> > > + pdev->d3cold_allowed)
> > > return nvme_disable_prepare_reset(ndev, true);
> >
> > I guess your intent is that if users prevent use of D3cold via sysfs,
> > we'll use the NVMe-specific power states, and otherwise, the PCI core
> > will decide?
> >
>
> Not PCI core, but the controller drivers actually which takes care of powering
> down the PCI bus resources.
>
> > I think returning 0 here means the PCI core decides what state to use
> > in the pci_pm_suspend_noirq() -> pci_prepare_to_sleep() path. This
> > could be any state from D0 to D3cold depending on platform support and
> > wakeup considerations (see pci_target_state()).
> >
> > I'm not sure the use of pdev->d3cold_allowed here really expresses
> > your underlying intent. It suggests that you're really hoping for
> > D3cold, but that's only a possibility if firmware supports it, and we
> > have no visibility into that here.
> >
>
> I'm not relying on firmware to do anything here. If firmware has to decide the
> suspend state, it should already satisfy the pm_suspend_via_firmware() check in
> nvme_suspend(). Here, the controller driver takes care of putting the device
> into D3Cold. Currently, the controller drivers cannot do it (on DT platforms)
> because of NVMe driver's behavior.
>
> > It also seems contrary to the earlier comment that suggests we prefer
> > host managed nvme power settings:
> >
> > * The platform does not remove power for a kernel managed suspend so
> > * use host managed nvme power settings for lowest idle power if
> > * possible. This should have quicker resume latency than a full device
> > * shutdown. But if the firmware is involved after the suspend or the
> > * device does not support any non-default power states, shut down the
> > * device fully.
>
> This above comment satisfies the ACPI platforms as the firmware controls the
> suspend state. On DT platforms, even though the firmware takes care of suspend
> state, it still relies on the controller driver to relinquish the votes for PCI
> resources. Only then, the firmware will put the whole SoC in power down mode
> a.k.a CX power collapse mode in Qcom SoCs.
>
> We did attempt so solve this problem in multiple ways, but the lesson learned
> was, kernel cannot decide the power mode without help from userspace. That's the
> reason I wanted to make use of this 'd3cold_allowed' sysfs attribute to allow
> userspace to override the D3Cold if it wants based on platform requirement.
>
What do you mean "without help from userspace". Which entity in
userspace is going to help make this decision?
> This is similar to how UFS allows users to configure power states of both the
> device and controller:
>
"Allow users to configure" is an optimization, is the purpose of this
patch to allow similar kind of optimization? Or is it supposed to allow
userspace to help solve case #a above (hardware doesn't survive CX
collapse)?
Regards,
Bjorn
> /sys/bus/platform/drivers/ufshcd/*/spm_lvl
> /sys/bus/platform/devices/*.ufs/spm_lvl
>
> If the 'd3cold_allowed' attribute is not a good fit for this usecase, then I'd
> like to introduce a new attribute similar to UFS.
>
> - Mani
>
> --
> மணிவண்ணன் சதாசிவம்
Powered by blists - more mailing lists