Message-ID: <CAPDyKFr0-yh8wt169QanAo3AmuXBq_9p3xiiqeFmmWz-ntNQsw@mail.gmail.com>
Date: Tue, 23 Sep 2025 11:42:03 +0200
From: Ulf Hansson <ulf.hansson@...aro.org>
To: "Rafael J. Wysocki" <rafael@...nel.org>
Cc: Kevin Hilman <khilman@...libre.com>, linux-pm@...r.kernel.org,
Pavel Machek <pavel@...nel.org>, Len Brown <len.brown@...el.com>,
Daniel Lezcano <daniel.lezcano@...aro.org>, Saravana Kannan <saravanak@...gle.com>,
Maulik Shah <quic_mkshah@...cinc.com>, Prasad Sodagudi <psodagud@...cinc.com>,
linux-kernel@...r.kernel.org
Subject: Re: [RFC/PATCH 1/3] PM: QoS: Introduce a system-wakeup QoS limit
On Mon, 22 Sept 2025 at 20:55, Rafael J. Wysocki <rafael@...nel.org> wrote:
>
> On Thu, Sep 18, 2025 at 5:34 PM Ulf Hansson <ulf.hansson@...aro.org> wrote:
> >
> > On Wed, 17 Sept 2025 at 21:24, Rafael J. Wysocki <rafael@...nel.org> wrote:
> > >
> > > Hi,
> > >
> > > Sorry for the delay.
> > >
> > > On Fri, Sep 12, 2025 at 3:58 PM Ulf Hansson <ulf.hansson@...aro.org> wrote:
> > > >
> > > > On Tue, 12 Aug 2025 at 11:26, Ulf Hansson <ulf.hansson@...aro.org> wrote:
> > > > >
> > > > > On Mon, 11 Aug 2025 at 21:16, Rafael J. Wysocki <rafael@...nel.org> wrote:
> > > > > >
> > > > > > On Mon, Aug 11, 2025 at 7:16 PM Kevin Hilman <khilman@...libre.com> wrote:
> > > > > > >
> > > > > > > "Rafael J. Wysocki" <rafael@...nel.org> writes:
> > > > > > >
> > > > > > > > On Wed, Jul 16, 2025 at 2:33 PM Ulf Hansson <ulf.hansson@...aro.org> wrote:
> > > > > > > >>
> > > > > > > >> Some platforms and devices support multiple low-power states that can be
> > > > > > > >> used for system-wide suspend. Today these states are selected on a
> > > > > > > >> per-subsystem basis and in most cases the deepest possible state gets
> > > > > > > >> selected.
> > > > > > > >>
> > > > > > > >> For some use-cases this is a problem, as entering these deeper states
> > > > > > > >> during system-wide suspend isn't suitable or may even break the
> > > > > > > >> system-wakeup latency constraint.
> > > > > > > >>
> > > > > > > >> Therefore, let's introduce an interface for user-space, allowing us to
> > > > > > > >> specify the system-wakeup QoS limit. Subsequent changes will start taking
> > > > > > > >> into account the QoS limit.
> > > > > > > >
> > > > > > > > Well, this is not really a system-wakeup limit, but a CPU idle state
> > > > > > > > latency limit for states entered in the last step of suspend-to-idle.
> > > > > > > >
> > > > > > > > It looks like the problem is that the existing CPU latency QoS is not
> > > > > > > > taken into account by suspend-to-idle, so instead of adding an
> > > > > > > > entirely new interface to overcome this, would it make sense to add an
> > > > > > > > ioctl() to the existing one that would allow the user of it to
> > > > > > > > indicate that the given request should also be respected by
> > > > > > > > suspend-to-idle?
> > > > > > > >
> > > > > > > > There are two basic reasons why I think so:
> > > > > > > > (1) The requests that you want to be respected by suspend-to-idle
> > > > > > > > should also be respected by the regular "runtime" idle, or at least I
> > > > > > > > don't see a reason why it wouldn't be the case.
> > > > > > > > (2) The new interface introduced by this patch basically duplicates
> > > > > > > > the existing one.
> > > > > > >
> > > > > > > I also think that just using the existing /dev/cpu_dma_latency is the
> > > > > > > right approach here, and simply teaching s2idle to respect this value.
> > > > > > >
> > > > > > > I'm curious about the need for a new ioctl() though. Under what
> > > > > > > conditions do you want normal/runtime CPUidle to respect this value and
> > > > > > > s2idle to not respect this value?
> > > > > >
> > > > > > In a typical PC environment s2idle is a replacement for ACPI S3 which
> > > > > > does not take any QoS constraints into account, so users may want to
> > > > > > set QoS limits for run-time and then suspend with the expectation that
> > > > > > QoS will not affect it.
> > > > >
> > > > > Yes, I agree. To me, these are orthogonal use-cases which could have
> > > > > different wakeup latency constraints.
> > > > >
> > > > > Adding an ioctl for /dev/cpu_dma_latency, as suggested by Rafael would
> > > > > allow this to be managed, I think.
> > > > >
> > > > > Although, I am not fully convinced yet that re-using
> > > > > /dev/cpu_dma_latency is the right path. The main reason is that I
> > > > > don't want us to limit the use-case to CPU latencies, but rather allow
> > > > > the QoS constraint to be system-wide for any type of device. For
> > > > > example, it could be used by storage drivers too (like NVMe, UFS,
> > > > > eMMC), as a way to understand what low-power state to pick for
> > > > > system-wide suspend. If you have a closer look at patch 2 [1], I
> > > > > suggest we extend the genpd governor for *both* CPU-cluster PM
> > > > > domains and for other PM domains too.
> > > > >
> > > > > Interested to hear your thoughts around this.
> > > >
> > > > Hey, just wanted to see if you have managed to digest this and have
> > > > any further comments?
> > >
> > > The reason why I thought about reusing /dev/cpu_dma_latency is because
> > > I think that the s2idle limit should also apply to cpuidle. Of
> > > course, cpuidle may be limited further, but IMV it should observe the
> > > limit set on system suspend (it would be kind of inconsistent to allow
> > > cpuidle to use deeper idle states than can be used by s2idle).
> >
> > Agreed!
> >
> > >
> > > I also don't think that having a per-CPU s2idle limit would be
> > > particularly useful (and it might be problematic).
> > >
> > > Now, it is not as straightforward as I thought because someone may
> > > want to set a more restrictive limit on cpuidle, in which case they
> > > would need to open the same special device file twice etc and that
> > > would be quite cumbersome.
> > >
> > > So in the end I think that what you did in the $subject patch is
> > > better, but I still would like it to also affect cpuidle.
> >
> > Okay. I will update the patches according to your suggestions!
> >
> > >
> > > And it needs to be made clear that this is a limit on the resume
> > > latency of one device. Worst case, the system wakeup latency may be a
> > > sum of those limits if the devices in question are resumed
> > > sequentially, so in fact this is a limit on the contribution of a
> > > given device to the system wakeup latency.
> >
> > Indeed, that's a very good point! I will keep this in mind when
> > working on adding the documentation part.
>
> Well, this also means that using one limit for all of the different
> devices is not likely to be very practical because the goal is to save
> as much energy as reasonably possible in system suspend while
> respecting a global resume latency constraint at the same time.
>
> Using the same limit on a local contribution from each device to the
> combined latency is not likely to be effective here. Rather, I'd
> expect that the best results can be achieved by setting different
> resume latency limits on different devices, depending on how much
> power they draw in each of their idle states and what the exit latency
> values for all of those states are. In other words, this appears to
> be an optimization problem in which the resume latency limits for
> individual devices need to be chosen to satisfy the global resume
> latency constraint and minimize the total system power.
I am following your reasoning and I agree!
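
Just to spell out how I read the worst-case summation, a toy sketch
(the numbers are entirely made up, this is not a proposed
implementation): if the devices are resumed sequentially, the
per-device limits simply add up and must be chosen so that the sum
stays within the global wakeup latency budget.

#include <stdio.h>

int main(void)
{
	/* Hypothetical per-device resume latency limits, in microseconds. */
	int limits_us[] = { 500, 1500, 3000 };	/* e.g. eMMC, UFS, NVMe */
	int global_budget_us = 10000;
	int worst_case_us = 0;
	unsigned int i;

	for (i = 0; i < sizeof(limits_us) / sizeof(limits_us[0]); i++)
		worst_case_us += limits_us[i];

	printf("worst-case wakeup latency: %d us (budget %d us): %s\n",
	       worst_case_us, global_budget_us,
	       worst_case_us <= global_budget_us ? "OK" : "exceeded");
	return 0;
}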
Perhaps we should start with extending /dev/cpu_dma_latency with an
ioctl after all? This would allow userspace to specify constraints that
also apply to system-wide suspend (s2idle), but it would still be
limited to CPUs/CPU-clusters.
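
To make that concrete, here is a minimal (untested) userspace sketch.
The write() part reflects how /dev/cpu_dma_latency works today; the
ioctl is only a commented-out placeholder, since neither its name nor
its semantics have been decided:

/* Hold a CPU latency QoS request via /dev/cpu_dma_latency and (one day)
 * mark it as applicable to s2idle too. CPU_LATENCY_QOS_APPLY_S2IDLE is
 * made up for illustration and does not exist. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int32_t latency_us = 100;	/* requested CPU wakeup latency limit */
	int fd = open("/dev/cpu_dma_latency", O_RDWR);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * Writing a binary s32 (microseconds) installs/updates the CPU
	 * latency QoS request; it is held for as long as the fd is open.
	 */
	if (write(fd, &latency_us, sizeof(latency_us)) != sizeof(latency_us)) {
		perror("write");
		close(fd);
		return 1;
	}

	/*
	 * Hypothetical extension: tell the kernel that this request should
	 * also be respected by suspend-to-idle.
	 *
	 * ioctl(fd, CPU_LATENCY_QOS_APPLY_S2IDLE, 0);
	 */

	pause();	/* keep the request active */
	close(fd);
	return 0;
}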
For other devices, we should probably explore the per-device PM QoS
(pm_qos_latency_tolerance_us) instead. Currently
pm_qos_latency_tolerance_us is used for "runtime_suspend", so perhaps we
could add another per-device sysfs file, like
"pm_qos_system_wakeup_latency_us", that we can use for the system-wide
wakeup latency constraint?
Would this make better sense, do you think?
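
For illustration only, here is a rough userspace sketch of how such a
per-device file could be used. The file name is just the one suggested
above and does not exist yet; the device's power/ directory is passed
on the command line so no particular device path is assumed:

#include <stdio.h>

int main(int argc, char *argv[])
{
	char path[512];
	FILE *f;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <device power dir>\n", argv[0]);
		return 1;
	}

	/* e.g. <device power dir> = /sys/devices/.../power */
	snprintf(path, sizeof(path),
		 "%s/pm_qos_system_wakeup_latency_us", argv[1]);

	f = fopen(path, "w");
	if (!f) {
		perror("fopen");
		return 1;
	}

	/* Limit this device's contribution to the system wakeup latency. */
	fprintf(f, "%d\n", 2000);	/* microseconds */
	fclose(f);
	return 0;
}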
Kind regards
Uffe