[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAJZ5v0hLYkeStuZqUsjSphXmBovAxCvvzx4JJJw=3AmDYjdCtQ@mail.gmail.com>
Date: Wed, 10 Sep 2025 12:47:46 +0200
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Aboorva Devarajan <aboorvad@...ux.ibm.com>
Cc: rafael@...nel.org, christian.loehle@....com, daniel.lezcano@...aro.org,
gautam@...ux.ibm.com, linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 1/1] cpuidle: menu: Add residency threshold for
non-polling state selection
Please change the subject of the patch to something like "cpuidle:
menu: Use residency threshold in polling state override decisions"
which more precisely reflects the patch purpose IMV.
On Mon, Sep 8, 2025 at 9:54 AM Aboorva Devarajan <aboorvad@...ux.ibm.com> wrote:
>
> On virtualized PowerPC (pseries) systems, where only one polling state
> (Snooze) and one deep state (CEDE) are available, selecting CEDE when
> the predicted idle duration exceeds the target residency of the CEDE
If the target residency is exceeded by the predicted idle duration, it
should be fine to select the given state.
Did you really mean "less than" here? That would be consistent with
the code change.
> state can hurt performance. In such cases, the entry/exit overhead of
> CEDE outweighs the power savings, leading to unnecessary state transitions
> and higher latency.
>
> Menu governor currently contains a special-case rule that prioritizes
> the first non-polling state over polling, even when its target residency
> is much longer than the predicted idle duration. On PowerPC/pseries,
> where the gap between the polling state (Snooze) and the first non-polling
> state (CEDE) is large, this behavior causes performance regressions.
>
> This patch refines the special case by adding an extra requirement:
> the first non-polling state may only be chosen if its
> target_residency_ns is below the defined RESIDENCY_THRESHOLD_NS. If this
> condition is not met, the non-polling state is not selected, and polling
> state is retained instead.
>
> This change is limited to the single special-case condition for the first
> non-polling state. The general state selection logic in the menu governor
> remains unchanged.
>
> Performance improvement observed with pgbench on PowerPC (pseries)
> system:
> +---------------------------+------------+------------+------------+
> | Metric | Baseline | Patched | Change (%) |
> +---------------------------+------------+------------+------------+
> | Transactions/sec (TPS) | 495,210 | 536,982 | +8.45% |
> | Avg latency (ms) | 0.163 | 0.150 | -7.98% |
> +---------------------------+------------+------------+------------+
> CPUIdle state usage:
> +--------------+--------------+-------------+
> | Metric | Baseline | Patched |
> +--------------+--------------+-------------+
> | Total usage | 12,735,820 | 13,918,442 |
> | Above usage | 11,401,520 | 1,598,210 |
> | Below usage | 20,145 | 702,395 |
> +--------------+--------------+-------------+
>
> Above/Total and Below/Total usage percentages which indicates
> mispredictions:
> +------------------------+-----------+---------+
> | Metric | Baseline | Patched |
> +------------------------+-----------+---------+
> | Above % (Above/Total) | 89.56% | 11.49% |
> | Below % (Below/Total) | 0.16% | 5.05% |
> | Total cpuidle miss (%) | 89.72% | 16.54% |
> +------------------------+-----------+---------+
>
> The results show that restricting non-polling state selection to
> cases where its residency is within the threshold reduces mispredictions,
> lowers unnecessary state transitions, and improves overall throughput.
>
> Signed-off-by: Aboorva Devarajan <aboorvad@...ux.ibm.com>
> ---
>
> v2: https://lore.kernel.org/all/20250317060357.29451-1-aboorvad@linux.ibm.com/
>
> Changes in v2 -> v3:
> - Modifed the patch following Rafael's feedback, incorporated a residency threshold check
> (s->target_residency_ns < RESIDENCY_THRESHOLD_NS) as suggested.
> - Updated commit message accordingly.
> ---
> drivers/cpuidle/governors/menu.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> index b2e3d0b0a116..d25b04539109 100644
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -316,11 +316,13 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>
> if (s->target_residency_ns > predicted_ns) {
> /*
> - * Use a physical idle state, not busy polling, unless
> - * a timer is going to trigger soon enough.
> + * Use a physical idle state instead of busy polling
> + * if the next timer doesn't expire soon and its
> + * target residency is below the residency threshold.
I would rephrase this somewhat, like this:
* Use a physical idle state instead of busy polling so long as
* its target residency is below the residency threshold and the
* next timer doesn't expire soon.
> */
> if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
> - s->target_residency_ns <= data->next_timer_ns) {
> + s->target_residency_ns <= data->next_timer_ns &&
> + s->target_residency_ns < RESIDENCY_THRESHOLD_NS) {
And maybe adjust the checks ordering here.
The point is that on the example platform in question
s->target_residency_ns is always above RESIDENCY_THRESHOLD_NS, so it
is never really necessary to check data->next_timer_ns in which case
the HW should be able to optimize this.
> predicted_ns = s->target_residency_ns;
> idx = i;
> break;
> --
Powered by blists - more mailing lists