[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAJZ5v0hCvvMzPCy+8N34vCt_d4mdeuMVHR9dQbs4cHwQDOpYrw@mail.gmail.com>
Date: Tue, 21 Oct 2025 21:25:22 +0200
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Aboorva Devarajan <aboorvad@...ux.ibm.com>
Cc: rafael@...nel.org, christian.loehle@....com, daniel.lezcano@...aro.org,
gautam@...ux.ibm.com, linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4] cpuidle: menu: Use residency threshold in polling
state override decisions
On Mon, Oct 6, 2025 at 3:40 AM Aboorva Devarajan <aboorvad@...ux.ibm.com> wrote:
>
> On virtualized PowerPC (pseries) systems, where only one polling state
> (Snooze) and one deep state (CEDE) are available, selecting CEDE when
> the predicted idle duration is less than the target residency of CEDE
> state can hurt performance. In such cases, the entry/exit overhead of
> CEDE outweighs the power savings, leading to unnecessary state
> transitions and higher latency.
>
> Menu governor currently contains a special-case rule that prioritizes
> the first non-polling state over polling, even when its target residency
> is much longer than the predicted idle duration. On PowerPC/pseries,
> where the gap between the polling state (Snooze) and the first non-polling
> state (CEDE) is large, this behavior causes performance regressions.
>
> This patch refines the special case by adding an extra requirement:
> the first non-polling state can only be chosen if its
> target residency is below the defined RESIDENCY_THRESHOLD_NS. If
> this condition is not satisfied, polling is allowed instead, avoiding
> suboptimal non-polling state entries.
>
> This change is limited to the single special-case rule for the first
> non-polling state. The general non-polling state selection logic in the
> menu governor remains unchanged.
>
> Performance improvement observed with pgbench on PowerPC (pseries)
> system:
> +---------------------------+------------+------------+------------+
> | Metric | Baseline | Patched | Change (%) |
> +---------------------------+------------+------------+------------+
> | Transactions/sec (TPS) | 495,210 | 536,982 | +8.45% |
> | Avg latency (ms) | 0.163 | 0.150 | -7.98% |
> +---------------------------+------------+------------+------------+
> CPUIdle state usage:
> +--------------+--------------+-------------+
> | Metric | Baseline | Patched |
> +--------------+--------------+-------------+
> | Total usage | 12,735,820 | 13,918,442 |
> | Above usage | 11,401,520 | 1,598,210 |
> | Below usage | 20,145 | 702,395 |
> +--------------+--------------+-------------+
>
> Above/Total and Below/Total usage percentages:
> +------------------------+-----------+---------+
> | Metric | Baseline | Patched |
> +------------------------+-----------+---------+
> | Above % (Above/Total) | 89.56% | 11.49% |
> | Below % (Below/Total) | 0.16% | 5.05% |
> | Total cpuidle miss (%) | 89.72% | 16.54% |
> +------------------------+-----------+---------+
>
> The results indicate that restricting CEDE selection to cases where
> its residency matches the predicted idle time reduces mispredictions,
> lowers unnecessary state transitions, and improves overall throughput.
>
> Reviewed-by: Christian Loehle <christian.loehle@....com>
> Signed-off-by: Aboorva Devarajan <aboorvad@...ux.ibm.com>
> ---
>
> v3: https://lore.kernel.org/all/20250908075443.208570-1-aboorvad@linux.ibm.com/
>
> v3 -> v4
>
> - Rebased onto the linux-pm/pm branch.
> - Updated commit message and comments based on review feedback.
> - Reordered condition checks as recommended in review.
> - Added Reviewed-by tag from Christian.
>
> ---
> drivers/cpuidle/governors/menu.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> index 4d9aa5ce31f0..6a98a724442e 100644
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -320,10 +320,12 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> }
>
> /*
> - * Use a physical idle state, not busy polling, unless a timer
> - * is going to trigger soon enough.
> + * Use a physical idle state instead of busy polling as long as
> + * its target residency is below the residency threshold and the
> + * next timer doesn't expire soon.
> */
> if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
> + s->target_residency_ns < RESIDENCY_THRESHOLD_NS &&
> s->target_residency_ns <= data->next_timer_ns) {
> predicted_ns = s->target_residency_ns;
> idx = i;
> --
Applied as 6.19 material, thanks!
Powered by blists - more mailing lists