[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <ceb2759f3bdcebc4be99d8967b9292a9b85fadc7.camel@linux.ibm.com>
Date: Mon, 14 Apr 2025 11:31:00 +0530
From: Aboorva Devarajan <aboorvad@...ux.ibm.com>
To: "Rafael J. Wysocki" <rafael@...nel.org>,
Christian Loehle
<christian.loehle@....com>
Cc: Daniel Lezcano <daniel.lezcano@...aro.org>,
Gautam Menghani
<gautam@...ux.ibm.com>, linux-pm@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 1/1] cpuidle: menu: Prefer polling state for short
idle durations
On Tue, 2025-04-01 at 18:58 +0200, Rafael J. Wysocki wrote:
> On Mon, Mar 17, 2025 at 9:35 AM Christian Loehle
> <christian.loehle@....com> wrote:
> >
> > On 3/17/25 06:03, Aboorva Devarajan wrote:
> > > Avoid selecting deep idle state when the predicted idle duration is
> > > shorter than its target residency, as this leads to unnecessary state
> > > transitions without energy savings.
> > >
> > > On virtualized PowerPC (pseries) systems, where only one polling state
> > > (Snooze) and one deep state (CEDE) are available, selecting CEDE when
> > > its target residency exceeds the predicted idle duration hurts
> > > performance.
> > >
> > > For example, if the predicted idle duration is 15 us and the first
> > > non-polling state has a target residency of 120 us, selecting it
> > > would be suboptimal.
> > >
> > > Remove the condition introduced in commit 69d25870f20c
> > > ("cpuidle: fix the menu governor to boost IO performance") that
> > > prioritized non-polling states even when their target residency
> > > exceeded the predicted idle duration and allow polling states to
> > > be selected when appropriate.
> > >
> > > Performance improvement observed with pgbench on PowerPC (pseries)
> > > system:
> > > +---------------------------+------------+------------+------------+
> > > > Metric | Baseline | Patched | Change (%) |
> > > +---------------------------+------------+------------+------------+
> > > > Transactions/sec (TPS) | 494,834 | 538,707 | +8.85% |
> > > > Avg latency (ms) | 0.162 | 0.149 | -8.02% |
> > > +---------------------------+------------+------------+------------+
> > >
> > > CPUIdle state usage:
> > > +--------------+--------------+-------------+
> > > > Metric | Baseline | Patched |
> > > +--------------+--------------+-------------+
> > > > Total usage | 12,703,630 | 13,941,966 |
> > > > Above usage | 11,388,990 | 1,620,474 |
> > > > Below usage | 19,973 | 684,708 |
> > > +--------------+--------------+-------------+
> > >
> > > Above/Total and Below/Total usage percentages:
> > > +------------------------+-----------+---------+
> > > > Metric | Baseline | Patched |
> > > +------------------------+-----------+---------+
> > > > Above % (Above/Total) | 89.67% | 11.63% |
> > > > Below % (Below/Total) | 0.16% | 4.91% |
> > > > Total cpuidle miss (%) | 89.83% | 16.54% |
> > > +------------------------+-----------+---------+
> > >
> > > Signed-off-by: Aboorva Devarajan <aboorvad@...ux.ibm.com>
> > >
> > > ---
> > >
> > > v1: https://lore.kernel.org/all/20240809073120.250974-1-aboorvad@linux.ibm.com/
> > >
> > > v1 -> v2:
> > >
> > > - Drop cover letter and improve commit message.
> > > ---
> > > drivers/cpuidle/governors/menu.c | 11 -----------
> > > 1 file changed, 11 deletions(-)
> > >
> > > diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> > > index 28363bfa3e4c..4b199377e4be 100644
> > > --- a/drivers/cpuidle/governors/menu.c
> > > +++ b/drivers/cpuidle/governors/menu.c
> > > @@ -296,17 +296,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> > > idx = i; /* first enabled state */
> > >
> > > if (s->target_residency_ns > predicted_ns) {
> > > - /*
> > > - * Use a physical idle state, not busy polling, unless
> > > - * a timer is going to trigger soon enough.
> > > - */
> > > - if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
> > > - s->exit_latency_ns <= latency_req &&
> > > - s->target_residency_ns <= data->next_timer_ns) {
> > > - predicted_ns = s->target_residency_ns;
> > > - idx = i;
> > > - break;
> > > - }
> > > if (predicted_ns < TICK_NSEC)
> > > break;
> > >
> >
> > I'm still fine with this and don't see a better way to solve the reported
> > issue generally, see the discussion on v1.
> > Rafael, do you have any objections?
>
> The behavior change on x86 would be rather unacceptable I'm afraid.
>
> As already stated in a different thread today, on x86 the polling
> state turns out to be overly shallow most of the time even before this
> patch.
>
> > We could make this conditional on there being a high latency difference between
> > polling and non-polling to keep x86 behavior.
>
> If I'm not mistaken, to address the issue at hand, it would be
> sufficient to add an "s->target_residency_ns < RESIDENCY_THRESHOLD_NS"
> condition to the if () statement removed by this patch, wouldn't it?
Hi Rafael,
Thanks for your response. Since the first physical state for the pseries system
(PowerVM) has a target residency of is 120 us, adding the condition
`s->target_residency_ns < RESIDENCY_THRESHOLD_NS` will indeed resolve the issue.
This change will ensure we only switch if the target residency of the physical polling
state is below 15 us which should be good.
Basic CPU Idle test: (1) carried out on a PowerVM (Power10) system.
Base Kernel:
CPU Wakee CPU Waker Sleep Interval (us) Total Usage Diff Total Time Diff (ns) Total Above Diff Total Below Diff Above % Below %
10 39 2.585 302245 463259 5 0 0.00 0.00
10 39 4.735 1023429 2449157 7 2 0.00 0.00
10 39 7.339 1321973 2542391 6 0 0.00 0.00
10 39 12.376 798039 5429695 5 0 0.00 0.00
10 39 22.373 443568 7677710 232439 0 52.40 0.00
10 39 32.472 306655 8514874 306552 0 99.97 0.00
10 39 42.385 235004 8800135 231795 1 98.63 0.00
10 39 52.461 190079 9029907 187694 1 98.75 0.00
10 39 62.473 159704 9182151 157656 0 98.72 0.00
10 39 72.442 137826 9282009 136055 1 98.72 0.00
10 39 82.462 120114 9308294 118640 1 98.77 0.00
10 39 92.464 106081 9278734 104725 2 98.72 0.00
10 39 102.477 96663 9422966 95415 2 98.71 0.00
10 39 112.413 88117 9459074 86972 1 98.70 0.00
10 39 122.465 80942 9497520 79064 10 97.68 0.01
10 39 132.430 75825 9544063 1035 913 1.36 1.20
10 39 142.442 70511 9561624 969 843 1.37 1.20
10 39 152.449 65929 9586042 936 762 1.42 1.16
10 39 202.425 49782 9636037 960 591 1.93 1.19
10 39 302.405 33770 9749277 415 392 1.23 1.16
10 39 402.443 25403 9701667 306 292 1.20 1.15
10 39 502.433 20751 9826491 245 236 1.18 1.14
10 39 602.451 17468 9832612 208 203 1.19 1.16
10 39 702.492 15135 9847269 193 168 1.28 1.11
10 39 802.501 13369 9856279 159 151 1.19 1.13
10 39 902.507 12007 9862544 147 138 1.22 1.15
10 39 1002.499 10906 9865485 132 135 1.21 1.24
With suggested change:
CPU Wakee CPU Waker Sleep Interval (us) Total Usage Diff Total Time Diff (ns) Total Above Diff Total Below Diff Above % Below %
10 39 2.682 247169 433222 1 0 0.00 0.00
10 39 4.803 1007178 2560799 6 0 0.00 0.00
10 39 7.357 1327004 2754768 6 1 0.00 0.00
10 39 12.362 799098 5556114 6 0 0.00 0.00
10 39 22.380 443901 7402339 6 2 0.00 0.00
10 39 32.388 307357 8190282 6 1 0.00 0.00
10 39 42.423 234908 8625490 6 0 0.00 0.00
10 39 52.425 190224 8882797 0 0 0.00 0.00
10 39 62.354 159019 9000032 6 0 0.00 0.00
10 39 72.380 137887 9182585 11 4 0.01 0.00
10 39 82.412 121129 9285328 2 1 0.00 0.00
10 39 92.398 107844 9326098 3 15 0.00 0.01
10 39 102.456 97320 9408519 3 1 0.00 0.00
10 39 112.358 88884 9468555 6 4 0.01 0.00
10 39 122.402 81630 9517007 6 42 0.01 0.05
10 39 132.359 111082 9528254 38190 37018 34.38 33.32
10 39 142.368 71612 9636036 1309 1318 1.83 1.84
10 39 152.466 66791 9663690 1152 1086 1.72 1.63
10 39 202.453 50412 9745668 999 707 1.98 1.40
10 39 302.454 34045 9828326 429 427 1.26 1.25
10 39 402.480 25851 9868703 319 328 1.23 1.27
10 39 502.407 20931 9877533 273 271 1.30 1.29
10 39 602.440 17602 9878219 224 225 1.27 1.28
10 39 702.465 15254 9915818 192 196 1.26 1.28
10 39 802.452 13462 9930026 161 158 1.20 1.17
10 39 902.461 12089 9933506 144 138 1.19 1.14
10 39 1002.483 10988 9942937 130 130 1.18 1.18
I'm a bit curious on why RESIDENCY_THRESHOLD_NS is hardcoded to 15 us? Would it make sense to make this value tunable instead?
I actually have a patch that makes it configurable via sysfs to include it along with this change if you think that would be useful.
Looking forward for your comments on this.
Thanks and regards,
Aboorva
[1]: https://github.com/AboorvaDevarajan/linux-utils/tree/main/cpuidle/cpuidle_wakeup
Powered by blists - more mailing lists