lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <ceb2759f3bdcebc4be99d8967b9292a9b85fadc7.camel@linux.ibm.com>
Date: Mon, 14 Apr 2025 11:31:00 +0530
From: Aboorva Devarajan <aboorvad@...ux.ibm.com>
To: "Rafael J. Wysocki" <rafael@...nel.org>,
        Christian Loehle
	 <christian.loehle@....com>
Cc: Daniel Lezcano <daniel.lezcano@...aro.org>,
        Gautam Menghani	
 <gautam@...ux.ibm.com>, linux-pm@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 1/1] cpuidle: menu: Prefer polling state for short
 idle durations

On Tue, 2025-04-01 at 18:58 +0200, Rafael J. Wysocki wrote:
> On Mon, Mar 17, 2025 at 9:35 AM Christian Loehle
> <christian.loehle@....com> wrote:
> > 
> > On 3/17/25 06:03, Aboorva Devarajan wrote:
> > > Avoid selecting deep idle state when the predicted idle duration is
> > > shorter than its target residency, as this leads to unnecessary state
> > > transitions without energy savings.
> > > 
> > > On virtualized PowerPC (pseries) systems, where only one polling state
> > > (Snooze) and one deep state (CEDE) are available, selecting CEDE when
> > > its target residency exceeds the predicted idle duration hurts
> > > performance.
> > > 
> > > For example, if the predicted idle duration is 15 us and the first
> > > non-polling state has a target residency of 120 us, selecting it
> > > would be suboptimal.
> > > 
> > > Remove the condition introduced in commit 69d25870f20c
> > > ("cpuidle: fix the menu governor to boost IO performance") that
> > > prioritized non-polling states even when their target residency
> > > exceeded the predicted idle duration and allow polling states to
> > > be selected when appropriate.
> > > 
> > > Performance improvement observed with pgbench on PowerPC (pseries)
> > > system:
> > > +---------------------------+------------+------------+------------+
> > > > Metric                    | Baseline   | Patched    | Change (%) |
> > > +---------------------------+------------+------------+------------+
> > > > Transactions/sec (TPS)    | 494,834    | 538,707    | +8.85%     |
> > > > Avg latency (ms)          | 0.162      | 0.149      | -8.02%     |
> > > +---------------------------+------------+------------+------------+
> > > 
> > > CPUIdle state usage:
> > > +--------------+--------------+-------------+
> > > > Metric       | Baseline     | Patched     |
> > > +--------------+--------------+-------------+
> > > > Total usage  | 12,703,630   | 13,941,966  |
> > > > Above usage  | 11,388,990   | 1,620,474   |
> > > > Below usage  | 19,973       | 684,708     |
> > > +--------------+--------------+-------------+
> > > 
> > > Above/Total and Below/Total usage percentages:
> > > +------------------------+-----------+---------+
> > > > Metric                 | Baseline  | Patched |
> > > +------------------------+-----------+---------+
> > > > Above % (Above/Total)  | 89.67%    | 11.63%  |
> > > > Below % (Below/Total)  | 0.16%     | 4.91%   |
> > > > Total cpuidle miss (%) | 89.83%    | 16.54%  |
> > > +------------------------+-----------+---------+
> > > 
> > > Signed-off-by: Aboorva Devarajan <aboorvad@...ux.ibm.com>
> > > 
> > > ---
> > > 
> > > v1: https://lore.kernel.org/all/20240809073120.250974-1-aboorvad@linux.ibm.com/
> > > 
> > > v1 -> v2:
> > > 
> > > - Drop cover letter and improve commit message.
> > > ---
> > >  drivers/cpuidle/governors/menu.c | 11 -----------
> > >  1 file changed, 11 deletions(-)
> > > 
> > > diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> > > index 28363bfa3e4c..4b199377e4be 100644
> > > --- a/drivers/cpuidle/governors/menu.c
> > > +++ b/drivers/cpuidle/governors/menu.c
> > > @@ -296,17 +296,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> > >                       idx = i; /* first enabled state */
> > > 
> > >               if (s->target_residency_ns > predicted_ns) {
> > > -                     /*
> > > -                      * Use a physical idle state, not busy polling, unless
> > > -                      * a timer is going to trigger soon enough.
> > > -                      */
> > > -                     if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
> > > -                         s->exit_latency_ns <= latency_req &&
> > > -                         s->target_residency_ns <= data->next_timer_ns) {
> > > -                             predicted_ns = s->target_residency_ns;
> > > -                             idx = i;
> > > -                             break;
> > > -                     }
> > >                       if (predicted_ns < TICK_NSEC)
> > >                               break;
> > > 
> > 
> > I'm still fine with this and don't see a better way to solve the reported
> > issue generally, see the discussion on v1.
> > Rafael, do you have any objections?
> 
> The behavior change on x86 would be rather unacceptable I'm afraid.
> 
> As already stated in a different thread today, on x86 the polling
> state turns out to be overly shallow most of the time even before this
> patch.
> 
> > We could make this conditional on there being a high latency difference between
> > polling and non-polling to keep x86 behavior.
> 
> If I'm not mistaken, to address the issue at hand, it would be
> sufficient to add an "s->target_residency_ns < RESIDENCY_THRESHOLD_NS"
> condition to the if () statement removed by this patch, wouldn't it?

Hi Rafael,

Thanks for your response. Since the first physical state for the pseries system
(PowerVM) has a target residency of is 120 us, adding the condition
`s->target_residency_ns < RESIDENCY_THRESHOLD_NS` will indeed resolve the issue.

This change will ensure we only switch if the target residency of the physical polling
state is below 15 us which should be good.


Basic CPU Idle test: (1) carried out on a PowerVM (Power10) system.

Base Kernel:   
 CPU Wakee  CPU Waker  Sleep Interval (us)  Total Usage Diff  Total Time Diff (ns)  Total Above Diff  Total Below Diff  Above %  Below %
        10         39                2.585            302245                463259                 5                 0     0.00     0.00
        10         39                4.735           1023429               2449157                 7                 2     0.00     0.00
        10         39                7.339           1321973               2542391                 6                 0     0.00     0.00
        10         39               12.376            798039               5429695                 5                 0     0.00     0.00
        10         39               22.373            443568               7677710            232439                 0    52.40     0.00
        10         39               32.472            306655               8514874            306552                 0    99.97     0.00
        10         39               42.385            235004               8800135            231795                 1    98.63     0.00
        10         39               52.461            190079               9029907            187694                 1    98.75     0.00
        10         39               62.473            159704               9182151            157656                 0    98.72     0.00
        10         39               72.442            137826               9282009            136055                 1    98.72     0.00
        10         39               82.462            120114               9308294            118640                 1    98.77     0.00
        10         39               92.464            106081               9278734            104725                 2    98.72     0.00
        10         39              102.477             96663               9422966             95415                 2    98.71     0.00
        10         39              112.413             88117               9459074             86972                 1    98.70     0.00
        10         39              122.465             80942               9497520             79064                10    97.68     0.01
        10         39              132.430             75825               9544063              1035               913     1.36     1.20
        10         39              142.442             70511               9561624               969               843     1.37     1.20
        10         39              152.449             65929               9586042               936               762     1.42     1.16
        10         39              202.425             49782               9636037               960               591     1.93     1.19
        10         39              302.405             33770               9749277               415               392     1.23     1.16
        10         39              402.443             25403               9701667               306               292     1.20     1.15
        10         39              502.433             20751               9826491               245               236     1.18     1.14
        10         39              602.451             17468               9832612               208               203     1.19     1.16
        10         39              702.492             15135               9847269               193               168     1.28     1.11
        10         39              802.501             13369               9856279               159               151     1.19     1.13
        10         39              902.507             12007               9862544               147               138     1.22     1.15
        10         39             1002.499             10906               9865485               132               135     1.21     1.24
        
With suggested change:     
 CPU Wakee  CPU Waker  Sleep Interval (us)  Total Usage Diff  Total Time Diff (ns)  Total Above Diff  Total Below Diff  Above %  Below %
        10         39                2.682            247169                433222                 1                 0     0.00     0.00
        10         39                4.803           1007178               2560799                 6                 0     0.00     0.00
        10         39                7.357           1327004               2754768                 6                 1     0.00     0.00
        10         39               12.362            799098               5556114                 6                 0     0.00     0.00
        10         39               22.380            443901               7402339                 6                 2     0.00     0.00
        10         39               32.388            307357               8190282                 6                 1     0.00     0.00
        10         39               42.423            234908               8625490                 6                 0     0.00     0.00
        10         39               52.425            190224               8882797                 0                 0     0.00     0.00
        10         39               62.354            159019               9000032                 6                 0     0.00     0.00
        10         39               72.380            137887               9182585                11                 4     0.01     0.00
        10         39               82.412            121129               9285328                 2                 1     0.00     0.00
        10         39               92.398            107844               9326098                 3                15     0.00     0.01
        10         39              102.456             97320               9408519                 3                 1     0.00     0.00
        10         39              112.358             88884               9468555                 6                 4     0.01     0.00
        10         39              122.402             81630               9517007                 6                42     0.01     0.05
        10         39              132.359            111082               9528254             38190             37018    34.38    33.32
        10         39              142.368             71612               9636036              1309              1318     1.83     1.84
        10         39              152.466             66791               9663690              1152              1086     1.72     1.63
        10         39              202.453             50412               9745668               999               707     1.98     1.40
        10         39              302.454             34045               9828326               429               427     1.26     1.25
        10         39              402.480             25851               9868703               319               328     1.23     1.27
        10         39              502.407             20931               9877533               273               271     1.30     1.29
        10         39              602.440             17602               9878219               224               225     1.27     1.28
        10         39              702.465             15254               9915818               192               196     1.26     1.28
        10         39              802.452             13462               9930026               161               158     1.20     1.17
        10         39              902.461             12089               9933506               144               138     1.19     1.14
        10         39             1002.483             10988               9942937               130               130     1.18     1.18
        



I'm a bit curious on why RESIDENCY_THRESHOLD_NS is hardcoded to 15 us? Would it make sense to make this value tunable instead?

I actually have a patch that makes it configurable via sysfs to include it along with this change if you think that would be useful. 

Looking forward for your comments on this.

Thanks and regards,
Aboorva

[1]: https://github.com/AboorvaDevarajan/linux-utils/tree/main/cpuidle/cpuidle_wakeup

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ