Message-ID: <20f31d99374ef1ffa7dc7fc1a609074009742e00.camel@linux.ibm.com>
Date: Mon, 21 Oct 2024 10:57:18 +0530
From: Aboorva Devarajan <aboorvad@...ux.ibm.com>
To: Christian Loehle <christian.loehle@....com>, rafael@...nel.org,
daniel.lezcano@...aro.org, linux-pm@...r.kernel.org,
linux-kernel@...r.kernel.org
Cc: gautam@...ux.ibm.com
Subject: Re: [PATCH 0/1] cpuidle/menu: Address performance drop from
favoring physical over polling cpuidle state
On Thu, 2024-09-19 at 09:49 +0100, Christian Loehle wrote:
Hi Christian,
> On 9/19/24 06:02, Aboorva Devarajan wrote:
> > On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote:
> >
> > > On 8/20/24 09:51, Aboorva Devarajan wrote:
> > > > On Tue, 2024-08-13 at 13:56 +0100, Christian Loehle wrote:
> > > >
> > ...
> > > The wakeup source(s), since they don't seem to be timer events would be
> > > interesting, although a bit of a hassle to get right.
> > > What's the workload anyway?
> > >
> >
> > I identified the part of the internal workload responsible for performance
> > improvement with the patch and it appears to be the OLTP queries, and yes
> > the wakeup sources are not timer events.
> >
> > ---------------------------------------------------------------------
> >
> > Additionally, to reproduce the issue using a more open and accessible
> > benchmark, I conducted similar experiments using pgbench and observed
> > similar improvements with the patch, particularly when running the
> > read-intensive SELECT query benchmarks.
> >
> > ---------------------------------------------------------------------
> > System Details
> > ---------------------------------------------------------------------
> > $ lscpu
> > Architecture: ppc64le
> > Byte Order: Little Endian
> > CPU(s): 224
> > On-line CPU(s) list: 0-223
> > Model name: POWER10 (architected), altivec supported
> > Model: 2.0 (pvr 0080 0200)
> > Thread(s) per core: 8
> > Core(s) per socket: 3
> > Socket(s): 8
> > Virtualization features:
> > Hypervisor vendor: pHyp
> > Virtualization type: para
> > ---------------------------------------------------------------------
> >
> > $ cpupower idle-info
> > CPUidle driver: pseries_idle
> > CPUidle governor: menu
> > analyzing CPU 0:
> >
> > Number of idle states: 2
> > Available idle states: snooze CEDE
> > snooze:
> > Flags/Description: snooze
> >
> > Latency: 0
> > Residency: 0
> > Usage: 6229
> > Duration: 402142
> > CEDE:
> > Flags/Description: CEDE
> > Latency: 12
> > Residency: 120
> > Usage: 191411
> > Duration: 36329999037
> >
> > ---------------------------------------------------------------------
> > PostgreSQL Benchmark:
> > ---------------------------------------------------------------------
> >
> > I ran pgbench with 224 clients and 20 threads for 600 seconds,
> > performing only SELECT queries against the pgbench database to
> > evaluate performance under read-intensive workloads:
> >
> > $ pgbench -c 224 -j 20 -T 600 -S pgbench
> >
> > Latency:
> >
> > > ---|-------------|------------|------------|
> > > # | Before (ms) | After (ms) | Change (%) |
> > > Run| Patch | Patch | |
> > > ---|-------------|------------|------------|
> > > 1 | 0.343 | 0.287 | -16.31% |
> > > 2 | 0.334 | 0.286 | -14.37% |
> > > 3 | 0.333 | 0.286 | -14.11% |
> > > 4 | 0.341 | 0.288 | -15.55% |
> > > 5 | 0.342 | 0.288 | -15.79% |
> > > ---|-------------|------------|------------|
> >
> > Latency Reduction: After applying the patch, the latency decreased
> > by 14% to 16% across multiple runs.
> >
> > Throughput per second:
> >
> > > ---|-------------|------------|------------|
> > > # | Before | After | Change (%) |
> > > Run| Patch | Patch | |
> > > ---|-------------|------------|------------|
> > > 1 | 652,544.23 | 780,613.42 | +19.63% |
> > > 2 | 670,243.45 | 784,348.44 | +17.04% |
> > > 3 | 673,495.39 | 784,458.12 | +16.48% |
> > > 4 | 656,609.16 | 778,531.20 | +18.57% |
> > > 5 | 654,644.52 | 778,322.88 | +18.88% |
> > > ---|-------------|------------|------------|
>
> Do you happen to have the idle stats here, too?
> Especially above/below see comment below.
I ran the benchmark again and collected the idle statistics on a different
system since I no longer have access to the previous one.
Here are the stats I gathered, including data for the menu, teo, and ladder governors:
Metric | Ladder | Teo | Menu | Menu Patched
--------------------------------------------------------------------------------------------
Transactions Processed | 42,902,188 | 48,499,709 | 42,653,477 | 48,648,867
Latency (ms) | 0.313 | 0.277 | 0.315 | 0.276
TPS | 715,340.96 | 808,672.59 | 711,123.13 | 811,137.40
--------------------------------------------------------------------------------------------
Total Usage Difference | 46,680,184 | 66,348,992 | 47,892,509 | 62,366,337
Usage Diff (CEDE) | 46,680,184 | 7,612,960 | 45,421,436 | 19,237,430
Usage Diff (Snooze) | 0 | 58,736,032 | 2,471,073 | 43,128,907
--------------------------------------------------------------------------------------------
Total Time Difference (s) | 5,552.20 | 5,242.75 | 5,534.62 | 5,238.46
Time Diff (CEDE) | 5,552.20 | 2,497.89 | 5,431.91 | 3,324.99
Time Diff (Snooze) | 0.00 | 2,744.86 | 102.71 | 1,913.47
--------------------------------------------------------------------------------------------
Total Above Diff | 40,381,398 | 4,520,998 | 38,942,725 | 15,072,345
Above Diff (CEDE) | 40,381,398 | 4,520,998 | 38,942,725 | 15,072,345
Above Diff (Snooze) | 0 | 0 | 0 | 0
--------------------------------------------------------------------------------------------
Total Below Diff | 0 | 12,362,392 | 405,357 | 8,172,845
Below Diff (CEDE) | 0 | 0 | 0 | 0
Below Diff (Snooze) | 0 | 12,362,392 | 405,357 | 8,172,845
--------------------------------------------------------------------------------------------
% Above w.r.t. Usage | 86.51% | 6.82% | 81.30% | 24.17%
--------------------------------------------------------------------------------------------
% Below w.r.t. Usage | 0% | 18.64% | 0.85% | 13.10%
--------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------
Note:
% Above w.r.t Usage = (Total Above Difference / Total Usage Difference) * 100
% Below w.r.t Usage = (Total Below Difference / Total Usage Difference) * 100
Menu, Teo, Ladder: This is with menu, teo and ladder governor enabled respectively on v6.12-rc1.
Menu Patched: This is with menu governor enabled on v6.12-rc +
https://lore.kernel.org/all/20240809073120.250974-2-aboorvad@linux.ibm.com/
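As a worked example of the formulas above, using the Menu Patched column:
15,072,345 / 62,366,337 * 100 comes to 24.17% (above) and
8,172,845 / 62,366,337 * 100 comes to 13.10% (below), matching the percentage rows above.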
--------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------
Inference:
--------------------------------------------------------------------------------------------
Transactions Processed: The number of transactions processed in Menu Patched is
14.06% higher compared to Menu.
Latency: There is a 12.38% reduction in latency in Menu Patched compared to Menu.
TPS (Transactions Per Second): The TPS in Menu Patched is 14.06% higher than in
Menu.
--------------------------------------------------------------------------------------------
While this patch hasn't completely eliminated the cpuidle miss ratio, it has
improved it; see the "% Above w.r.t. Usage" and "% Below w.r.t. Usage" rows above.
I'll keep investigating why there's still a 24% miss rate in the "above" and
13% in the "below" stats after the patch. This could be a different issue.
Additionally, I've seen good performance improvements using the teo governor
with the pgbench benchmark, although it didn't provide the same benefit in the
original test.
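For completeness, the per-state numbers above (usage, time, above, below) come from
the cpuidle sysfs counters, diffed before and after each run. Below is a minimal
sketch of that sampling; it is not the exact script I used, and the CPU number and
state count are placeholders for this system:

#include <stdio.h>

/*
 * Read one cpuidle statistic (usage, time, above or below) for a given
 * CPU and idle state from sysfs. Sampling these before and after a run
 * and taking the difference gives the "Usage/Time/Above/Below Diff" rows.
 */
static unsigned long long read_stat(int cpu, int state, const char *attr)
{
    char path[128];
    unsigned long long val = 0;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/%s",
             cpu, state, attr);
    f = fopen(path, "r");
    if (!f)
        return 0;
    if (fscanf(f, "%llu", &val) != 1)
        val = 0;
    fclose(f);
    return val;
}

int main(void)
{
    const char *attrs[] = { "usage", "time", "above", "below" };
    int cpu = 0;        /* placeholder: sample the CPU(s) the workload runs on */
    int nr_states = 2;  /* snooze + CEDE on this system */

    for (int state = 0; state < nr_states; state++)
        for (int i = 0; i < 4; i++)
            printf("cpu%d/state%d/%s: %llu\n", cpu, state, attrs[i],
                   read_stat(cpu, state, attrs[i]));
    return 0;
}

Running something like this before and after pgbench, and summing across the
relevant CPUs, is one way to produce the diffs shown in the table.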
>
> > Transactions per second Improvement: The patch led to an increase in
> > TPS, ranging from 16% to 19%.
> >
> > This indicates that the patch significantly reduces latency while
> > improving throughput (TPS). pgbench is an OLTP benchmark and doesn't
> > do timer-based wakeups; this is in line with the improvements
> > I saw in the originally reported OLTP workload as well.
> >
> > ---------------------------------------------------------------------
> > Additional CPU Idle test with custom benchmark:
> > ---------------------------------------------------------------------
> > I also wrote a simple benchmark [1] to analyze the impact of cpuidle
> > state selection, comparing timer-based wakeups and non-timer
> > (pipe-based) wakeups.
> >
> > This test involves a single waker thread periodically waking up a
> > single wakee thread, simulating a repeating sleep-wake pattern. The
> > test was run with both timer-based and pipe-based wakeups, and cpuidle
> > statistics (usage, time, above, below) were collected.
>
> Thanks, pretty cool and just quickly checking it seems to be all there:
> hrtimer, different cpusets for wakee and waker, the pipe wakeup.
> That could be useful, only thing missing is the license header ;)
Sure, I will add this. Thanks for pointing it out.
>
> > Timer based wakeup:
> >
> > +------------------------+---------------------+---------------------+
> > > Metric | Without Patch | With Patch |
> > +------------------------+---------------------+---------------------+
> > > Wakee thread - CPU | 110 | 110 |
> > > Waker thread - CPU | 20 | 20 |
> > > Sleep Interval | 50 us | 50 us |
> > > Total Wakeups | - | - |
> > > Avg Wakeup Latency | - | - |
> > > Actual Sleep time | 52.639 us | 52.627 us |
> > +------------------------+---------------------+---------------------+
> > > Idle State 0 Usage Diff| 94,879 | 94,879 |
> > > Idle State 0 Time Diff | 4,700,323 ns | 4,697,576 ns |
> > > Idle State 1 Usage Diff| 0 | 0 |
> > > Idle State 1 Time Diff | 0 ns | 0 ns |
> > +------------------------+---------------------+---------------------+
> > > Total Above Usage | 0 (0.00%) | 0 (0.00%) |
> > +------------------------+---------------------+---------------------+
> > > Total Below Usage | 0 (0.00%) | 0 (0.00%) |
> > +------------------------+---------------------+---------------------+
> >
> > In timer-based wakeups, the menu governor effectively predicts the idle
> > duration both with and without the patch. This ensures that there are
> > few or no instances of "Above" usage, allowing the CPU to remain in the
> > correct idle state.
> >
> > The condition (s->target_residency_ns <= data->next_timer_ns) in the menu
> > governor ensures that a physical idle state is not prioritized when a
> > timer event is expected before the target residency of the first physical
> > idle state.
> >
> > As a result, the patch has no impact in this case, and performance
> > remains stable with timer based wakeups.
> >
> > Pipe based wakeup (non-timer wakeup):
> >
> > +------------------------+---------------------+---------------------+
> > > Metric | Without Patch | With Patch |
> > +------------------------+---------------------+---------------------+
> > > Wakee thread - CPU | 110 | 110 |
> > > Waker thread - CPU | 20 | 20 |
> > > Sleep Interval | 50 us | 50 us |
> > > Total Wakeups | 97031 | 96583 |
> > > Avg Wakeup Latency | 7.070 us | 4.879 us |
> > > Actual Sleep time | 51.366 us | 51.605 us |
> > +------------------------+---------------------+---------------------+
> > > Idle State 0 Usage Diff| 1209 | 96,586 |
> > > Idle State 0 Time Diff | 55,579 ns | 4,510,003 ns |
> > > Idle State 1 Usage Diff| 95,826 | 5 |
> > > Idle State 1 Time Diff | 4,522,639 ns | 198 ns |
> > +------------------------+---------------------+---------------------+
> > +------------------------+---------------------+---------------------+
> > > **Total Above Usage** | 95,824 (98.75%) | 5 (0.01%) |
> > +------------------------+---------------------+---------------------+
> > +------------------------+---------------------+---------------------+
> > > Total Below Usage | 0 (0.00%) | 0 (0.00%) |
> > +------------------------+---------------------+---------------------+
> >
> > In the pipe-based wakeup scenario, before the patch was applied, the
> > "Above" metric was notably high around 98.75%. This suggests that the
> > menu governor frequently selected a deeper idle state like CEDE, even
> > when the idle period was relatively short.
> >
> > This happened because the menu governor is inclined to prioritize the
> > physical idle state (CEDE) even when the target residency time of the
> > physical idle state (s->target_residency_ns) was longer than the
> > predicted idle time (predicted_ns), so data->next_timer_ns won't be
> > relevant here in non-timer wakeups.
> >
> > In this test, despite the actual idle period being around 50 microseconds,
> > the menu governor still chose CEDE state, which has a target residency of
> > 120 microseconds.
>
> And the idle_miss statistics confirms that this was mostly wrong decisions
> in hindsight.
> I'll go through the menu code again, this indeed shouldn't be happening.
> I'd be very surprised if upstream teo performed as badly (or actually badly
> at all) on this, although it doesn't seem to help your actual workload,
> would you mind giving that a try too?
I ran a similar benchmark using the teo cpuidle governor; the values shown are
averaged across 10 runs (with low standard deviation). Below are the results for
a pipe-based wakeup with an approximately 50 microsecond sleep interval:
Pipe based wakeup with approx 50 us as sleep interval:
Metric Ladder Menu Teo Menu Patched
----------------------------------------------------------------------------------------
Wakeups 579,690 579,951 578,328 578,363
Avg wakeup latency (us) 7.456 7.112 4.46067 4.48733
========================================================================================
Sleep interval (us) 51.7333 51.7097 51.8533 51.8477
========================================================================================
State 0 Usage diff 0 7,324 578,361 578,407
State 0 Time diff (ns) 0 340,115 2.75953e+07 2.75696e+07
State 0 Above diff 0 0 0 0
State 0 Below diff 0 0 0.333333 0.666667
========================================================================================
State 1 Usage diff 580,114 572,644 0.666667 9.33333
State 1 Time diff (ns) 2.74189e+07 2.73901e+07 20.6667 445.333
State 1 Above diff 579,993 572,535 0.666667 9.33333
State 1 Below diff 0 0 0 0
Total Above 579,993 572,535 0.666667 9.33333
Total Below 0 0 0.333333 0.666667
========================================================================================
Total Above Percent 99.98% 98.7167% 0% 0% --> [1]
Total Below Percent 0% 0% 0% 0%
========================================================================================
Timer based wakeup with approx 50 us as sleep interval:
Metric Ladder Menu Teo Menu Patched
----------------------------------------------------------------------------------------
Sleep interval (us) 53.775 52.3687 52.746 52.327
========================================================================================
State 0 Usage diff 0 572,575 568,419 573,109
State 0 Time diff (ns) 0 2.84142e+07 2.81956e+07 2.84548e+07
State 0 Above diff 0 0 0 0
State 0 Below diff 0 0.333333 1 0.333333
========================================================================================
State 1 Usage diff 558,285 0.333333 0 0
State 1 Time diff (ns) 2.80965e+07 17 0 0
State 1 Above diff 558,215 0.333333 0 0
State 1 Below diff 0 0 0 0
========================================================================================
Total Above 558,215 0.333333 0 0
Total Below 0 0.333333 1 0.333333
========================================================================================
Total Above Percent 99.99% 0% 0% 0%
Total Below Percent 0% 0% 0% 0%
========================================================================================
[1] The results do show that the teo governor doesn't have this issue.
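For reference, here is a minimal sketch of the pipe-based wake pattern these runs
exercise. This is not the custom benchmark itself (that one also pins the waker and
wakee to separate CPUs, supports hrtimer-based wakeups, and diffs the cpuidle sysfs
counters); the iteration count and interval here are only illustrative:

#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 100000
#define SLEEP_NS   50000        /* ~50 us between wakeups */

static int pipefd[2];

/* Waker: sleep ~50 us, then wake the wakee by writing one byte to the pipe. */
static void *waker(void *arg)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = SLEEP_NS };
    char b = 'x';

    for (int i = 0; i < ITERATIONS; i++) {
        nanosleep(&ts, NULL);
        if (write(pipefd[1], &b, 1) != 1)
            break;
    }
    return NULL;
}

/*
 * Wakee: block in read() with no timer armed, so its CPU goes idle until
 * the pipe write arrives; this is the case where data->next_timer_ns does
 * not help the menu governor predict the idle duration.
 */
static void *wakee(void *arg)
{
    char b;

    for (int i = 0; i < ITERATIONS; i++)
        if (read(pipefd[0], &b, 1) != 1)
            break;
    return NULL;
}

int main(void)
{
    pthread_t t_wakee, t_waker;

    if (pipe(pipefd))
        return 1;
    pthread_create(&t_wakee, NULL, wakee, NULL);
    pthread_create(&t_waker, NULL, waker, NULL);
    pthread_join(t_waker, NULL);
    pthread_join(t_wakee, NULL);
    printf("completed %d pipe wakeups\n", ITERATIONS);
    return 0;
}

Built with gcc -pthread; diffing the above/below counters around such a run is what
produces the idle-miss percentages in the tables above.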
>
> Regards,
> Christian
>
> > ---------------------------------------------------------------------
> >
> > While timer-based wakeups performed well even without the patch, workloads
> > that don't have timers as wakeup source but have predictable idle durations
> > shorter than the first idle state's target residency benefit significantly
> > from the patch.
> >
> > It would be helpful to understand why prioritizing deep physical idle
> > states over shallow ones, even for short idle periods that don't meet
> > the target residency as mentioned above, is considered more beneficial.
> >
> > Is there something I could be missing here?
> >
> > Any comments or suggestions will be really helpful.
> >
> > [1] https://github.com/AboorvaDevarajan/linux-utils/tree/main/cpuidle/cpuidle_wakeup
> >
> > Thanks,
> > Aboorva
> >