Message-ID: <20f31d99374ef1ffa7dc7fc1a609074009742e00.camel@linux.ibm.com>
Date: Mon, 21 Oct 2024 10:57:18 +0530
From: Aboorva Devarajan <aboorvad@...ux.ibm.com>
To: Christian Loehle <christian.loehle@....com>, rafael@...nel.org,
daniel.lezcano@...aro.org, linux-pm@...r.kernel.org,
linux-kernel@...r.kernel.org
Cc: gautam@...ux.ibm.com
Subject: Re: [PATCH 0/1] cpuidle/menu: Address performance drop from
favoring physical over polling cpuidle state
On Thu, 2024-09-19 at 09:49 +0100, Christian Loehle wrote:
Hi Christian,
> On 9/19/24 06:02, Aboorva Devarajan wrote:
> > On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote:
> >
> > > On 8/20/24 09:51, Aboorva Devarajan wrote:
> > > > On Tue, 2024-08-13 at 13:56 +0100, Christian Loehle wrote:
> > > >
> > ...
> > > The wakeup source(s), since they don't seem to be timer events would be
> > > interesting, although a bit of a hassle to get right.
> > > What's the workload anyway?
> > >
> >
> > I identified the part of the internal workload responsible for performance
> > improvement with the patch and it appears to be the OLTP queries, and yes
> > the wakeup sources are not timer events.
> >
> > ---------------------------------------------------------------------
> >
> > Additionally, to reproduce the issue using a more open and accessible
> > benchmark, I conducted similar experiments using pgbench and observed
> > similar improvements with the patch, particularly when running the
> > read-intensive SELECT query benchmarks.
> >
> > ---------------------------------------------------------------------
> > System Details
> > ---------------------------------------------------------------------
> > $ lscpu
> > Architecture: ppc64le
> > Byte Order: Little Endian
> > CPU(s): 224
> > On-line CPU(s) list: 0-223
> > Model name: POWER10 (architected), altivec supported
> > Model: 2.0 (pvr 0080 0200)
> > Thread(s) per core: 8
> > Core(s) per socket: 3
> > Socket(s): 8
> > Virtualization features:
> > Hypervisor vendor: pHyp
> > Virtualization type: para
> > ---------------------------------------------------------------------
> >
> > $ cpupower idle-info
> > CPUidle driver: pseries_idle
> > CPUidle governor: menu
> > analyzing CPU 0:
> >
> > Number of idle states: 2
> > Available idle states: snooze CEDE
> > snooze:
> > Flags/Description: snooze
> >
> > Latency: 0
> > Residency: 0
> > Usage: 6229
> > Duration: 402142
> > CEDE:
> > Flags/Description: CEDE
> > Latency: 12
> > Residency: 120
> > Usage: 191411
> > Duration: 36329999037
> >
> > ---------------------------------------------------------------------
> > PostgreSQL Benchmark:
> > ---------------------------------------------------------------------
> >
> > I ran pgbench with 224 clients and 20 threads for 600 seconds,
> > performing only SELECT queries against the pgbench database to
> > evaluate performance under read-intensive workloads:
> >
> > $ pgbench -c 224 -j 20 -T 600 -S pgbench
> >
> > Latency:
> >
> > > ---|-------------|------------|------------|
> > > # | Before (ms) | After (ms) | Change (%) |
> > > Run| Patch | Patch | |
> > > ---|-------------|------------|------------|
> > > 1 | 0.343 | 0.287 | -16.31% |
> > > 2 | 0.334 | 0.286 | -14.37% |
> > > 3 | 0.333 | 0.286 | -14.11% |
> > > 4 | 0.341 | 0.288 | -15.55% |
> > > 5 | 0.342 | 0.288 | -15.79% |
> > > ---|-------------|------------|------------|
> >
> > Latency Reduction: After applying the patch, the latency decreased
> > by 14% to 16% across multiple runs.
> >
> > Throughput per second:
> >
> > > ---|-------------|------------|------------|
> > > # | Before | After | Change (%) |
> > > Run| Patch | Patch | |
> > > ---|-------------|------------|------------|
> > > 1 | 652,544.23 | 780,613.42 | +19.63% |
> > > 2 | 670,243.45 | 784,348.44 | +17.04% |
> > > 3 | 673,495.39 | 784,458.12 | +16.48% |
> > > 4 | 656,609.16 | 778,531.20 | +18.57% |
> > > 5 | 654,644.52 | 778,322.88 | +18.88% |
> > > ---|-------------|------------|------------|
>
> Do you happen to have the idle stats here, too?
> Especially above/below see comment below.
I ran the benchmark again and collected the idle statistics on a different
system since I no longer have access to the previous one.
Here are the stats I gathered, including data for the menu, teo, and ladder governors:
Metric | Ladder | Teo | Menu | Menu Patched
--------------------------------------------------------------------------------------------
Transactions Processed | 42,902,188 | 48,499,709 | 42,653,477 | 48,648,867
Latency (ms) | 0.313 | 0.277 | 0.315 | 0.276
TPS | 715,340.96 | 808,672.59 | 711,123.13 | 811,137.40
--------------------------------------------------------------------------------------------
Total Usage Difference | 46,680,184 | 66,348,992 | 47,892,509 | 62,366,337
Usage Diff (CEDE) | 46,680,184 | 7,612,960 | 45,421,436 | 19,237,430
Usage Diff (Snooze) | 0 | 58,736,032 | 2,471,073 | 43,128,907
--------------------------------------------------------------------------------------------
Total Time Difference (s) | 5,552.20 | 5,242.75 | 5,534.62 | 5,238.46
Time Diff (CEDE) | 5,552.20 | 2,497.89 | 5,431.91 | 3,324.99
Time Diff (Snooze) | 0.00 | 2,744.86 | 102.71 | 1,913.47
--------------------------------------------------------------------------------------------
Total Above Diff | 40,381,398 | 4,520,998 | 38,942,725 | 15,072,345
Above Diff (CEDE) | 40,381,398 | 4,520,998 | 38,942,725 | 15,072,345
Above Diff (Snooze) | 0 | 0 | 0 | 0
--------------------------------------------------------------------------------------------
Total Below Diff | 0 | 12,362,392 | 405,357 | 8,172,845
Below Diff (CEDE) | 0 | 0 | 0 | 0
Below Diff (Snooze) | 0 | 12,362,392 | 405,357 | 8,172,845
--------------------------------------------------------------------------------------------
% Above w.r.t. Usage | 86.51% | 6.82% | 81.30% | 24.17%
--------------------------------------------------------------------------------------------
% Below w.r.t. Usage | 0% | 18.64% | 0.85% | 13.10%
--------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------
Note:
% Above w.r.t Usage = (Total Above Difference / Total Usage Difference) * 100
% Below w.r.t Usage = (Total Below Difference / Total Usage Difference) * 100
Menu, Teo, Ladder: This is with menu, teo and ladder governor enabled respectively on v6.12-rc1.
Menu Patched: This is with menu governor enabled on v6.12-rc +
https://lore.kernel.org/all/20240809073120.250974-2-aboorvad@linux.ibm.com/
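As a worked example of the formulas above, using the Menu Patched column:
15,072,345 / 62,366,337 * 100 comes to 24.17% (above) and
8,172,845 / 62,366,337 * 100 comes to 13.10% (below), matching the percentage rows above.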
--------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------
Inference:
--------------------------------------------------------------------------------------------
Transactions Processed: The number of transactions processed in Menu Patched is
14.06% higher compared to Menu.
Latency: There is a 12.38% reduction in latency in Menu Patched compared to Menu.
TPS (Transactions Per Second): The TPS in Menu Patched is 14.06% higher than in
Menu.
--------------------------------------------------------------------------------------------
While this patch hasn't completely eliminated the cpuidle miss ratio, it has
improved it; see the "% Above w.r.t. Usage" and "% Below w.r.t. Usage" rows above.
I'll keep investigating why there's still a 24% miss rate in the "above" and
13% in the "below" stats after the patch. This could be a different issue.
Additionally, I've seen good performance improvements using the teo governor
with the pgbench benchmark, although it didn't provide the same benefit in the
original test.
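For completeness, the per-state numbers above (usage, time, above, below) come from
the cpuidle sysfs counters, diffed before and after each run. Below is a minimal
sketch of that sampling; it is not the exact script I used, and the CPU number and
state count are placeholders for this system:

#include <stdio.h>

/*
 * Read one cpuidle statistic (usage, time, above or below) for a given
 * CPU and idle state from sysfs. Sampling these before and after a run
 * and taking the difference gives the "Usage/Time/Above/Below Diff" rows.
 */
static unsigned long long read_stat(int cpu, int state, const char *attr)
{
    char path[128];
    unsigned long long val = 0;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/%s",
             cpu, state, attr);
    f = fopen(path, "r");
    if (!f)
        return 0;
    if (fscanf(f, "%llu", &val) != 1)
        val = 0;
    fclose(f);
    return val;
}

int main(void)
{
    const char *attrs[] = { "usage", "time", "above", "below" };
    int cpu = 0;        /* placeholder: sample the CPU(s) the workload runs on */
    int nr_states = 2;  /* snooze + CEDE on this system */

    for (int state = 0; state < nr_states; state++)
        for (int i = 0; i < 4; i++)
            printf("cpu%d/state%d/%s: %llu\n", cpu, state, attrs[i],
                   read_stat(cpu, state, attrs[i]));
    return 0;
}

Running something like this before and after pgbench, and summing across the
relevant CPUs, is one way to produce the diffs shown in the table.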
>
> > Transactions per second Improvement: The patch led to an increase in
> > TPS, ranging from 16% to 19%.
> >
> > This indicates that the patch significantly reduces latency while
> > improving throughput (TPS). pgbench is an OLTP benchmark and doesn't
> > do timer-based wakeups; this is in line with the improvements
> > I saw in the originally reported OLTP workload as well.
> >
> > ---------------------------------------------------------------------
> > Additional CPU Idle test with custom benchmark:
> > ---------------------------------------------------------------------
> > I also wrote a simple benchmark [1] to analyze the impact of cpuidle
> > state selection, comparing timer-based wakeups and non-timer
> > (pipe-based) wakeups.
> >
> > This test involves a single waker thread periodically waking up a
> > single wakee thread, simulating a repeating sleep-wake pattern. The
> > test was run with both timer-based and pipe-based wakeups, and cpuidle
> > statistics (usage, time, above, below) were collected.
>
> Thanks, pretty cool and just quickly checking it seems to be all there:
> hrtimer, different cpusets for wakee and waker, the pipe wakeup.
> That could be useful, only thing missing is the license header ;)
Sure, I will add this. Thanks for pointing it out.
>
> > Timer based wakeup:
> >
> > +------------------------+---------------------+---------------------+
> > > Metric | Without Patch | With Patch |
> > +------------------------+---------------------+---------------------+
> > > Wakee thread - CPU | 110 | 110 |
> > > Waker thread - CPU | 20 | 20 |
> > > Sleep Interval | 50 us | 50 us |
> > > Total Wakeups | - | - |
> > > Avg Wakeup Latency | - | - |
> > > Actual Sleep time | 52.639 us | 52.627 us |
> > +------------------------+---------------------+---------------------+
> > > Idle State 0 Usage Diff| 94,879 | 94,879 |
> > > Idle State 0 Time Diff | 4,700,323 ns | 4,697,576 ns |
> > > Idle State 1 Usage Diff| 0 | 0 |
> > > Idle State 1 Time Diff | 0 ns | 0 ns |
> > +------------------------+---------------------+---------------------+
> > > Total Above Usage | 0 (0.00%) | 0 (0.00%) |
> > +------------------------+---------------------+---------------------+
> > > Total Below Usage | 0 (0.00%) | 0 (0.00%) |
> > +------------------------+---------------------+---------------------+
> >
> > In timer-based wakeups, the menu governor effectively predicts the idle
> > duration both with and without the patch. This ensures that there are
> > few or no instances of "Above" usage, allowing the CPU to remain in the
> > correct idle state.
> >
> > The condition (s->target_residency_ns <= data->next_timer_ns) in the menu
> > governor ensures that a physical idle state is not prioritized when a
> > timer event is expected before the target residency of the first physical
> > idle state.
> >
> > As a result, the patch has no impact in this case, and performance
> > remains stable with timer based wakeups.
> >
> > Pipe based wakeup (non-timer wakeup):
> >
> > +------------------------+---------------------+---------------------+
> > > Metric | Without Patch | With Patch |
> > +------------------------+---------------------+---------------------+
> > > Wakee thread - CPU | 110 | 110 |
> > > Waker thread - CPU | 20 | 20 |
> > > Sleep Interval | 50 us | 50 us |
> > > Total Wakeups | 97031 | 96583 |
> > > Avg Wakeup Latency | 7.070 us | 4.879 us |
> > > Actual Sleep time | 51.366 us | 51.605 us |
> > +------------------------+---------------------+---------------------+
> > > Idle State 0 Usage Diff| 1209 | 96,586 |
> > > Idle State 0 Time Diff | 55,579 ns | 4,510,003 ns |
> > > Idle State 1 Usage Diff| 95,826 | 5 |
> > > Idle State 1 Time Diff | 4,522,639 ns | 198 ns |
> > +------------------------+---------------------+---------------------+
> > +------------------------+---------------------+---------------------+
> > > **Total Above Usage** | 95,824 (98.75%) | 5 (0.01%) |
> > +------------------------+---------------------+---------------------+
> > +------------------------+---------------------+---------------------+
> > > Total Below Usage | 0 (0.00%) | 0 (0.00%) |
> > +------------------------+---------------------+---------------------+
> >
> > In the pipe-based wakeup scenario, before the patch was applied, the
> > "Above" metric was notably high around 98.75%. This suggests that the
> > menu governor frequently selected a deeper idle state like CEDE, even
> > when the idle period was relatively short.
> >
> > This happened because the menu governor is inclined to prioritize the
> > physical idle state (CEDE) even when the target residency time of the
> > physical idle state (s->target_residency_ns) was longer than the
> > predicted idle time (predicted_ns), so data->next_timer_ns won't be
> > relevant here in non-timer wakeups.
> >
> > In this test, despite the actual idle period being around 50 microseconds,
> > the menu governor still chose CEDE state, which has a target residency of
> > 120 microseconds.
>
> And the idle_miss statistics confirms that this was mostly wrong decisions
> in hindsight.
> I'll go through the menu code again, this indeed shouldn't be happening.
> I'd be very surprised if upstream teo performed as badly (or actually badly
> at all) on this, although it doesn't seem to help your actual workload,
> would you mind giving that a try too?
I ran a similar benchmark using the teo cpuidle governor; the values shown are
averaged across 10 runs (with low standard deviation). Below are the results for
a pipe-based wakeup with an approximately 50 microsecond sleep interval:
Pipe based wakeup with approx 50 us as sleep interval:
Metric Ladder Menu Teo Menu Patched
----------------------------------------------------------------------------------------
Wakeups 579,690 579,951 578,328 578,363
Avg wakeup latency (us) 7.456 7.112 4.46067 4.48733
========================================================================================
Sleep interval (us) 51.7333 51.7097 51.8533 51.8477
========================================================================================
State 0 Usage diff 0 7,324 578,361 578,407
State 0 Time diff (ns) 0 340,115 2.75953e+07 2.75696e+07
State 0 Above diff 0 0 0 0
State 0 Below diff 0 0 0.333333 0.666667
========================================================================================
State 1 Usage diff 580,114 572,644 0.666667 9.33333
State 1 Time diff (ns) 2.74189e+07 2.73901e+07 20.6667 445.333
State 1 Above diff 579,993 572,535 0.666667 9.33333
State 1 Below diff 0 0 0 0
Total Above 579,993 572,535 0.666667 9.33333
Total Below 0 0 0.333333 0.666667
========================================================================================
Total Above Percent 99.98% 98.7167% 0% 0% --> [1]
Total Below Percent 0% 0% 0% 0%
========================================================================================
Timer based wakeup with approx 50 us as sleep interval:
Metric Ladder Menu Teo Menu Patched
----------------------------------------------------------------------------------------
Sleep interval (us) 53.775 52.3687 52.746 52.327
========================================================================================
State 0 Usage diff 0 572,575 568,419 573,109
State 0 Time diff (ns) 0 2.84142e+07 2.81956e+07 2.84548e+07
State 0 Above diff 0 0 0 0
State 0 Below diff 0 0.333333 1 0.333333
========================================================================================
State 1 Usage diff 558,285 0.333333 0 0
State 1 Time diff (ns) 2.80965e+07 17 0 0
State 1 Above diff 558,215 0.333333 0 0
State 1 Below diff 0 0 0 0
========================================================================================
Total Above 558,215 0.333333 0 0
Total Below 0 0.333333 1 0.333333
========================================================================================
Total Above Percent 99.99% 0% 0% 0%
Total Below Percent 0% 0% 0% 0%
========================================================================================
[1] The results do show that the teo governor doesn't have this issue.
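For reference, here is a minimal sketch of the pipe-based wake pattern these runs
exercise. This is not the custom benchmark itself (that one also pins the waker and
wakee to separate CPUs, supports hrtimer-based wakeups, and diffs the cpuidle sysfs
counters); the iteration count and interval here are only illustrative:

#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 100000
#define SLEEP_NS   50000        /* ~50 us between wakeups */

static int pipefd[2];

/* Waker: sleep ~50 us, then wake the wakee by writing one byte to the pipe. */
static void *waker(void *arg)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = SLEEP_NS };
    char b = 'x';

    for (int i = 0; i < ITERATIONS; i++) {
        nanosleep(&ts, NULL);
        if (write(pipefd[1], &b, 1) != 1)
            break;
    }
    return NULL;
}

/*
 * Wakee: block in read() with no timer armed, so its CPU goes idle until
 * the pipe write arrives; this is the case where data->next_timer_ns does
 * not help the menu governor predict the idle duration.
 */
static void *wakee(void *arg)
{
    char b;

    for (int i = 0; i < ITERATIONS; i++)
        if (read(pipefd[0], &b, 1) != 1)
            break;
    return NULL;
}

int main(void)
{
    pthread_t t_wakee, t_waker;

    if (pipe(pipefd))
        return 1;
    pthread_create(&t_wakee, NULL, wakee, NULL);
    pthread_create(&t_waker, NULL, waker, NULL);
    pthread_join(t_waker, NULL);
    pthread_join(t_wakee, NULL);
    printf("completed %d pipe wakeups\n", ITERATIONS);
    return 0;
}

Built with gcc -pthread; diffing the above/below counters around such a run is what
produces the idle-miss percentages in the tables above.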
>
> Regards,
> Christian
>
> > ---------------------------------------------------------------------
> >
> > While timer-based wakeups performed well even without the patch, workloads
> > that don't have timers as wakeup source but have predictable idle durations
> > shorter than the first idle state's target residency benefit significantly
> > from the patch.
> >
> > It would be helpful to understand why prioritizing deep physical idle
> > states over shallow ones, even for short idle periods that don't meet
> > the target residency as mentioned above, is considered more beneficial.
> >
> > Is there something I could be missing here?
> >
> > Any comments or suggestions will be really helpful.
> >
> > [1] https://github.com/AboorvaDevarajan/linux-utils/tree/main/cpuidle/cpuidle_wakeup
> >
> > Thanks,
> > Aboorva
> >