Message-ID: <6ae997ca-9712-4497-b666-11b976b1816d@arm.com>
Date: Thu, 19 Sep 2024 09:49:34 +0100
From: Christian Loehle <christian.loehle@....com>
To: Aboorva Devarajan <aboorvad@...ux.ibm.com>, rafael@...nel.org,
 daniel.lezcano@...aro.org, linux-pm@...r.kernel.org,
 linux-kernel@...r.kernel.org
Cc: gautam@...ux.ibm.com
Subject: Re: [PATCH 0/1] cpuidle/menu: Address performance drop from favoring
 physical over polling cpuidle state

On 9/19/24 06:02, Aboorva Devarajan wrote:
> On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote:
> 
>> On 8/20/24 09:51, Aboorva Devarajan wrote:
>>> On Tue, 2024-08-13 at 13:56 +0100, Christian Loehle wrote:
>>>
> ...
>> The wakeup source(s), since they don't seem to be timer events, would be
>> interesting, although a bit of a hassle to get right.
>> What's the workload anyway?
>>
> 
> Hi Christian,
> 
> Apologies for the delayed response.

No worries!

> 
> I identified the part of the internal workload responsible for performance
> improvement with the patch and it appears to be the OLTP queries, and yes
> the wakeup sources are not timer events.
> 
> ---------------------------------------------------------------------
> 
> Additionally, to reproduce the issue with a more open and accessible
> benchmark, I conducted similar experiments using pgbench and observed
> similar improvements with the patch, particularly when running the
> read-intensive SELECT query benchmarks.
> 
> ---------------------------------------------------------------------
> System Details
> ---------------------------------------------------------------------
> $ lscpu
> Architecture:             ppc64le
>   Byte Order:             Little Endian
> CPU(s):                   224
>   On-line CPU(s) list:    0-223
> Model name:               POWER10 (architected), altivec supported
>   Model:                  2.0 (pvr 0080 0200)
>   Thread(s) per core:     8
>   Core(s) per socket:     3
>   Socket(s):              8
> Virtualization features:  
>   Hypervisor vendor:      pHyp
>   Virtualization type:    para
> ---------------------------------------------------------------------
> 
> $ cpupower idle-info
> CPUidle driver: pseries_idle
> CPUidle governor: menu
> analyzing CPU 0:
> 
> Number of idle states: 2
> Available idle states: snooze CEDE
> snooze:
> Flags/Description: snooze
> 
> Latency: 0
> Residency: 0
> Usage: 6229
> Duration: 402142
> CEDE:
> Flags/Description: CEDE
> Latency: 12
> Residency: 120
> Usage: 191411
> Duration: 36329999037
> 
> ---------------------------------------------------------------------
> PostgreSQL Benchmark:
> ---------------------------------------------------------------------
> 
> I ran pgbench with 224 clients and 20 threads for 600 seconds,
> performing only SELECT queries against the pgbench database to 
> evaluate performance under read-intensive workloads:
> 
> $ pgbench -c 224 -j 20 -T 600 -S pgbench
> 
> Latency:
> 
> |---|-------------|------------|------------|
> | # | Before (ms) | After (ms) | Change (%) |
> |Run| Patch       | Patch      |            |
> |---|-------------|------------|------------|
> | 1 | 0.343       | 0.287      | -16.31%    |
> | 2 | 0.334       | 0.286      | -14.37%    |
> | 3 | 0.333       | 0.286      | -14.11%    |
> | 4 | 0.341       | 0.288      | -15.55%    |
> | 5 | 0.342       | 0.288      | -15.79%    |
> |---|-------------|------------|------------|
> 
> Latency Reduction: After applying the patch, the latency decreased
> by 14% to 16% across multiple runs.
> 
> Throughput (transactions per second):
> 
> |---|-------------|------------|------------|
> | # | Before      | After      | Change (%) |
> |Run| Patch       | Patch      |            |
> |---|-------------|------------|------------|
> | 1 | 652,544.23  | 780,613.42 | +19.63%    |
> | 2 | 670,243.45  | 784,348.44 | +17.04%    |
> | 3 | 673,495.39  | 784,458.12 | +16.48%    |
> | 4 | 656,609.16  | 778,531.20 | +18.57%    |
> | 5 | 654,644.52  | 778,322.88 | +18.88%    |
> |---|-------------|------------|------------|

Do you happen to have the idle stats here, too?
Especially above/below; see my comment below.

> 
> Transactions per second Improvement: The patch led to an increase in
> TPS, ranging from 16% to 19%.
> 
> This indicates that the patch significantly reduces latency while
> improving throughput (TPS). pgbench is an OLTP benchmark that doesn't
> rely on timer-based wakeups, so this is in line with the improvements
> I saw in the originally reported OLTP workload as well.
> 
> ---------------------------------------------------------------------
> Additional CPU Idle test with custom benchmark:
> ---------------------------------------------------------------------
> I also wrote a simple benchmark [1] to analyze the impact of cpuidle
> state selection, comparing timer-based wakeups and non-timer
> (pipe-based) wakeups.
> 
> This test involves a single waker thread periodically waking up a
> single wakee thread, simulating a repeating sleep-wake pattern. The
> test was run with both timer-based and pipe-based wakeups, and cpuidle
> statistics (usage, time, above, below) were collected.
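> 
> For reference, a minimal sketch of the pipe-based wakeup pattern is below.
> This is illustrative only and not the actual benchmark in [1], which also
> pins the waker and wakee to separate CPUs and collects the per-state
> cpuidle statistics (usage, time, above, below), e.g. from
> /sys/devices/system/cpu/cpu<N>/cpuidle/state<M>/:
> 
> /*
>  * Minimal pipe-wakeup sketch: the waker writes one byte roughly every
>  * SLEEP_US microseconds; the wakee blocks in read() and is woken by
>  * the write, so its CPU idles for ~SLEEP_US between wakeups.
>  */
> #include <pthread.h>
> #include <stdio.h>
> #include <unistd.h>
> 
> #define SLEEP_US 50
> #define WAKEUPS  100000
> 
> static int pipefd[2];
> 
> static void *waker(void *arg)
> {
> 	char c = 'w';
> 
> 	for (int i = 0; i < WAKEUPS; i++) {
> 		usleep(SLEEP_US);	/* idle period seen by the wakee */
> 		if (write(pipefd[1], &c, 1) != 1)
> 			break;
> 	}
> 	close(pipefd[1]);	/* EOF lets the wakee exit */
> 	return NULL;
> }
> 
> static void *wakee(void *arg)
> {
> 	char c;
> 
> 	while (read(pipefd[0], &c, 1) == 1)
> 		;	/* each read() blocks until the waker writes */
> 	return NULL;
> }
> 
> int main(void)
> {
> 	pthread_t t_wakee, t_waker;
> 
> 	if (pipe(pipefd)) {
> 		perror("pipe");
> 		return 1;
> 	}
> 	pthread_create(&t_wakee, NULL, wakee, NULL);
> 	pthread_create(&t_waker, NULL, waker, NULL);
> 	pthread_join(t_waker, NULL);
> 	pthread_join(t_wakee, NULL);
> 	return 0;
> }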

Thanks, pretty cool. Just quickly checking, it seems to be all there:
hrtimer, different cpusets for wakee and waker, the pipe wakeup.
That could be useful; the only thing missing is the license header ;)

> 
> Timer based wakeup:
> 
> +------------------------+---------------------+---------------------+
> | Metric                 | Without Patch       | With Patch          |
> +------------------------+---------------------+---------------------+
> | Wakee thread - CPU     | 110                 | 110                 |
> | Waker thread - CPU     | 20                  | 20                  |
> | Sleep Interval         | 50 us               | 50 us               |
> | Total Wakeups          | -                   | -                   |
> | Avg Wakeup Latency     | -                   | -                   |
> | Actual Sleep time      | 52.639 us           | 52.627 us           |
> +------------------------+---------------------+---------------------+
> | Idle State 0 Usage Diff| 94,879              | 94,879              |
> | Idle State 0 Time Diff | 4,700,323 ns        | 4,697,576 ns        |
> | Idle State 1 Usage Diff| 0                   | 0                   |
> | Idle State 1 Time Diff | 0 ns                | 0 ns                |
> +------------------------+---------------------+---------------------+
> | Total Above Usage      | 0 (0.00%)           | 0 (0.00%)           |
> +------------------------+---------------------+---------------------+
> | Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
> +------------------------+---------------------+---------------------+
> 
> In timer-based wakeups, the menu governor effectively predicts the idle
> duration both with and without the patch. This ensures that there are
> few or no instances of "Above" usage, allowing the CPU to remain in the
> correct idle state.
> 
> The condition (s->target_residency_ns <= data->next_timer_ns) in the menu
> governor ensures that a physical idle state is not prioritized when a
> timer event is expected before the target residency of the first physical
> idle state.
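> 
> For reference, the relevant logic in menu_select() looks roughly like the
> sketch below (paraphrased from drivers/cpuidle/governors/menu.c; the exact
> code differs between kernel versions):
> 
> 	/*
> 	 * Candidate state s has a longer target residency than the
> 	 * predicted idle time.  Normally the previously selected
> 	 * (shallower) state idx would be used, but if idx is a polling
> 	 * state the governor still promotes to the physical state s,
> 	 * unless a timer is due to fire before s's target residency.
> 	 */
> 	if (s->target_residency_ns > predicted_ns) {
> 		if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
> 		    s->target_residency_ns <= data->next_timer_ns) {
> 			predicted_ns = s->target_residency_ns;
> 			idx = i;	/* pick the physical state */
> 		}
> 		/* ... further checks before settling on idx ... */
> 	}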
> 
> As a result, the patch has no impact in this case, and performance
> remains stable with timer based wakeups.
> 
> Pipe based wakeup (non-timer wakeup):
> 
> +------------------------+---------------------+---------------------+
> | Metric                 | Without Patch       | With Patch          |
> +------------------------+---------------------+---------------------+
> | Wakee thread - CPU     | 110                 | 110                 |
> | Waker thread - CPU     | 20                  | 20                  |
> | Sleep Interval         | 50 us               | 50 us               |
> | Total Wakeups          | 97031               | 96583               |
> | Avg Wakeup Latency     | 7.070 us            | 4.879 us            |
> | Actual Sleep time      | 51.366 us           | 51.605 us           |
> +------------------------+---------------------+---------------------+
> | Idle State 0 Usage Diff| 1209                | 96,586              |
> | Idle State 0 Time Diff | 55,579 ns           | 4,510,003 ns        |
> | Idle State 1 Usage Diff| 95,826              | 5                   |
> | Idle State 1 Time Diff | 4,522,639 ns        | 198 ns              |
> +------------------------+---------------------+---------------------+
> | Total Above Usage      | 95,824 (98.75%)     | 5 (0.01%)           |
> +------------------------+---------------------+---------------------+
> | Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
> +------------------------+---------------------+---------------------+
> 
> In the pipe-based wakeup scenario, before the patch was applied, the
> "Above" metric was notably high, around 98.75%. This suggests that the
> menu governor frequently selected a deeper idle state like CEDE even
> when the idle period was relatively short.
> 
> This happened because the menu governor is inclined to prioritize the
> physical idle state (CEDE) even when that state's target residency
> (s->target_residency_ns) is longer than the predicted idle time
> (predicted_ns); with non-timer wakeups the next timer is far away, so
> the data->next_timer_ns check does not prevent this promotion.
> 
> In this test, despite the actual idle period being around 50 microseconds,
> the menu governor still chose the CEDE state, which has a target residency
> of 120 microseconds.

And the idle_miss statistics confirm that these were mostly wrong decisions
in hindsight.
I'll go through the menu code again, this indeed shouldn't be happening.
I'd be very surprised if upstream teo performed as badly (or actually badly
at all) on this, although it doesn't seem to help your actual workload.
Would you mind giving that a try too?
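
For reference, on recent kernels teo can usually be selected at runtime by
writing "teo" to /sys/devices/system/cpu/cpuidle/current_governor, or at
boot with cpuidle.governor=teo, provided CONFIG_CPU_IDLE_GOV_TEO is enabled.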

Regards,
Christian

> 
> ---------------------------------------------------------------------
> 
> While timer-based wakeups performed well even without the patch, workloads
> that don't have timers as wakeup source but have predictable idle durations
> shorter than the first idle state's target residency benefit significantly
> from the patch.
> 
> It would be helpful to understand why prioritizing deep physical idle
> states over shallow ones is considered more beneficial even for short
> idle periods that don't meet the target residency, as described above.
> 
> Is there something I could be missing here?
> 
> Any comments or suggestions will be really helpful.
> 
> [1] https://github.com/AboorvaDevarajan/linux-utils/tree/main/cpuidle/cpuidle_wakeup
> 
> Thanks,
> Aboorva
> 

