Message-ID: <93d9ffb2-482d-49e0-8c67-b795256d961a@arm.com>
Date: Tue, 13 Aug 2024 13:56:11 +0100
From: Christian Loehle <christian.loehle@....com>
To: Aboorva Devarajan <aboorvad@...ux.ibm.com>, rafael@...nel.org,
daniel.lezcano@...aro.org, linux-pm@...r.kernel.org,
linux-kernel@...r.kernel.org
Cc: gautam@...ux.ibm.com
Subject: Re: [PATCH 0/1] cpuidle/menu: Address performance drop from favoring
physical over polling cpuidle state
On 8/9/24 08:31, Aboorva Devarajan wrote:
> This patch aims to discuss a potential performance degradation that can occur
> in certain workloads when the menu governor prioritizes selecting a physical
> idle state over a polling state for short idle durations.
>
> Note: This patch is intended to showcase the performance degradation; applying
> it trades power efficiency for performance, so it could lead to increased
> power consumption.
>
Not really a menu expert, but at this point I don't know who dares call
themselves one.
The elephant in the room would be: Does teo work better for you?
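In case it helps with a quick comparison: switching at runtime should just be
a write to current_governor (assuming teo is built in, CONFIG_CPU_IDLE_GOV_TEO,
and your kernel exposes the writable attribute), e.g. with a trivial helper
like this untested sketch:

```c
/* Equivalent to: echo teo > /sys/devices/system/cpu/cpuidle/current_governor
 * Build: gcc -O2 -o set_governor set_governor.c ; run as root.
 */
#include <stdio.h>

int main(void)
{
	const char *path = "/sys/devices/system/cpu/cpuidle/current_governor";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return 1;
	}
	if (fputs("teo\n", f) == EOF || fclose(f) == EOF) {
		perror(path);
		return 1;
	}
	return 0;
}
```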
> ==================================================
> System details in which the degradation is observed:
>
> $ uname -r
> 6.10.0+
>
> $ lscpu
> Architecture: ppc64le
> Byte Order: Little Endian
> CPU(s): 160
> On-line CPU(s) list: 0-159
> Model name: POWER10 (architected), altivec supported
> Model: 2.0 (pvr 0080 0200)
> Thread(s) per core: 8
> Core(s) per socket: 3
> Socket(s): 6
> Physical sockets: 4
> Physical chips: 2
> Physical cores/chip: 6
> Virtualization features:
>   Hypervisor vendor: pHyp
>   Virtualization type: para
> Caches (sum of all):
>   L1d: 1.3 MiB (40 instances)
>   L1i: 1.9 MiB (40 instances)
>   L2: 40 MiB (40 instances)
>   L3: 160 MiB (40 instances)
> NUMA:
>   NUMA node(s): 6
>   NUMA node0 CPU(s): 0-31
>   NUMA node1 CPU(s): 32-71
>   NUMA node2 CPU(s): 72-79
>   NUMA node3 CPU(s): 80-87
>   NUMA node4 CPU(s): 88-119
>   NUMA node5 CPU(s): 120-159
>
>
> $ cpupower idle-info
> CPUidle driver: pseries_idle
> CPUidle governor: menu
> analyzing CPU 0:
>
> Number of idle states: 2
> Available idle states: snooze CEDE
> snooze:
> Flags/Description: snooze
> Latency: 0
> Residency: 0
> Usage: 6229
> Duration: 402142
> CEDE:
> Flags/Description: CEDE
> Latency: 12
> Residency: 120
> Usage: 191411
> Duration: 36329999037
>
> ==================================================
>
> The menu governor contains a condition that prefers a physical idle state,
> such as the CEDE state, over a polling state as long as its exit latency meets
> the latency requirements. This can lead to performance drops in workloads with
> frequent short idle periods.
>
> The specific condition which causes degradation is as below (menu governor):
>
> ```
> if (s->target_residency_ns > predicted_ns) {
>         ...
>         if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
>             s->exit_latency_ns <= latency_req &&
>             s->target_residency_ns <= data->next_timer_ns) {
>                 predicted_ns = s->target_residency_ns;
>                 idx = i;
>                 break;
>         }
>         ...
> }
> ...
> }
> ```
>
> This condition can cause the menu governor to choose the CEDE state on Power
> Systems (residency: 120 us, exit latency: 12 us) over a polling state, even
> when the predicted idle duration is much shorter than the target residency
> of the physical state. This misprediction leads to performance degradation
> in certain workloads.
>
So clearly the condition
s->target_residency_ns <= data->next_timer_ns
is supposed to prevent this, but data->next_timer_ns isn't accurate here.
Have you got any idea what it's usually set to in your workload?
Seems like your workload is timer-based, so the idle duration should be
predicted accurately.
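To make the failure mode concrete, here is a small userspace model of that
check, plugged with your CEDE numbers (120us target residency, 12us exit
latency). predicted_ns, latency_req and next_timer_ns below are made-up
values for illustration, not measurements from your system:

```c
/* Sketch only: mimics the quoted menu condition in userspace. */
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* snooze: the polling state currently held in idx */
	bool idx_is_polling = true;

	/* CEDE: the deeper state s being considered */
	uint64_t s_target_residency_ns = 120 * 1000;
	uint64_t s_exit_latency_ns     =  12 * 1000;

	uint64_t predicted_ns  = 25 * 1000;        /* ~25us, like the test */
	uint64_t latency_req   = UINT64_MAX;       /* assume: no PM QoS limit */
	uint64_t next_timer_ns = 4 * 1000 * 1000;  /* assume: timer 4ms away */

	const char *pick = "snooze";

	/*
	 * The quoted check: even though predicted_ns is well below CEDE's
	 * target residency, menu still promotes from the polling state to
	 * CEDE because the exit latency fits and the next timer is further
	 * away than CEDE's target residency.
	 */
	if (s_target_residency_ns > predicted_ns &&
	    idx_is_polling &&
	    s_exit_latency_ns <= latency_req &&
	    s_target_residency_ns <= next_timer_ns)
		pick = "CEDE";

	printf("predicted %" PRIu64 "us, next timer %" PRIu64 "us -> %s\n",
	       predicted_ns / 1000, next_timer_ns / 1000, pick);
	return 0;
}
```

With a short prediction but a distant next timer the check promotes to CEDE,
which matches the "Above" counts in your results below.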
> ==================================================
> Test Results
> ==================================================
>
> This issue can be clearly observed with the below test.
>
> A test with multiple wakee threads and a single waker thread was run to
> demonstrate this issue. The waker thread periodically wakes up the wakee
> threads after a specific sleep duration, creating a repeating sleep -> wake
> pattern. The test was run for a stipulated period, and cpuidle statistics
> were collected.
>
> ./cpuidle-test -a 0 -b 10 -b 20 -b 30 -b 40 -b 50 -b 60 -b 70 -r 20 -t 60
>
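Just to make sure I read the test right, the pattern is roughly the sketch
below? (My own reconstruction, not your cpuidle-test; it omits the CPU
affinity and most of the parameters your tool sets.)

```c
/* Rough sketch of the described wake pattern: one waker periodically
 * broadcasting to several wakee threads. Build: gcc -O2 -pthread waker.c
 */
#include <pthread.h>
#include <time.h>
#include <unistd.h>

#define NR_WAKEES	7
#define WAKE_PERIOD_US	20	/* assumed waker period, ~20-30us like the report */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static unsigned long generation;
static int stop;

static void *wakee(void *arg)
{
	unsigned long seen = 0;

	(void)arg;
	pthread_mutex_lock(&lock);
	while (!stop) {
		/* Sleep until the waker bumps the generation counter. */
		while (generation == seen && !stop)
			pthread_cond_wait(&cond, &lock);
		seen = generation;
	}
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t th[NR_WAKEES];
	time_t end = time(NULL) + 60;	/* run for about a minute */
	int i;

	for (i = 0; i < NR_WAKEES; i++)
		pthread_create(&th[i], NULL, wakee, NULL);

	while (time(NULL) < end) {
		usleep(WAKE_PERIOD_US);
		pthread_mutex_lock(&lock);
		generation++;
		pthread_cond_broadcast(&cond);
		pthread_mutex_unlock(&lock);
	}

	pthread_mutex_lock(&lock);
	stop = 1;
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);

	for (i = 0; i < NR_WAKEES; i++)
		pthread_join(th[i], NULL);
	return 0;
}
```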
> ==================================================
> Results (Baseline Kernel):
> ==================================================
> Wakee 0[PID 8295] affined to CPUs: 10,
> Wakee 2[PID 8297] affined to CPUs: 30,
> Wakee 3[PID 8298] affined to CPUs: 40,
> Wakee 1[PID 8296] affined to CPUs: 20,
> Wakee 4[PID 8299] affined to CPUs: 50,
> Wakee 5[PID 8300] affined to CPUs: 60,
> Wakee 6[PID 8301] affined to CPUs: 70,
> Waker[PID 8302] affined to CPUs: 0,
>
> |-----------------------------------|-------------------------|-----------------------------|
> | Metric | snooze | CEDE |
> |-----------------------------------|-------------------------|-----------------------------|
> | Usage | 47815 | 2030160 |
> | Above | 0 | 2030043 |
> | Below | 0 | 0 |
> | Time Spent (us) | 976317 (1.63%) | 51046474 (85.08%) |
> | Overall average sleep duration | 28.721 us | |
> | Overall average wakeup latency | 6.858 us | |
> |-----------------------------------|-------------------------|-----------------------------|
>
> In this test, without the patch, the CPU often enters the CEDE state for
> sleep durations of around 20-30 microseconds, even though the CEDE state's
> residency time is 120 microseconds. This happens because the menu governor
> prioritizes the physical idle state (CEDE) if its exit latency is within
> the latency limits. It also compares the target residency against
> next_timer_ns, which is more predictable but can be much farther off than
> the actual idle duration, instead of comparing against the predicted idle
> duration.
Ideally that shouldn't be the case though (next_timer_ns being farther off than
the actual idle duration).
> [snip]