Message-ID: <9e5ef8ab0f0f3e7cb128291cd60591e3d07b33e4.camel@linux.ibm.com>
Date: Tue, 20 Aug 2024 14:21:36 +0530
From: Aboorva Devarajan <aboorvad@...ux.ibm.com>
To: Christian Loehle <christian.loehle@....com>, rafael@...nel.org,
        daniel.lezcano@...aro.org, linux-pm@...r.kernel.org,
        linux-kernel@...r.kernel.org
Cc: gautam@...ux.ibm.com, aboorvad@...ux.ibm.com
Subject: Re: [PATCH 0/1] cpuidle/menu: Address performance drop from
 favoring physical over polling cpuidle state

On Tue, 2024-08-13 at 13:56 +0100, Christian Loehle wrote:

Hi Christian,

Thanks a lot for your comments.
...
> On 8/9/24 08:31, Aboorva Devarajan wrote:
> > This patch aims to discuss a potential performance degradation that can occur
> > in certain workloads when the menu governor prioritizes selecting a physical
> > idle state over a polling state for short idle durations. 
> > 
> > Note: This patch is intended to showcase a performance degradation; applying
> > it could increase power consumption, since it trades power efficiency for
> > performance.
> > 
> 
> Not really a menu expert, but at this point I don't know who dares call
> themselves one.
> The elephant in the room would be: Does teo work better for you?
> 

I ran some tests with the teo governor enabled, but it didn’t make a
lot of difference. The results are presented below.

> > ==================================================
> > Details of the system on which the degradation is observed:
> > 
> > $ uname -r
> > 6.10.0+
> > 
> > $ lscpu
> > Architecture:             ppc64le
> >   Byte Order:             Little Endian
> > CPU(s):                   160
> >   On-line CPU(s) list:    0-159
> > Model name:               POWER10 (architected), altivec supported
> >   Model:                  2.0 (pvr 0080 0200)
> >   Thread(s) per core:     8
> >   Core(s) per socket:     3
> >   Socket(s):              6
> >   Physical sockets:       4
> >   Physical chips:         2
> >   Physical cores/chip:    6
> > Virtualization features:
> >   Hypervisor vendor:      pHyp
> >   Virtualization type:    para
> > Caches (sum of all):
> >   L1d:                    1.3 MiB (40 instances)
> >   L1i:                    1.9 MiB (40 instances)
> >   L2:                     40 MiB (40 instances)
> >   L3:                     160 MiB (40 instances)
> > NUMA:
> >   NUMA node(s):           6
> >   NUMA node0 CPU(s):      0-31
> >   NUMA node1 CPU(s):      32-71
> >   NUMA node2 CPU(s):      72-79
> >   NUMA node3 CPU(s):      80-87
> >   NUMA node4 CPU(s):      88-119
> >   NUMA node5 CPU(s):      120-159
> > 
> > 
> > $ cpupower idle-info
> > CPUidle driver: pseries_idle
> > CPUidle governor: menu
> > analyzing CPU 0:
> > 
> > Number of idle states: 2
> > Available idle states: snooze CEDE
> > snooze:
> > Flags/Description: snooze
> > Latency: 0
> > Residency: 0
> > Usage: 6229
> > Duration: 402142
> > CEDE:
> > Flags/Description: CEDE
> > Latency: 12
> > Residency: 120
> > Usage: 191411
> > Duration: 36329999037
> > 
> > ==================================================
> > 
> > The menu governor contains a condition that selects a physical idle state,
> > such as CEDE, over a polling state whenever the physical state's exit
> > latency meets the latency requirements. This can lead to performance drops
> > in workloads with frequent short idle periods.
> > 
> > The specific condition that causes the degradation is shown below (menu governor):
> > 
> > ```
> > if (s->target_residency_ns > predicted_ns) {
> >     ...
> >     if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
> >         s->exit_latency_ns <= latency_req &&
> >         s->target_residency_ns <= data->next_timer_ns) {
> >         predicted_ns = s->target_residency_ns;
> >         idx = i;
> >         break;
> >     }
> >     ...
> > }
> > ```
> > 
> > This condition can cause the menu governor to choose the CEDE state on Power
> > Systems (residency: 120 us, exit latency: 12 us) over a polling state, even
> > when the predicted idle duration is much shorter than the target residency
> > of the physical state. This misprediction leads to performance degradation
> > in certain workloads.
> > 
> 
> So clearly the condition
> s->target_residency_ns <= data->next_timer_ns)
> is supposed to prevent this, but data->next_timer_ns isn't accurate,
> have you got any idea what it's set to in your workload usually?
> Seems like your workload is timer-based, so the idle duration should be
> predicted accurately.
> 

Yes, that's right. Ideally this condition should have prevented this, but
`data->next_timer_ns` is almost always greater than the actual idle
duration, which seems inaccurate.
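
To illustrate the selection, here is a minimal userspace replay of the
quoted condition (my own simplification, not the kernel code) using the
snooze/CEDE numbers from cpupower idle-info above; even for a ~25 us
predicted idle it promotes to CEDE:

```
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct state {
	const char *name;
	int64_t exit_latency_ns;
	int64_t target_residency_ns;
	bool polling;
};

int main(void)
{
	/* snooze (polling) and CEDE, per cpupower idle-info above */
	struct state states[] = {
		{ "snooze",     0,      0, true  },
		{ "CEDE",   12000, 120000, false },
	};
	int64_t predicted_ns  = 25000;     /* ~25 us short sleeps */
	int64_t latency_req   = 400000;    /* no tight latency constraint */
	int64_t next_timer_ns = 10000000;  /* next timer 10 ms away */
	int idx = 0;

	for (int i = 1; i < 2; i++) {
		struct state *s = &states[i];

		if (s->target_residency_ns > predicted_ns) {
			/* the quoted exception: promote from a polling state
			 * to the physical state if latency fits and the next
			 * timer is beyond the state's target residency */
			if (states[idx].polling &&
			    s->exit_latency_ns <= latency_req &&
			    s->target_residency_ns <= next_timer_ns) {
				predicted_ns = s->target_residency_ns;
				idx = i;
			}
			continue;
		}
		idx = i;
	}
	printf("selected: %s\n", states[idx].name);  /* prints CEDE */
	return 0;
}
```

Because next_timer_ns (10 ms here) overshoots the real ~25 us idle, the
residency check passes and CEDE is chosen; with an accurate estimate the
check would fail and snooze would be kept.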

> 
> > ==================================================
> > Test Results
> > ==================================================
> > 
> > This issue can be clearly observed with the below test.
> > 
> > A test with multiple wakee threads and a single waker thread was run to
> > demonstrate this issue. The waker thread periodically wakes up the wakee
> > threads after a specific sleep duration, creating a repeating sleep -> wake
> > pattern. The test was run for a stipulated period, and cpuidle statistics
> > were collected.
> > 
> > ./cpuidle-test -a 0 -b 10 -b 20 -b 30 -b 40 -b 50 -b 60 -b 70 -r 20 -t 60
> > 
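
For reference, a minimal approximation of this pattern (a hypothetical
sketch, not the actual cpuidle-test source; the thread count, CPU
numbers, and ~25 us period mirror the output below) could look like:

```
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

#define NR_WAKEES 7
#define PERIOD_US 25	/* ~25 us sleeps, well under CEDE's 120 us residency */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static unsigned long generation;

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *wakee(void *arg)
{
	unsigned long seen = 0;

	pin_to_cpu((int)(long)arg);	/* CPUs 10,20,...,70 as in the output */
	for (;;) {
		pthread_mutex_lock(&lock);
		while (generation == seen)
			pthread_cond_wait(&cond, &lock);  /* CPU idles here */
		seen = generation;
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NR_WAKEES];

	for (long i = 0; i < NR_WAKEES; i++)
		pthread_create(&tid[i], NULL, wakee, (void *)((i + 1) * 10));

	pin_to_cpu(0);			/* waker on CPU 0, as in the output */
	for (;;) {
		usleep(PERIOD_US);	/* waker sleep duration */
		pthread_mutex_lock(&lock);
		generation++;
		pthread_cond_broadcast(&cond);
		pthread_mutex_unlock(&lock);
	}
}
```

Build with `gcc -O2 -pthread`; each wakee's CPU goes idle in
pthread_cond_wait() for roughly the waker's period, well below CEDE's
120 us target residency.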
> > ==================================================
> > Results (Baseline Kernel):
> > ==================================================
> > Wakee 0[PID 8295] affined to CPUs: 10,
> > Wakee 2[PID 8297] affined to CPUs: 30,
> > Wakee 3[PID 8298] affined to CPUs: 40,
> > Wakee 1[PID 8296] affined to CPUs: 20,
> > Wakee 4[PID 8299] affined to CPUs: 50,
> > Wakee 5[PID 8300] affined to CPUs: 60,
> > Wakee 6[PID 8301] affined to CPUs: 70,
> > Waker[PID 8302] affined to CPUs: 0,
> > 
> > |-----------------------------------|-------------------------|-----------------------------|
> > | Metric                            | snooze                  | CEDE                        |
> > |-----------------------------------|-------------------------|-----------------------------|
> > | Usage                             | 47815                   | 2030160                     |
> > | Above                             | 0                       | 2030043                     |
> > | Below                             | 0                       | 0                           |
> > | Time Spent (us)                   | 976317 (1.63%)          | 51046474 (85.08%)           |
> > | Overall average sleep duration    | 28.721 us               |                             |
> > | Overall average wakeup latency    | 6.858 us                |                             |
> > |-----------------------------------|-------------------------|-----------------------------|
> > 
> > In this test, without the patch, the CPU often enters the CEDE state for
> > sleep durations of around 20-30 microseconds, even though the CEDE state's
> > target residency is 120 microseconds. This happens because the menu governor
> > prioritizes the physical idle state (CEDE) if its exit latency is within
> > the latency limits. It also compares the target residency against
> > next_timer_ns, which, although more predictable, can be much farther off
> > from the actual idle duration than the predicted idle duration is.
> 
> Ideally that shouldn't be the case though (next_timer_ns being farther off
> than the actual idle duration)

I ran some experiments based on your suggestions. Below are the
relative average runtimes and percentage differences compared to
the base version:

For simplicity, I picked a single representative workload:

Note: the lower the % Difference, the better.
|---------------------------|-----------------|--------------|
| Configuration             | Average Runtime | % Difference |
|---------------------------|-----------------|--------------|
| Base (menu)               | 1.00            | 0.00%        |
| Base + Patch 1 (menu)     | 0.92            | -8.00%       |
| Base + Patch 2 (menu)     | 0.98            | -2.00%       |
| Base (teo)                | 1.01            | +1.00%       |
|---------------------------|-----------------|--------------|
Patch 1: https://lore.kernel.org/all/20240809073120.250974-2-aboorvad@linux.ibm.com/  
Patch 2: https://lore.kernel.org/all/c20a07e4-b9e6-4a66-80f5-63d679b17c3b@arm.com/

It seems that Patch 2 does provide a slight improvement in runtime, but
not as significant as Patch 1. Additionally, teo does not seem to help
in this case.

Regarding the condition `s->target_residency_ns <= data->next_timer_ns`,
it appears that `data->next_timer_ns` is consistently greater than both
the actual idle duration and `s->target_residency_ns`, so this condition
nearly always holds true, indicating some inaccuracy. I'll investigate
this further and follow up with more details.
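
As a first step, I plan to log the timer estimate against the measured
residency with a crude trace_printk (a hypothetical debug fragment, not
part of the posted patch; it assumes the ~v6.10 menu.c locals `data` and
`dev` and the fields data->next_timer_ns and dev->last_residency_ns):

```
/* Dropped into menu_update() in drivers/cpuidle/governors/menu.c,
 * after 'data' is initialized, to log how far the timer estimate
 * overshoots the last measured idle residency. */
trace_printk("menu: next_timer_ns=%lld last_residency_ns=%llu overshoot=%lld\n",
	     data->next_timer_ns, dev->last_residency_ns,
	     data->next_timer_ns - (s64)dev->last_residency_ns);
```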

Regards,
Aboorva

