Message-ID: <964275d1-5f31-4eef-8dcd-fa8f11dbb830@arm.com>
Date: Thu, 19 Sep 2024 15:07:09 +0100
From: Christian Loehle <christian.loehle@....com>
To: Aboorva Devarajan <aboorvad@...ux.ibm.com>, rafael@...nel.org,
 daniel.lezcano@...aro.org, linux-pm@...r.kernel.org,
 linux-kernel@...r.kernel.org
Cc: gautam@...ux.ibm.com
Subject: Re: [PATCH 0/1] cpuidle/menu: Address performance drop from favoring
 physical over polling cpuidle state

On 9/19/24 10:43, Christian Loehle wrote:
> On 9/19/24 09:49, Christian Loehle wrote:
>> On 9/19/24 06:02, Aboorva Devarajan wrote:
>>> On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote:
>>
>>>
>>> Timer based wakeup:
>>>
>>> +------------------------+---------------------+---------------------+
>>> | Metric                 | Without Patch       | With Patch          |
>>> +------------------------+---------------------+---------------------+
>>> | Wakee thread - CPU     | 110                 | 110                 |
>>> | Waker thread - CPU     | 20                  | 20                  |
>>> | Sleep Interval         | 50 us               | 50 us               |
>>> | Total Wakeups          | -                   | -                   |
>>> | Avg Wakeup Latency     | -                   | -                   |
>>> | Actual Sleep time      | 52.639 us           | 52.627 us           |
>>> +------------------------+---------------------+---------------------+
>>> | Idle State 0 Usage Diff| 94,879              | 94,879              |
>>> | Idle State 0 Time Diff | 4,700,323 ns        | 4,697,576 ns        |
>>> | Idle State 1 Usage Diff| 0                   | 0                   |
>>> | Idle State 1 Time Diff | 0 ns                | 0 ns                |
>>> +------------------------+---------------------+---------------------+
>>> | Total Above Usage      | 0 (0.00%)           | 0 (0.00%)           |
>>> +------------------------+---------------------+---------------------+
>>> | Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
>>> +------------------------+---------------------+---------------------+
>>>
>>> In timer-based wakeups, the menu governor accurately predicts the idle
>>> duration both with and without the patch. This ensures that there are
>>> few or no instances of "Above" usage, allowing the CPU to remain in the
>>> correct idle state.
>>>
>>> The condition (s->target_residency_ns <= data->next_timer_ns) in the menu
>>> governor ensures that a physical idle state is not prioritized when a
>>> timer event is expected before the target residency of the first physical
>>> idle state.
>>>
>>> As a result, the patch has no impact in this case, and performance
>>> remains stable with timer based wakeups.
>>>
>>> Pipe based wakeup (non-timer wakeup):
>>>
>>> +------------------------+---------------------+---------------------+
>>> | Metric                 | Without Patch       | With Patch          |
>>> +------------------------+---------------------+---------------------+
>>> | Wakee thread - CPU     | 110                 | 110                 |
>>> | Waker thread - CPU     | 20                  | 20                  |
>>> | Sleep Interval         | 50 us               | 50 us               |
>>> | Total Wakeups          | 97,031              | 96,583              |
>>> | Avg Wakeup Latency     | 7.070 us            | 4.879 us            |
>>> | Actual Sleep time      | 51.366 us           | 51.605 us           |
>>> +------------------------+---------------------+---------------------+
>>> | Idle State 0 Usage Diff| 1209                | 96,586              |
>>> | Idle State 0 Time Diff | 55,579 ns           | 4,510,003 ns        |
>>> | Idle State 1 Usage Diff| 95,826              | 5                   |
>>> | Idle State 1 Time Diff | 4,522,639 ns        | 198 ns              |
>>> +------------------------+---------------------+---------------------+
>>> | **Total Above Usage**  | 95,824 (98.75%)     | 5 (0.01%)           |
>>> +------------------------+---------------------+---------------------+
>>> | Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
>>> +------------------------+---------------------+---------------------+
>>>
>>> In the pipe-based wakeup scenario, before the patch was applied, the
>>> "Above" metric was notably high, around 98.75%. This suggests that the
>>> menu governor frequently selected a deeper idle state like CEDE even
>>> when the idle period was relatively short.
>>>
>>> This happened because the menu governor is inclined to prioritize the
>>> physical idle state (CEDE) even when its target residency
>>> (s->target_residency_ns) was longer than the predicted idle time
>>> (predicted_ns); data->next_timer_ns is not relevant here, since these
>>> are non-timer wakeups.
>>>
>>> In this test, despite the actual idle period being around 50 microseconds,
>>> the menu governor still chose the CEDE state, which has a target residency
>>> of 120 microseconds.
>>
>> And the idle_miss statistics confirm that these were mostly wrong decisions
>> in hindsight.
>> I'll go through the menu code again, this indeed shouldn't be happening.
>> I'd be very surprised if upstream teo performed as badly (or actually badly
>> at all) on this, although it doesn't seem to help your actual workload,
>> would you mind giving that a try too?
>>
>> Regards,
>> Christian
> 
> So with a workload as static as this, the recent interval detection of menu
> should take effect. It records the last 8 idle interval lengths and does some
> statistics to see whether they are reasonably stable, and uses those instead
> of the next timer event.
> While I wouldn't vouch for the statistics just yet, the results don't seem
> to be completely unreasonable.
> Would you mind a) printing some snapshots of these intervals and the
> resulting return value of get_typical_interval(), and b) trying the patch
> below? Setting these values to around 50us returns 50us for me as expected,
> so the governor should not select an idle state above that.
> 
> -->8--
> 
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -112,6 +112,8 @@ struct menu_device {
>         int             interval_ptr;
>  };
>  
> +static int intervals_dummy[] = {50, 52, 48, 50, 51, 49, 51, 51};

So just to illustrate this a bit more:
{51, 80, 90, 54, 72, 60, 80, 117} is insignificant, but decrementing the last:
{51, 80, 90, 54, 72, 60, 80, 116} is significant (75us returned)
That sounds about reasonable, but I think something like that would
also be observable for you. It's an unfortunate edge case.
I wouldn't want to mess with the ancient statistics (and silently break
an unknown number of setups). I much prefer the teo intercept approach, which
makes this idle-state dependent and doesn't try to get one formula right
for all setups.
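For reference, the significance check being discussed can be modeled in user
space. The sketch below approximates the statistics in get_typical_interval()
(significant when variance <= 400, i.e. stddev <= 20us, or when avg dominates
stddev, tested as avg^2 > 36 * variance; otherwise the maximum is discarded as
an outlier and the check repeats). It is a simplified model for experimenting
with interval sets, not the kernel code; rounding and overflow handling differ.

```c
#include <stdint.h>

#define INTERVALS 8

/*
 * User-space sketch of the statistics in menu's get_typical_interval().
 * Intervals are in microseconds. Returns the typical interval if the
 * set is significant, UINT32_MAX otherwise.
 */
static unsigned int typical_interval(const unsigned int *intervals)
{
	unsigned int thresh = UINT32_MAX;

	for (;;) {
		uint64_t sum = 0, variance = 0;
		unsigned int max = 0, avg, divisor = 0;
		int i;

		/* Average over the values below the outlier threshold */
		for (i = 0; i < INTERVALS; i++) {
			unsigned int v = intervals[i];
			if (v <= thresh) {
				sum += v;
				divisor++;
				if (v > max)
					max = v;
			}
		}
		avg = (unsigned int)(sum / divisor);

		/* Variance over the same values */
		for (i = 0; i < INTERVALS; i++) {
			unsigned int v = intervals[i];
			if (v <= thresh) {
				int64_t d = (int64_t)v - avg;
				variance += (uint64_t)(d * d);
			}
		}
		variance /= divisor;

		/* stddev small relative to avg, or small absolutely */
		if (((uint64_t)avg * avg > 36 * variance &&
		     divisor * 4 >= INTERVALS * 3) || variance <= 400)
			return avg;		/* significant */

		/* too many values already discarded: give up */
		if (divisor * 4 <= INTERVALS * 3)
			return UINT32_MAX;	/* insignificant */

		thresh = max - 1;		/* drop outliers, retry */
	}
}
```

With the dummy values from the patch above, {50, 52, 48, 50, 51, 49, 51, 51},
this returns 50, matching the "returns 50us for me" behavior described.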

That is all assuming that a) get_typical_interval() doesn't return a
significant interval length for you and b) replacing that with dummy
significant values then fixes it (and not the error being some menu
logic that I overlooked).

Thinking out loud regarding the menu logic:
If we have ~50us predicted interval and your state setup the code then
goes as follows:

predicted_ns = 50000
...
data->next_timer_ns = KTIME_MAX;
delta_tick = TICK_NSEC / 2;
data->bucket = which_bucket(KTIME_MAX, nr_iowaiters); // probably the highest bucket

And then we have the big for loop, with state 0 POLLING and state 1 CEDE
(target_residency_ns == 100000):
trivially we have for i = 0:
	if (idx == -1)                                                  
		idx = i; /* first enabled state */
and for i = 1:
	if (s->target_residency_ns > predicted_ns) { // True
...
		if (predicted_ns < TICK_NSEC) // True                      
			break;
and should end up with idx == 0 and *stop_tick = false, all good.
If my above assumptions in a) and b) do not hold then I must've made a
mistake reading the code just now.
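The walkthrough above can be condensed into a small sketch. It models only the
two checks quoted (the "first enabled state" assignment and the
predicted_ns < TICK_NSEC early exit) for a hypothetical two-state setup; the
real menu loop has further branches (disabled states, latency limits, and the
polling-preference rule this thread is about), so this is an illustration of
the expected path, not the governor itself.

```c
#include <stdbool.h>
#include <stdint.h>

#define TICK_NSEC (4 * 1000 * 1000LL)	/* assuming HZ=250, for illustration */

/* Minimal stand-in for a cpuidle state: only the field the two
 * quoted checks look at. */
struct state_sketch {
	int64_t target_residency_ns;
};

static int select_sketch(const struct state_sketch *states, int nstates,
			 int64_t predicted_ns, bool *stop_tick)
{
	int idx = -1, i;

	*stop_tick = true;
	for (i = 0; i < nstates; i++) {
		if (idx == -1)
			idx = i;	/* first enabled state */

		if (states[i].target_residency_ns > predicted_ns &&
		    predicted_ns < TICK_NSEC) {
			/* short predicted sleep: keep the tick, stop here */
			*stop_tick = false;
			break;
		}
		idx = i;
	}
	return idx;
}
```

With predicted_ns = 50000 and states {POLLING, CEDE with 100000ns residency},
this ends up with idx == 0 and *stop_tick == false, the outcome described above.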
