Message-ID: <964275d1-5f31-4eef-8dcd-fa8f11dbb830@arm.com>
Date: Thu, 19 Sep 2024 15:07:09 +0100
From: Christian Loehle <christian.loehle@....com>
To: Aboorva Devarajan <aboorvad@...ux.ibm.com>, rafael@...nel.org,
daniel.lezcano@...aro.org, linux-pm@...r.kernel.org,
linux-kernel@...r.kernel.org
Cc: gautam@...ux.ibm.com
Subject: Re: [PATCH 0/1] cpuidle/menu: Address performance drop from favoring
physical over polling cpuidle state
On 9/19/24 10:43, Christian Loehle wrote:
> On 9/19/24 09:49, Christian Loehle wrote:
>> On 9/19/24 06:02, Aboorva Devarajan wrote:
>>> On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote:
>>
>>>
>>> Timer-based wakeup:
>>>
>>> +------------------------+---------------------+---------------------+
>>> | Metric                 | Without Patch       | With Patch          |
>>> +------------------------+---------------------+---------------------+
>>> | Wakee thread - CPU     | 110                 | 110                 |
>>> | Waker thread - CPU     | 20                  | 20                  |
>>> | Sleep Interval         | 50 us               | 50 us               |
>>> | Total Wakeups          | -                   | -                   |
>>> | Avg Wakeup Latency     | -                   | -                   |
>>> | Actual Sleep Time      | 52.639 us           | 52.627 us           |
>>> +------------------------+---------------------+---------------------+
>>> | Idle State 0 Usage Diff| 94,879              | 94,879              |
>>> | Idle State 0 Time Diff | 4,700,323 ns        | 4,697,576 ns        |
>>> | Idle State 1 Usage Diff| 0                   | 0                   |
>>> | Idle State 1 Time Diff | 0 ns                | 0 ns                |
>>> +------------------------+---------------------+---------------------+
>>> | Total Above Usage      | 0 (0.00%)           | 0 (0.00%)           |
>>> +------------------------+---------------------+---------------------+
>>> | Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
>>> +------------------------+---------------------+---------------------+
>>>
>>> In timer-based wakeups, the menu governor effectively predicts the idle
>>> duration both with and without the patch. This ensures that there are
>>> few or no instances of "Above" usage, allowing the CPU to remain in the
>>> correct idle state.
>>>
>>> The condition (s->target_residency_ns <= data->next_timer_ns) in the menu
>>> governor ensures that a physical idle state is not prioritized when a
>>> timer event is expected before the target residency of the first physical
>>> idle state.
>>>
>>> As a result, the patch has no impact in this case, and performance
>>> remains stable with timer-based wakeups.
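
To make that guard concrete: a stand-alone toy check of the condition
(not the kernel source; the 120us CEDE target residency is the value
from this system, described further below):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t cede_residency_ns = 120000; /* CEDE target residency */
	uint64_t next_timer_ns = 50000;      /* timer due in ~50us */

	/* menu promotes the physical state over polling only when no
	 * timer fires before that state's target residency: */
	bool promote = cede_residency_ns <= next_timer_ns;

	printf("promote to CEDE: %s\n", promote ? "yes" : "no"); /* "no" */
	return 0;
}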
>>>
>>> Pipe-based wakeup (non-timer wakeup):
>>>
>>> +------------------------+---------------------+---------------------+
>>> | Metric                 | Without Patch       | With Patch          |
>>> +------------------------+---------------------+---------------------+
>>> | Wakee thread - CPU     | 110                 | 110                 |
>>> | Waker thread - CPU     | 20                  | 20                  |
>>> | Sleep Interval         | 50 us               | 50 us               |
>>> | Total Wakeups          | 97,031              | 96,583              |
>>> | Avg Wakeup Latency     | 7.070 us            | 4.879 us            |
>>> | Actual Sleep Time      | 51.366 us           | 51.605 us           |
>>> +------------------------+---------------------+---------------------+
>>> | Idle State 0 Usage Diff| 1,209               | 96,586              |
>>> | Idle State 0 Time Diff | 55,579 ns           | 4,510,003 ns        |
>>> | Idle State 1 Usage Diff| 95,826              | 5                   |
>>> | Idle State 1 Time Diff | 4,522,639 ns        | 198 ns              |
>>> +------------------------+---------------------+---------------------+
>>> | **Total Above Usage**  | 95,824 (98.75%)     | 5 (0.01%)           |
>>> +------------------------+---------------------+---------------------+
>>> | Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
>>> +------------------------+---------------------+---------------------+
>>>
>>> In the pipe-based wakeup scenario, before the patch was applied, the
>>> "Above" metric was notably high around 98.75%. This suggests that the
>>> menu governor frequently selected a deeper idle state like CEDE, even
>>> when the idle period was relatively short.
>>>
>>> This happened because the menu governor is inclined to prioritize the
>>> physical idle state (CEDE) even when its target residency
>>> (s->target_residency_ns) is longer than the predicted idle time
>>> (predicted_ns); data->next_timer_ns is not relevant here, since these
>>> are non-timer wakeups.
>>>
>>> In this test, despite the actual idle period being around 50 microseconds,
>>> the menu governor still chose the CEDE state, which has a target residency
>>> of 120 microseconds.
>>
>> And the idle_miss statistics confirm that these were mostly wrong
>> decisions in hindsight.
>> I'll go through the menu code again; this indeed shouldn't be happening.
>> I'd be very surprised if upstream teo performed as badly (or actually
>> badly at all) on this, even though it doesn't seem to help your actual
>> workload. Would you mind giving that a try too?
>>
>> Regards,
>> Christian
>
> So with a workload as static as this, the recent interval detection of menu
> should take effect. It records the last 8 idle interval lengths and does some
> statistics to see if they are somewhat stable, and if so uses those instead
> of the next timer event.
> While I wouldn't vouch for the statistics just yet, the results don't seem
> to be completely unreasonable.
> Do you mind a) printing some snapshots of these intervals and the resulting
> return value of get_typical_interval(), and b) trying this out? Setting
> these values to around 50us returns 50us for me, as expected, and therefore
> an idle state above that should not be selected.
>
> -->8--
>
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -112,6 +112,8 @@ struct menu_device {
> int interval_ptr;
> };
>
> +static int intervals_dummy[] = {50, 52, 48, 50, 51, 49, 51, 51};
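
For b) to actually take effect, both reads of data->intervals[i] in
get_typical_interval() (the average loop and the variance loop) would
need to be switched over, and a) can be a trace_printk before the
significant "return avg;". Untested, context approximate:

 	for (i = 0; i < INTERVALS; i++) {
-		unsigned int value = data->intervals[i];
+		unsigned int value = intervals_dummy[i];

 		if ((((u64)avg*avg > variance*36) && (divisor * 4 >= INTERVALS * 3))
 							|| variance <= 400) {
+			trace_printk("typical interval: %u us\n", avg);
 			return avg;
 		}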
So just to illustrate this a bit more:
{51, 80, 90, 54, 72, 60, 80, 117} is insignificant, but decrementing the last:
{51, 80, 90, 54, 72, 60, 80, 116} is significant (75us returned).
That sounds about reasonable, but I think something like that would
also be observable for you. It's an unfortunate edge case.
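
For reference, what those arrays run through is roughly the following;
a simplified user-space re-implementation from memory, not the kernel
source (the 400 and 36 correspond to stddev <= 20us and avg > 6 * stddev):

#include <stdint.h>
#include <stdio.h>

#define INTERVALS 8

static unsigned int typical_interval(const unsigned int *intervals)
{
	unsigned int thresh = UINT32_MAX; /* outlier cutoff, us */

	for (;;) {
		uint64_t sum = 0, variance = 0;
		unsigned int max = 0, avg;
		int i, divisor = 0;

		/* average of the samples below the outlier cutoff */
		for (i = 0; i < INTERVALS; i++) {
			unsigned int value = intervals[i];

			if (value > thresh)
				continue;
			sum += value;
			divisor++;
			if (value > max)
				max = value;
		}
		avg = sum / divisor;

		/* variance of the same samples */
		for (i = 0; i < INTERVALS; i++) {
			int64_t diff;

			if (intervals[i] > thresh)
				continue;
			diff = (int64_t)intervals[i] - avg;
			variance += (uint64_t)(diff * diff);
		}
		variance /= divisor;

		/* "significant": stddev <= 20us, or avg > 6 * stddev
		 * with at least 3/4 of the samples surviving */
		if (((uint64_t)avg * avg > variance * 36 &&
		     divisor * 4 >= INTERVALS * 3) || variance <= 400)
			return avg;

		/* too noisy: drop the top sample and retry, unless
		 * fewer than 3/4 of the samples would remain */
		if (divisor * 4 <= INTERVALS * 3)
			return UINT32_MAX;
		thresh = max - 1;
	}
}

int main(void)
{
	unsigned int a[INTERVALS] = { 51, 80, 90, 54, 72, 60, 80, 116 };

	printf("typical: %u us\n", typical_interval(a)); /* 75 us */
	return 0;
}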
I wouldn't want to mess with the ancient statistics (and silently break
an unknown number of setups). I much prefer the teo intercept approach,
which makes this idle-state dependent and doesn't try to get one formula
right for all setups.
That is all assuming that a) get_typical_interval() doesn't return a
significant interval length for you and b) replacing that with dummy
significant values then fixes it (rather than the error being in some
menu logic that I overlooked).
Thinking out loud regarding the menu logic:
If we have a ~50us predicted interval and your state setup, the code then
goes as follows:
predicted_ns = 50000
...
data->next_timer_ns = KTIME_MAX;
delta_tick = TICK_NSEC / 2;
data->bucket = which_bucket(KTIME_MAX, nr_iowaiters); // probably the bucket of the highest state
And then we have the big for loop for state 0 (POLLING) and state 1 (CEDE, target_residency_ns == 120000):
trivially we have for i = 0:
	if (idx == -1)
		idx = i; /* first enabled state */
and for i = 1:
	if (s->target_residency_ns > predicted_ns) { // True
	...
		if (predicted_ns < TICK_NSEC) // True
			break;
and should end up with idx == 0 and *stop_tick = false, all good.
If my above assumptions in a) and b) do not hold, then I must've made a
mistake reading the code just now.