Message-ID: <96960c17eb8325973dcb907fa19533ee1130e2a2.camel@linux.ibm.com>
Date: Mon, 21 Oct 2024 13:32:15 +0530
From: Aboorva Devarajan <aboorvad@...ux.ibm.com>
To: Christian Loehle <christian.loehle@....com>, rafael@...nel.org,
        daniel.lezcano@...aro.org, linux-pm@...r.kernel.org,
        linux-kernel@...r.kernel.org
Cc: gautam@...ux.ibm.com, aboorvad@...ux.ibm.com
Subject: Re: [PATCH 0/1] cpuidle/menu: Address performance drop from
 favoring physical over polling cpuidle state

On Thu, 2024-09-19 at 15:38 +0100, Christian Loehle wrote:
> On 9/19/24 15:07, Christian Loehle wrote:
> > On 9/19/24 10:43, Christian Loehle wrote:
> > > On 9/19/24 09:49, Christian Loehle wrote:
> > > > On 9/19/24 06:02, Aboorva Devarajan wrote:
> > > > > On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote:
> > > > > Timer-based wakeup:
> > > > > 
> > > > > +------------------------+---------------------+---------------------+
> > > > > | Metric                 | Without Patch       | With Patch          |
> > > > > +------------------------+---------------------+---------------------+
> > > > > | Wakee thread - CPU     | 110                 | 110                 |
> > > > > | Waker thread - CPU     | 20                  | 20                  |
> > > > > | Sleep Interval         | 50 us               | 50 us               |
> > > > > | Total Wakeups          | -                   | -                   |
> > > > > | Avg Wakeup Latency     | -                   | -                   |
> > > > > | Actual Sleep time      | 52.639 us           | 52.627 us           |
> > > > > +------------------------+---------------------+---------------------+
> > > > > | Idle State 0 Usage Diff| 94,879              | 94,879              |
> > > > > | Idle State 0 Time Diff | 4,700,323 ns        | 4,697,576 ns        |
> > > > > | Idle State 1 Usage Diff| 0                   | 0                   |
> > > > > | Idle State 1 Time Diff | 0 ns                | 0 ns                |
> > > > > +------------------------+---------------------+---------------------+
> > > > > | Total Above Usage      | 0 (0.00%)           | 0 (0.00%)           |
> > > > > +------------------------+---------------------+---------------------+
> > > > > | Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
> > > > > +------------------------+---------------------+---------------------+
> > > > > 
> > > > > In timer-based wakeups, the menu governor predicts the idle duration
> > > > > accurately both with and without the patch. This ensures that there
> > > > > are few or no instances of "Above" usage, allowing the CPU to remain
> > > > > in the correct idle state.
> > > > > 
> > > > > The condition (s->target_residency_ns <= data->next_timer_ns) in the menu
> > > > > governor ensures that a physical idle state is not prioritized when a
> > > > > timer event is expected before the target residency of the first physical
> > > > > idle state.
> > > > > 
> > > > > As a result, the patch has no impact in this case, and performance
> > > > > remains stable with timer-based wakeups.
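> > > > >
> > > > > For reference, this is roughly the relevant block in menu_select()
> > > > > (paraphrased and trimmed from drivers/cpuidle/governors/menu.c; the
> > > > > exact code differs between kernel versions):
> > > > >
> > > > > 	if (s->target_residency_ns > predicted_ns) {
> > > > > 		/*
> > > > > 		 * Use a physical idle state, not busy polling, unless
> > > > > 		 * a timer is going to trigger soon enough.
> > > > > 		 */
> > > > > 		if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
> > > > > 		    s->exit_latency_ns <= latency_req &&
> > > > > 		    s->target_residency_ns <= data->next_timer_ns) {
> > > > > 			predicted_ns = s->target_residency_ns;
> > > > > 			idx = i;
> > > > > 			break;
> > > > > 		}
> > > > > 	}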
> > > > > 
> > > > > Pipe-based wakeup (non-timer wakeup):
> > > > > 
> > > > > +------------------------+---------------------+---------------------+
> > > > > | Metric                 | Without Patch       | With Patch          |
> > > > > +------------------------+---------------------+---------------------+
> > > > > | Wakee thread - CPU     | 110                 | 110                 |
> > > > > | Waker thread - CPU     | 20                  | 20                  |
> > > > > | Sleep Interval         | 50 us               | 50 us               |
> > > > > | Total Wakeups          | 97,031              | 96,583              |
> > > > > | Avg Wakeup Latency     | 7.070 us            | 4.879 us            |
> > > > > | Actual Sleep time      | 51.366 us           | 51.605 us           |
> > > > > +------------------------+---------------------+---------------------+
> > > > > | Idle State 0 Usage Diff| 1,209               | 96,586              |
> > > > > | Idle State 0 Time Diff | 55,579 ns           | 4,510,003 ns        |
> > > > > | Idle State 1 Usage Diff| 95,826              | 5                   |
> > > > > | Idle State 1 Time Diff | 4,522,639 ns        | 198 ns              |
> > > > > +------------------------+---------------------+---------------------+
> > > > > | **Total Above Usage**  | 95,824 (98.75%)     | 5 (0.01%)           |
> > > > > +------------------------+---------------------+---------------------+
> > > > > | Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
> > > > > +------------------------+---------------------+---------------------+
> > > > > 
> > > > > In the pipe-based wakeup scenario, before the patch was applied, the
> > > > > "Above" metric was notably high, at around 98.75%. This suggests that
> > > > > the menu governor frequently selected a deeper idle state like CEDE
> > > > > even when the idle period was relatively short.
> > > > > 
> > > > > This happens because the menu governor is inclined to prioritize the
> > > > > physical idle state (CEDE) even when its target residency
> > > > > (s->target_residency_ns) is longer than the predicted idle time
> > > > > (predicted_ns); data->next_timer_ns is not relevant here, since these
> > > > > are non-timer wakeups.
> > > > > 
> > > > > In this test, despite the actual idle period being around 50
> > > > > microseconds, the menu governor still chose the CEDE state, which has
> > > > > a target residency of 120 microseconds.
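> > > > >
> > > > > As a rough sketch of how the Above/Below counters are accounted
> > > > > (simplified from drivers/cpuidle/cpuidle.c; the real code also skips
> > > > > disabled states and subtracts the measured exit latency):
> > > > >
> > > > > 	/* after wakeup: sleep_ns is the measured idle time */
> > > > > 	if (sleep_ns < drv->states[entered].target_residency_ns) {
> > > > > 		/* slept shorter than the state's break-even time,
> > > > > 		 * so a shallower state would have been better */
> > > > > 		dev->states_usage[entered].above++;
> > > > > 	} else if (entered + 1 < drv->state_count &&
> > > > > 		   sleep_ns >= drv->states[entered + 1].target_residency_ns) {
> > > > > 		/* slept long enough for the next deeper state */
> > > > > 		dev->states_usage[entered].below++;
> > > > > 	}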
> > > > 
> > > > And the idle_miss statistics confirm that these were mostly wrong
> > > > decisions in hindsight.
> > > > I'll go through the menu code again; this indeed shouldn't be happening.
> > > > I'd be very surprised if upstream teo performed as badly (or actually
> > > > badly at all) on this. Although it doesn't seem to help your actual
> > > > workload, would you mind giving that a try too?
> > > > 
> > > > Regards,
> > > > Christian
> > > 
> > > So with a workload as static as this, the recent interval detection of
> > > menu should take effect. It records the last 8 idle interval lengths and
> > > does some statistics to see if they are somewhat stable (a simplified
> > > model is sketched below), and uses those instead of the next timer event.
> > > While I wouldn't vouch for the statistics just yet, the results don't
> > > seem to be completely unreasonable.
> > > Do you mind trying a) printing some snapshots of these intervals and the
> > > resulting return value of get_typical_interval()?
> > > b) trying this out? With these values set to around 50us,
> > > get_typical_interval() returns 50us for me as expected, and therefore
> > > should not lead to an idle state above that being selected.
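> > >
> > > For reference, a simplified standalone model of the statistic (units in
> > > microseconds; not the kernel code, which caps the discarded outliers
> > > slightly differently):
> > >
> > > #include <stdint.h>
> > > #include <stdio.h>
> > >
> > > #define INTERVALS 8
> > > #define INVALID   0xffffffffu
> > >
> > > static unsigned int typical_interval(const unsigned int *in)
> > > {
> > > 	unsigned int v[INTERVALS];
> > > 	int i;
> > >
> > > 	for (i = 0; i < INTERVALS; i++)
> > > 		v[i] = in[i];
> > >
> > > 	for (;;) {
> > > 		uint64_t sum = 0, var = 0;
> > > 		unsigned int avg;
> > > 		int count = 0, max_idx = -1;
> > >
> > > 		for (i = 0; i < INTERVALS; i++) {
> > > 			if (v[i] == INVALID)
> > > 				continue;
> > > 			sum += v[i];
> > > 			count++;
> > > 			if (max_idx < 0 || v[i] >= v[max_idx])
> > > 				max_idx = i;
> > > 		}
> > > 		/* give up once too many samples were discarded */
> > > 		if (count * 4 <= INTERVALS * 3)
> > > 			return INVALID;
> > >
> > > 		avg = sum / count;
> > > 		for (i = 0; i < INTERVALS; i++) {
> > > 			int64_t d;
> > >
> > > 			if (v[i] == INVALID)
> > > 				continue;
> > > 			d = (int64_t)v[i] - avg;
> > > 			var += (uint64_t)(d * d);
> > > 		}
> > > 		var /= count;
> > >
> > > 		/* "stable": stddev below avg/6, or variance tiny */
> > > 		if ((uint64_t)avg * avg > var * 36 || var <= 400)
> > > 			return avg;
> > >
> > > 		v[max_idx] = INVALID;	/* drop the largest, retry */
> > > 	}
> > > }
> > >
> > > int main(void)
> > > {
> > > 	unsigned int a[INTERVALS] = {50, 52, 48, 50, 51, 49, 51, 51};
> > >
> > > 	printf("%u\n", typical_interval(a));	/* prints 50 */
> > > 	return 0;
> > > }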
> > > 
> > > -->8--
> > > 
> > > --- a/drivers/cpuidle/governors/menu.c
> > > +++ b/drivers/cpuidle/governors/menu.c
> > > @@ -112,6 +112,8 @@ struct menu_device {
> > >         int             interval_ptr;
> > >  };
> > >  
> > > +static int intervals_dummy[] = {50, 52, 48, 50, 51, 49, 51, 51};
> > 
> > So just to illustrate this a bit more:
> > {51, 80, 90, 54, 72, 60, 80, 117} is insignificant, but decrementing the last:
> > {51, 80, 90, 54, 72, 60, 80, 116} is significant (75us returned).
> > That sounds about reasonable, but I think something like that would
> > also be observable for you. It's an unfortunate edge case.
> > I wouldn't want to mess with the ancient statistics (and silently break
> > an unknown number of setups). I much prefer the teo intercept approach
> > (rough idea sketched below), which makes this idle-state dependent and
> > doesn't try to get one formula right for all setups.
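> >
> > (A toy sketch of the intercept idea, assuming just two states; this is
> > not teo's actual bookkeeping, which decays its counters and bins
> > wakeups by target residency:)
> >
> > 	#include <stdint.h>
> >
> > 	#define NR_STATES 2
> >
> > 	/* per-state tallies of how past sleeps actually ended */
> > 	static unsigned int hits[NR_STATES];       /* slept >= residency */
> > 	static unsigned int intercepts[NR_STATES]; /* woke up early */
> >
> > 	static void account(uint64_t sleep_ns, int sel,
> > 			    const uint64_t *residency_ns)
> > 	{
> > 		if (sleep_ns >= residency_ns[sel])
> > 			hits[sel]++;
> > 		else
> > 			intercepts[sel]++;
> > 	}
> >
> > 	static int select_state(int timer_based_idx)
> > 	{
> > 		int i = timer_based_idx;
> >
> > 		/* if early wakeups dominate for the candidate, go
> > 		 * shallower, regardless of polling vs. physical */
> > 		while (i > 0 && intercepts[i] > hits[i])
> > 			i--;
> > 		return i;
> > 	}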
> > 
> > That is all assuming that a) get_typical_interval() doesn't return a
> > significant interval length for you and b) replacing that with dummy
> > significant values then fixes it (and not the error being some menu
> > logic that I overlooked).
> > 
> > Thinking out loud regarding the menu logic:
> > If we have a ~50us predicted interval and your state setup, the code
> > then goes as follows:
> > 
> > predicted_ns = 50000
> > ...
> > data->next_timer_ns = KTIME_MAX; 
> > delta_tick = TICK_NSEC / 2;  
> > data->bucket = which_bucket(KTIME_MAX, nr_iowaiters); // bucket highest state probably
> > 
> > And then we have the big for loop for state 0 POLLING and state 1 CEDE with target_residency_ns == 120000:
> > trivially we have for i = 0:
> > 	if (idx == -1)                                                  
> > 		idx = i; /* first enabled state */
> > and for i = 1:
> > 	if (s->target_residency_ns > predicted_ns) { // True
> > ...
> > 		if (predicted_ns < TICK_NSEC) // True                      
> > 			break;
> > and should end up with idx == 0 and *stop_tick = false, all good.
> > If my above assumptions in a) and b) do not hold then I must've made a
> > mistake reading the code just now.
> > 
> 
> And of course I realise right after sending that I did indeed make a mistake.
> RESIDENCY_THRESHOLD_NS is 15us, so the timer is actually polled and the
> fields are populated accurately, but the for loop also checks for polling at
> idx == 0. predicted_ns should still be the value from get_typical_interval(),
> but that is unable to ever enforce a polling state, making your proposed
> code change the only obvious fix.

> Essentially, with only one non-polling state, the get_typical_interval()
> logic might as well be skipped; it never affects anything.
> 
> If I'm correct this time then I'm not really sure how to proceed.
> On the one hand, if we trust predicted_ns we should indeed select a state
> with s->target_residency_ns <= predicted_ns, no matter if polling or not,
> but that would be quite a behaviour change for x86 users too.
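>
> (Hypothetically, that would boil down to a selection loop along these
> lines; shown only to make the behaviour change concrete, this is not
> the posted patch:)
>
> 	for (i = 0; i < drv->state_count; i++) {
> 		struct cpuidle_state *s = &drv->states[i];
>
> 		if (s->target_residency_ns > predicted_ns)
> 			break;	/* deeper states no longer fit */
>
> 		idx = i;	/* deepest fit, polling or physical alike */
> 	}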

That's right, so far this is the only solution I can think of that
improves performance, but I'm open to trying alternative suggestions,
if any.
> 
> I'm still curious why teo wouldn't solve your problem; its intercept
> logic doesn't care whether it selects a polling state or not.
> 
I confirmed that "teo" doesn't have the reported issue, and compared to
the vanilla menu governor, switching to "teo" does improve the
performance of some workloads, though not the original workload I
tested. I'll look into why that's the case.

However, we would still want to resolve the issue with the menu
governor, as it is the default governor currently enabled and is what
has been tested so far, while switching to "teo" could be an option in
the future.

Thanks,
Aboorva

