Message-ID: <46f92f23aaff45a67a95aa8c83e3cf00bb96d3cf.camel@linux.ibm.com>
Date: Mon, 21 Oct 2024 13:10:17 +0530
From: Aboorva Devarajan <aboorvad@...ux.ibm.com>
To: Christian Loehle <christian.loehle@....com>, rafael@...nel.org,
        daniel.lezcano@...aro.org, linux-pm@...r.kernel.org,
        linux-kernel@...r.kernel.org
Cc: gautam@...ux.ibm.com
Subject: Re: [PATCH 0/1] cpuidle/menu: Address performance drop from
 favoring physical over polling cpuidle state

On Thu, 2024-09-19 at 10:43 +0100, Christian Loehle wrote:

> On 9/19/24 09:49, Christian Loehle wrote:
> > On 9/19/24 06:02, Aboorva Devarajan wrote:
> > > On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote:
> > > Timer based wakeup:
> > > 
> > > +------------------------+---------------------+---------------------+
> > > | Metric                 | Without Patch       | With Patch          |
> > > +------------------------+---------------------+---------------------+
> > > | Wakee thread - CPU     | 110                 | 110                 |
> > > | Waker thread - CPU     | 20                  | 20                  |
> > > | Sleep Interval         | 50 us               | 50 us               |
> > > | Total Wakeups          | -                   | -                   |
> > > | Avg Wakeup Latency     | -                   | -                   |
> > > | Actual Sleep time      | 52.639 us           | 52.627 us           |
> > > +------------------------+---------------------+---------------------+
> > > | Idle State 0 Usage Diff| 94,879              | 94,879              |
> > > | Idle State 0 Time Diff | 4,700,323 ns        | 4,697,576 ns        |
> > > | Idle State 1 Usage Diff| 0                   | 0                   |
> > > | Idle State 1 Time Diff | 0 ns                | 0 ns                |
> > > +------------------------+---------------------+---------------------+
> > > | Total Above Usage      | 0 (0.00%)           | 0 (0.00%)           |
> > > +------------------------+---------------------+---------------------+
> > > | Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
> > > +------------------------+---------------------+---------------------+
> > > 
> > > In timer-based wakeups, the menu governor effectively predicts the idle
> > > duration both with and without the patch. This ensures that there are
> > > few or no instances of "Above" usage, allowing the CPU to remain in the
> > > correct idle state.
> > > 
> > > The condition (s->target_residency_ns <= data->next_timer_ns) in the menu
> > > governor ensures that a physical idle state is not prioritized when a
> > > timer event is expected before the target residency of the first physical
> > > idle state.
> > > 
> > > As a result, the patch has no impact in this case, and performance
> > > remains stable with timer based wakeups.
> > > 
> > > Pipe based wakeup (non-timer wakeup):
> > > 
> > > +------------------------+---------------------+---------------------+
> > > | Metric                 | Without Patch       | With Patch          |
> > > +------------------------+---------------------+---------------------+
> > > | Wakee thread - CPU     | 110                 | 110                 |
> > > | Waker thread - CPU     | 20                  | 20                  |
> > > | Sleep Interval         | 50 us               | 50 us               |
> > > | Total Wakeups          | 97031               | 96583               |
> > > | Avg Wakeup Latency     | 7.070 us            | 4.879 us            |
> > > | Actual Sleep time      | 51.366 us           | 51.605 us           |
> > > +------------------------+---------------------+---------------------+
> > > | Idle State 0 Usage Diff| 1209                | 96,586              |
> > > | Idle State 0 Time Diff | 55,579 ns           | 4,510,003 ns        |
> > > | Idle State 1 Usage Diff| 95,826              | 5                   |
> > > | Idle State 1 Time Diff | 4,522,639 ns        | 198 ns              |
> > > +------------------------+---------------------+---------------------+
> > > | **Total Above Usage**  | 95,824 (98.75%)     | 5 (0.01%)           |
> > > +------------------------+---------------------+---------------------+
> > > | Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
> > > +------------------------+---------------------+---------------------+
> > > 
> > > In the pipe-based wakeup scenario, before the patch was applied, the
> > > "Above" metric was notably high, at around 98.75%. This suggests that
> > > the menu governor frequently selected a deeper idle state such as CEDE
> > > even when the idle period was relatively short.
> > > 
> > > This happened because the menu governor is inclined to prioritize the
> > > physical idle state (CEDE) even when its target residency
> > > (s->target_residency_ns) is longer than the predicted idle time
> > > (predicted_ns); data->next_timer_ns is not relevant here, since these
> > > are non-timer wakeups.
> > > 
> > > In this test, despite the actual idle period being around 50 microseconds,
> > > the menu governor still chose CEDE state, which has a target residency of
> > > 120 microseconds.
> > 
> > And the idle_miss statistics confirm that these were mostly wrong
> > decisions in hindsight.
> > I'll go through the menu code again, this indeed shouldn't be happening.
> > I'd be very surprised if upstream teo performed as badly (or actually badly
> > at all) on this, although it doesn't seem to help your actual workload,
> > would you mind giving that a try too?
> > 
> > Regards,
> > Christian
> 
> So with a workload as static as this, the recent interval detection of menu
> should take effect. It records the last 8 idle interval lengths and does some
> statistics to see if they are somewhat stable, and if so uses those instead
> of the next timer event.
> While I wouldn't vouch for the statistics just yet, the results don't seem
> to be completely unreasonable.
> Do you mind trying

> a) printing some snapshots of these intervals and the
> resulting return value of get_typical_interval()?

Christian,

Here are the stats I get when printing the intervals and corresponding predicted_ns
for an idle duration of 50 us set from user space. In this case, a pipe wakeup with
a 50 us sleep time in user space translates to approximately 34 us of CPU idle time
in the kernel, with very little deviation.

-------------------------------------------------------------------------------------
process        cpu           timestamp   trace
<idle>-0       [200] d....   123.969564: menu_select: CPU[200]: Interval[2] = 34
<idle>-0       [200] d....   123.969564: menu_select: CPU[200]: Interval[3] = 34
<idle>-0       [200] d....   123.969564: menu_select: CPU[200]: Interval[4] = 34
<idle>-0       [200] d....   123.969564: menu_select: CPU[200]: Interval[5] = 34
<idle>-0       [200] d....   123.969565: menu_select: CPU[200]: Interval[6] = 34
<idle>-0       [200] d....   123.969565: menu_select: CPU[200]: Interval[7] = 34
<idle>-0       [200] d....   123.969565: menu_select: CPU[200]: Predicted ns = 34000
<idle>-0       [200] d....   123.969615: menu_select: CPU[200]: Interval[0] = 34
<idle>-0       [200] d....   123.969616: menu_select: CPU[200]: Interval[1] = 34
<idle>-0       [200] d....   123.969616: menu_select: CPU[200]: Interval[2] = 34
<idle>-0       [200] d....   123.969616: menu_select: CPU[200]: Interval[3] = 34
<idle>-0       [200] d....   123.969616: menu_select: CPU[200]: Interval[4] = 34
<idle>-0       [200] d....   123.969616: menu_select: CPU[200]: Interval[5] = 34
<idle>-0       [200] d....   123.969616: menu_select: CPU[200]: Interval[6] = 34
<idle>-0       [200] d....   123.969616: menu_select: CPU[200]: Interval[7] = 34
<idle>-0       [200] d....   123.969617: menu_select: CPU[200]: Predicted ns = 34000
...
...
...
...
<idle>-0       [200] d....   123.969667: menu_select: CPU[200]: Interval[0] = 34
<idle>-0       [200] d....   123.969667: menu_select: CPU[200]: Interval[1] = 34
<idle>-0       [200] d....   123.969667: menu_select: CPU[200]: Interval[2] = 34
<idle>-0       [200] d....   123.969667: menu_select: CPU[200]: Interval[3] = 34
<idle>-0       [200] d....   123.969667: menu_select: CPU[200]: Interval[4] = 34
<idle>-0       [200] d....   123.969668: menu_select: CPU[200]: Interval[5] = 34
<idle>-0       [200] d....   123.969668: menu_select: CPU[200]: Interval[6] = 34
<idle>-0       [200] d....   123.969668: menu_select: CPU[200]: Interval[7] = 33
<idle>-0       [200] d....   123.969668: menu_select: CPU[200]: Predicted ns = 33000

This pattern repeats until I stop the workload.
-------------------------------------------------------------------------------------

> b) Trying this out? Setting these values to some around 50us, that returns
> 50us for me as expected and therefore should not select an idle state above
> that.
> 
> -->8--
> 
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -112,6 +112,8 @@ struct menu_device {
>         int             interval_ptr;
>  };
>  
> +static int intervals_dummy[] = {50, 52, 48, 50, 51, 49, 51, 51};
> +
>  static inline int which_bucket(u64 duration_ns, unsigned int nr_iowaiters)
>  {
>         int bucket = 0;
> @@ -177,7 +179,7 @@ static unsigned int get_typical_interval(struct menu_device *data)
>         sum = 0;
>         divisor = 0;
>         for (i = 0; i < INTERVALS; i++) {
> -               unsigned int value = data->intervals[i];
> +               unsigned int value = intervals_dummy[i];
>                 if (value <= thresh) {
>                         sum += value;
>                         divisor++;
> @@ -200,7 +202,7 @@ static unsigned int get_typical_interval(struct menu_device *data)
>         /* Then try to determine variance */
>         variance = 0;
>         for (i = 0; i < INTERVALS; i++) {
> -               unsigned int value = data->intervals[i];
> +               unsigned int value = intervals_dummy[i];
>                 if (value <= thresh) {
>                         int64_t diff = (int64_t)value - avg;
>                         variance += diff * diff;
> 

Yes, I tried the above patch and I still see the issue:

+-----------------------------+------------------+------------------+------------------+
|           Metric            |     ladder       |      menu        |      teo         |
+=============================+==================+==================+==================+
|           Wakeups           |     579817       |     579298       |     578359       |
+-----------------------------+------------------+------------------+------------------+
| Average wakeup latency (us) |      7.32        |     7.099        |     4.538        |
+-----------------------------+------------------+------------------+------------------+
|     Sleep interval (us)     |      51.722      |    51.768        |     51.852       |
+-----------------------------+------------------+------------------+------------------+
|     State 0 Usage diff      |      0           |     7056         |     578402       |
+-----------------------------+------------------+------------------+------------------+
|   State 0 Time diff (ns)    |      0           |    350579        |    2.75562e+07   |
+-----------------------------+------------------+------------------+------------------+
|     State 0 Above diff      |      0           |      0           |      0           |
+-----------------------------+------------------+------------------+------------------+
|     State 0 Below diff      |      0           |      1           |      0           |
+-----------------------------+------------------+------------------+------------------+
|     State 1 Usage diff      |    580239        |    572168        |      0           |
+-----------------------------+------------------+------------------+------------------+
|   State 1 Time diff (ns)    |   2.74763e+07    |   2.73917e+07    |      0           |
+-----------------------------+------------------+------------------+------------------+
|     State 1 Above diff      |    580140        |    572070        |      0           |
+-----------------------------+------------------+------------------+------------------+
|     State 1 Below diff      |      0           |      0           |      0           |
+-----------------------------+------------------+------------------+------------------+
|         Total Above         |    580140        |    572070        |      0           |
+-----------------------------+------------------+------------------+------------------+
|         Total Below         |      0           |      1           |      0           |
+-----------------------------+------------------+------------------+------------------+
|     Total Above Percent     |     99.98        |     98.76        |      0           |
+-----------------------------+------------------+------------------+------------------+
|     Total Below Percent     |      0           |      0           |      0           |
+-----------------------------+------------------+------------------+------------------+

--------------------------------------------------------------------------------------------
Note:

% Total Above Percent (w.r.t. Total Usage) = (Total Above / Total Usage) * 100
% Total Below Percent (w.r.t. Total Usage) = (Total Below / Total Usage) * 100
--------------------------------------------------------------------------------------------

This confirms that the issue does not stem from the way predicted_ns is
computed in get_typical_interval(), but rather from the menu governor's CPU
idle state selection logic.

Thanks,
Aboorva

