Message-ID: <d0c013d5d2a9251d5dc468446f2a08ae8a7a8953.camel@linux.ibm.com>
Date: Mon, 24 Feb 2025 11:57:22 +0530
From: Aboorva Devarajan <aboorvad@...ux.ibm.com>
To: "Rafael J. Wysocki" <rjw@...ysocki.net>, Linux PM <linux-pm@...r.kernel.org>
Cc: LKML <linux-kernel@...r.kernel.org>, Daniel Lezcano <daniel.lezcano@...aro.org>, Christian Loehle <christian.loehle@....com>, Artem Bityutskiy <artem.bityutskiy@...ux.intel.com>, aboorvad@...ux.ibm.com
Subject: Re: [RFT][PATCH v1 0/5] cpuidle: menu: Avoid discarding useful information when processing recent idle intervals
On Thu, 2025-02-06 at 15:21 +0100, Rafael J. Wysocki wrote:
> Hi Everyone,
>
> This work had been triggered by a report that commit 0611a640e60a ("eventpoll:
> prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of
> the SPECjbb 2015 benchmark [1] to drop by around 50% even though it generally
> reduced kernel overhead. Indeed, it was found during further investigation
> that the total interrupt rate while running the SPECjbb workload had fallen as
> a result of that commit by 55% and the local timer interrupt rate had fallen by
> almost 80%.
>
> That turned out to cause the menu cpuidle governor to select the deepest idle
> state supplied by the cpuidle driver (intel_idle) much more often which added
> significantly more idle state latency to the workload and that led to the
> decrease of the critical-jOPS score.
>
> Interestingly enough, this problem was not visible when the teo cpuidle
> governor was used instead of menu, so it appeared to be specific to the
> latter. CPU wakeup event statistics collected while running the workload
> indicated that the menu governor was effectively ignoring non-timer wakeup
> information and all of its idle state selection decisions appeared to be
> based on timer wakeups only. Thus, it appeared that the reduction of the
> local timer interrupt rate caused the governor to predict a long idle duration
> much more often while running the workload and the deepest idle state was
> selected significantly more often as a result of that.
>
> A subsequent inspection of the get_typical_interval() function in the menu
> governor indicated that it might return UINT_MAX too often which then caused
> the governor's decisions to be based entirely on information related to timers.
>
> Generally speaking, UINT_MAX is returned by get_typical_interval() if it
> cannot make a prediction based on the most recent idle intervals data with
> sufficiently high confidence, but at least in some cases this means that
> useful information is not taken into account at all which may lead to
> significant idle state selection mistakes. Moreover, this is not really
> unlikely to happen.
>
> One issue with get_typical_interval() is that, when it eliminates outliers from
> the sample set in an attempt to reduce the standard deviation (and so improve
> the prediction confidence), it does that by dropping high-end samples only,
> while samples at the low end of the set are retained. However, the samples
> at the low end very well may be the outliers and they should be eliminated
> from the sample set instead of the high-end samples. Accordingly, the
> likelihood of making a meaningful idle duration prediction can be improved
> by making it also eliminate low-end samples if they are farther from the
> average than high-end samples. This is done in patch [4/5].
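If I understand patch [4/5] correctly, the two-sided elimination can be
sketched roughly as follows (an illustrative toy version, not the actual
menu.c code; the 20 us small-average special case is omitted, and
eliminated slots are marked with UINT_MAX):

```c
#include <limits.h>

/*
 * Illustrative sketch of two-sided outlier elimination: compute the
 * average and variance of the surviving samples; if the standard
 * deviation is within 1/6 of the average, return the average as the
 * prediction.  Otherwise drop whichever extreme sample (min or max)
 * lies farther from the average, and give up once more than 1/4 of
 * the samples have been dropped.
 */
static unsigned int predict_interval(unsigned int samples[], int count)
{
    int live = count;

    while (live * 4 > count * 3) {
        unsigned long long sum = 0, sq = 0;
        unsigned int min = UINT_MAX, max = 0, avg;
        int i, imin = 0, imax = 0;

        for (i = 0; i < count; i++) {
            unsigned int v = samples[i];

            if (v == UINT_MAX)      /* already eliminated */
                continue;
            sum += v;
            if (v < min) { min = v; imin = i; }
            if (v > max) { max = v; imax = i; }
        }
        avg = sum / live;

        for (i = 0; i < count; i++) {
            long long d;

            if (samples[i] == UINT_MAX)
                continue;
            d = (long long)samples[i] - avg;
            sq += (unsigned long long)(d * d);
        }

        /* stddev <= avg / 6  <=>  36 * variance <= avg^2 */
        if (36 * (sq / live) <= (unsigned long long)avg * avg)
            return avg;

        /* drop the extreme that is farther from the average */
        if (avg - min > max - avg)
            samples[imin] = UINT_MAX;   /* low-end outlier */
        else
            samples[imax] = UINT_MAX;   /* high-end outlier */
        live--;
    }

    return UINT_MAX;    /* no confident prediction */
}
```

For example, with samples {100, 98, 102, 1, 99, 101, 100, 99} this drops
the low-end outlier (1) and still produces a prediction, whereas
high-end-only elimination would run out of samples to drop and give up.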
>
> Another issue is that get_typical_interval() gives up after eliminating 1/4
> of the samples if the standard deviation is still not as low as desired (within
> 1/6 of the average or within 20 us if the average is close to 0), but the
> remaining samples in the set still represent useful information at that point
> and discarding them altogether may lead to suboptimal idle state selection.
>
> For instance, the largest idle duration value in the get_typical_interval()
> data set is the maximum idle duration observed recently and it is likely that
> the upcoming idle duration will not exceed it. Therefore, in the absence of
> a better choice, this value can be used as an upper bound on the target
> residency of the idle state to select. Patch [5/5] works along these lines,
> but it takes the maximum data point remaining after the elimination of
> outliers.
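And the fallback in patch [5/5], as I understand it, amounts to something
like this on the give-up path (again just an illustrative sketch,
assuming eliminated slots are marked with UINT_MAX):

```c
#include <limits.h>

/*
 * Illustrative sketch of the fallback described above: when no
 * confident average can be computed, return the largest sample that
 * survived outlier elimination, to be used as an upper bound on the
 * target residency, instead of discarding all of the information by
 * returning UINT_MAX.
 */
static unsigned int max_survivor(const unsigned int samples[], int count)
{
    unsigned int max = 0;
    int i;

    for (i = 0; i < count; i++) {
        if (samples[i] == UINT_MAX)     /* eliminated outlier */
            continue;
        if (samples[i] > max)
            max = samples[i];
    }
    return max ? max : UINT_MAX;        /* UINT_MAX if nothing survived */
}
```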
>
> The first two patches in the series are straightforward cleanups (in fact,
> the first patch is kind of reversed by patch [4/5], but it is there because
> it can be applied without the latter) and patch [3/5] is a cosmetic change
> made in preparation for the subsequent ones.
>
> This series turns out to restore the SPECjbb critical-jOPS metric on affected
> systems to the level from before commit 0611a640e60a and it also happens to
> increase its max-jOPS metric by around 3%.
>
> For easier reference/testing it is present in the git branch at
>
> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/menu
>
> based on the cpuidle material that went into 6.14-rc1.
>
> If possible, please let me know if it works for you.
>
> Thanks!
>
>
> [1] Link: https://www.spec.org/jbb2015/
>
>
>
Hi Rafael,

I ran some tests using a cpuidle microbenchmark that I have [1].

The test uses a uniform distribution of idle durations, so the past
eight idle intervals approximately match the sleep duration. As
anticipated, this change set does not impact that case, and the behavior
remains mostly consistent even after applying the patches.
Pipe based wakeup - Above %
----------------------------------------------------------------------
| Sleep time (us) | Menu No Patch(%) | Menu Patch(%)| Teo No Patch (%) |
|---------------- |------------------|--------------|------------------|
| 5 | 0.00% | 0.00% | 0.00% |
| 10 | 0.00% | 0.00% | 0.00% |
------------------------------------------------------------------------
| 30 | 99.97% | 99.96% | 0.00% |
| 100 | 98.61% | 98.67% | 0.00% |
| 120 | 97.88% | 98.42% | 1.03% | -> (*)
------------------------------------------------------------------------
| 130 | 2.82% | 1.26% | 0.22% |
| 150 | 1.68% | 1.43% | 0.32% |
| 200 | 2.09% | 1.91% | 0.15% |
| 300 | 1.22% | 1.22% | 0.19% |
| 400 | 1.20% | 1.21% | 0.19% |
| 500 | 1.16% | 1.12% | 0.12% |
------------------------------------------------------------------------
Pipe based wakeup - Below %
-----------------------------------------------------------------------
| Sleep time (us) | Menu No Patch(%) | Menu Patch(%)| Teo No Patch (%) |
|---------------- |------------------|--------------|------------------|
| 5 | 0.00% | 0.00% | 0.00% |
| 10 | 0.00% | 0.00% | 0.00% |
| 30 | 0.00% | 0.00% | 0.00% |
| 100 | 0.00% | 0.00% | 0.00% |
| 120 | 2.76% | 1.14% | 0.93% |
| 130 | 3.11% | 1.13% | 0.12% |
| 150 | 1.34% | 1.16% | 0.18% |
| 200 | 1.38% | 1.15% | 0.09% |
| 300 | 1.33% | 1.21% | 0.11% |
| 400 | 1.36% | 1.25% | 0.10% |
| 500 | 1.25% | 1.22% | 0.10% |
|---------------- |------------------|--------------|------------------|
Time based wakeup - Above %
-----------------------------------------------------------------------
| Sleep time (us) | Menu No Patch(%) | Menu Patch(%)| Teo No Patch (%) |
|---------------- |------------------|--------------|------------------|
| 5 | 0.00% | 0.00% | 0.00% |
| 10 | 0.00% | 0.00% | 0.00% |
| 30 | 0.00% | 0.00% | 0.00% |
| 100 | 0.01% | 0.00% | 0.00% |
| 120 | 0.00% | 0.00% | 0.15% |
| 130 | 15.84% | 0.14% | 0.23% |
| 150 | 0.39% | 0.23% | 0.48% |
| 200 | 0.95% | 0.87% | 0.10% |
| 300 | 0.20% | 0.17% | 0.15% |
| 400 | 0.14% | 0.12% | 0.17% |
| 500 | 0.10% | 0.20% | 0.11% |
|---------------- |------------------|--------------|------------------|
Time based wakeup - Below %
-----------------------------------------------------------------------
| Sleep time (us) | Menu No Patch(%) | Menu Patch(%)| Teo No Patch (%) |
|---------------- |------------------|--------------|------------------|
| 5 | 0.00% | 0.00% | 0.00% |
| 10 | 0.00% | 0.00% | 0.00% |
| 30 | 0.00% | 0.00% | 0.00% |
| 100 | 0.00% | 0.00% | 0.00% |
| 120 | 1.85% | 1.66% | 2.67% |
| 130 | 16.71% | 1.13% | 1.11% |
| 150 | 1.36% | 1.16% | 1.13% |
| 200 | 1.33% | 1.14% | 1.19% |
| 300 | 1.44% | 1.20% | 1.17% |
| 400 | 1.51% | 1.21% | 1.21% |
| 500 | 1.42% | 1.24% | 1.25% |
|---------------- |------------------|--------------|------------------|
(*) - The above/below percentages are higher with the menu governor even
without the patch; this issue still persists, as previously reported in
[2]. I will investigate further and submit a revision to get additional
feedback.
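For reference, the Above/Below columns follow the usual cpuidle
accounting: a wakeup counts as "above" when the selected state was
deeper than the observed sleep justified, and "below" when an even
deeper state's target residency would still have been met. A minimal
sketch of that classification (the benchmark in [1] has the full logic):

```c
#include <limits.h>

/*
 * Sketch of the Above/Below classification used in the tables:
 * "above" means the chosen state was too deep (its target residency
 * exceeds the observed sleep length), "below" means it was too
 * shallow (the next deeper state's target residency would also have
 * been met).
 */
struct wake_stats {
    unsigned long above, below, total;
};

static void classify_wakeup(struct wake_stats *s,
                            unsigned int chosen_res_us,  /* chosen state */
                            unsigned int deeper_res_us,  /* UINT_MAX if none */
                            unsigned int slept_us)
{
    s->total++;
    if (chosen_res_us > slept_us)
        s->above++;                     /* state too deep */
    else if (deeper_res_us <= slept_us)
        s->below++;                     /* state too shallow */
}
```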
I also carried out some benchmarks using pgbench:
pgbench Results
Without Patch:
-----------------------------------------------------------------------------
| Run | Transactions Processed | Latency Avg (ms) | TPS (without init time) |
-----------------------------------------------------------------------------
| 1 | 11,936,327 | 0.050 | 198,946.141025 |
| 2 | 11,899,540 | 0.050 | 198,333.097547 |
| 3 | 11,870,792 | 0.051 | 197,853.728614 |
| 4 | 11,901,670 | 0.050 | 198,368.526139 |
| 5 | 11,922,046 | 0.050 | 198,708.112243 |
-----------------------------------------------------------------------------
| Avg | 11,906,075 | 0.0502 | 198,441.921114 |
-----------------------------------------------------------------------------
With Patch:
-----------------------------------------------------------------------------
| Run | Transactions Processed | Latency Avg (ms) | TPS (without init time) |
-----------------------------------------------------------------------------
| 1 | 12,052,865 | 0.050 | 200,888.492771 |
| 2 | 12,058,359 | 0.050 | 200,979.895325 |
| 3 | 12,071,012 | 0.050 | 201,190.809734 |
| 4 | 12,054,646 | 0.050 | 200,918.076736 |
| 5 | 12,053,087 | 0.050 | 200,892.045581 |
-----------------------------------------------------------------------------
| Avg | 12,058,394 | 0.0500 | 200,973.464029 |
------------------------------------------------------------------------------
Performance Improvement After Patch:
--------------------------------------------------------------------------------------
| Metric | Without Patch (Avg.)| With Patch (Avg.) | % Improvement |
--------------------------------------------------------------------------------------
| Transactions Processed | 11,906,075 | 12,058,394 | +1.28% |
| Latency Avg (ms) | 0.0502 | 0.0500 | -0.40% |
| TPS (without init time) | 198,441.921114 | 200,973.464029 | +1.28% |
--------------------------------------------------------------------------------------
So with the patch, pgbench's "Transactions Processed" metric improves by
~1.28%, and I did not observe any major variation in other benchmarks
with the patch.
So for the entire series:
Tested-by: Aboorva Devarajan <aboorvad@...ux.ibm.com>
I am also trying a minimal unit fuzz test with the pre- and post-patch
versions of the get_typical_interval() function to understand this
better; I will post the results soon.
[1] - https://github.com/AboorvaDevarajan/linux-utils/tree/main/cpuidle/cpuidle_wakeup
[2] - https://lore.kernel.org/all/20240809073120.250974-1-aboorvad@linux.ibm.com/
Thanks,
Aboorva