Message-ID: <CAJZ5v0gFbbw18poYfv77d9L0H2N0Q9exW96MNQMVganFkydN3Q@mail.gmail.com>
Date: Mon, 24 Feb 2025 13:35:25 +0100
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Aboorva Devarajan <aboorvad@...ux.ibm.com>
Cc: "Rafael J. Wysocki" <rjw@...ysocki.net>, Linux PM <linux-pm@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>, Daniel Lezcano <daniel.lezcano@...aro.org>,
Christian Loehle <christian.loehle@....com>,
Artem Bityutskiy <artem.bityutskiy@...ux.intel.com>
Subject: Re: [RFT][PATCH v1 0/5] cpuidle: menu: Avoid discarding useful
information when processing recent idle intervals
On Mon, Feb 24, 2025 at 7:27 AM Aboorva Devarajan
<aboorvad@...ux.ibm.com> wrote:
>
> On Thu, 2025-02-06 at 15:21 +0100, Rafael J. Wysocki wrote:
>
> > Hi Everyone,
> >
> > This work had been triggered by a report that commit 0611a640e60a ("eventpoll:
> > prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of
> > the SPECjbb 2015 benchmark [1] to drop by around 50% even though it generally
> > reduced kernel overhead. Indeed, it was found during further investigation
> > that the total interrupt rate while running the SPECjbb workload had fallen as
> > a result of that commit by 55% and the local timer interrupt rate had fallen by
> > almost 80%.
> >
> > That turned out to cause the menu cpuidle governor to select the deepest idle
> > state supplied by the cpuidle driver (intel_idle) much more often which added
> > significantly more idle state latency to the workload and that led to the
> > decrease of the critical-jOPS score.
> >
> > Interestingly enough, this problem was not visible when the teo cpuidle
> > governor was used instead of menu, so it appeared to be specific to the
> > latter. CPU wakeup event statistics collected while running the workload
> > indicated that the menu governor was effectively ignoring non-timer wakeup
> > information and all of its idle state selection decisions appeared to be
> > based on timer wakeups only. Thus, it appeared that the reduction of the
> > local timer interrupt rate caused the governor to predict a long idle
> > duration much more often while running the workload, and the deepest idle
> > state was selected significantly more often as a result of that.
> >
> > A subsequent inspection of the get_typical_interval() function in the menu
> > governor indicated that it might return UINT_MAX too often which then caused
> > the governor's decisions to be based entirely on information related to timers.
> >
> > Generally speaking, UINT_MAX is returned by get_typical_interval() if it
> > cannot make a prediction based on the most recent idle intervals data with
> > sufficiently high confidence, but at least in some cases this means that
> > useful information is not taken into account at all which may lead to
> > significant idle state selection mistakes. Moreover, this is not really
> > unlikely to happen.
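
The confidence test described above can be sketched roughly as follows. This is an illustrative C sketch of the stated thresholds (standard deviation within 1/6 of the average, or within 20 us when the average is near zero), not the actual menu.c code; the function name and interface are made up for illustration:

```c
#include <stdint.h>
#include <limits.h>

/*
 * Illustrative sketch only, not the kernel implementation: accept the
 * average of the recent idle intervals (in microseconds) as the
 * prediction only when the standard deviation is small enough;
 * otherwise return UINT_MAX to signal "no prediction".
 */
static unsigned int predict_or_give_up(uint64_t avg, uint64_t variance)
{
	/* stddev <= avg / 6  <=>  36 * variance <= avg * avg */
	if (avg * avg > variance * 36)
		return (unsigned int)avg;

	/* stddev <= 20 us  <=>  variance <= 400 us^2 */
	if (variance <= 400)
		return (unsigned int)avg;

	return UINT_MAX;	/* not confident enough */
}
```

When UINT_MAX comes back, the governor falls through to timer-based information only, which is the behavior the series addresses.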
> >
> > One issue with get_typical_interval() is that, when it eliminates outliers from
> > the sample set in an attempt to reduce the standard deviation (and so improve
> > the prediction confidence), it does that by dropping high-end samples only,
> > while samples at the low end of the set are retained. However, the samples
> > at the low end very well may be the outliers and they should be eliminated
> > from the sample set instead of the high-end samples. Accordingly, the
> > likelihood of making a meaningful idle duration prediction can be improved
> > by making it also eliminate low-end samples if they are farther from the
> > average than high-end samples. This is done in patch [4/5].
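
The two-sided elimination idea from patch [4/5] can be sketched as below. This is a rough illustration of the selection rule only (drop whichever end sample is farther from the average), with a hypothetical helper name, not the actual patch code:

```c
#include <stddef.h>

/*
 * Illustrative sketch, not the kernel code: instead of always dropping
 * the largest sample, drop the sample farthest from the average,
 * whichever end of the set it is on.  Returns the index of the sample
 * to eliminate.
 */
static size_t pick_outlier(const unsigned int *samples, size_t n)
{
	unsigned long long sum = 0;
	size_t i, imin = 0, imax = 0;

	for (i = 0; i < n; i++) {
		sum += samples[i];
		if (samples[i] < samples[imin])
			imin = i;
		if (samples[i] > samples[imax])
			imax = i;
	}

	unsigned int avg = (unsigned int)(sum / n);

	/* Eliminate the low-end sample if it is farther from the average. */
	return (avg - samples[imin] > samples[imax] - avg) ? imin : imax;
}
```

For example, with samples {1, 100, 110, 120} the low-end sample 1 is the outlier and gets dropped, whereas the old behavior would have discarded 120.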
> >
> > Another issue is that get_typical_interval() gives up after eliminating 1/4
> > of the samples if the standard deviation is still not as low as desired (within
> > 1/6 of the average or within 20 us if the average is close to 0), but the
> > remaining samples in the set still represent useful information at that point
> > and discarding them altogether may lead to suboptimal idle state selection.
> >
> > For instance, the largest idle duration value in the get_typical_interval()
> > data set is the maximum idle duration observed recently and it is likely that
> > the upcoming idle duration will not exceed it. Therefore, in the absence of
> > a better choice, this value can be used as an upper bound on the target
> > residency of the idle state to select. Patch [5/5] works along these lines,
> > but it takes the maximum data point remaining after the elimination of
> > outliers.
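
The fallback from patch [5/5] amounts to something like the following sketch. Again this is illustrative only, with an assumed helper name, not the patch itself:

```c
#include <stddef.h>

/*
 * Illustrative sketch, not the kernel implementation: when no confident
 * average can be computed, the largest sample remaining after outlier
 * elimination still bounds the likely idle duration from above, so it
 * can be used to cap the target residency of the idle state to select.
 */
static unsigned int idle_duration_upper_bound(const unsigned int *samples,
					      size_t n)
{
	unsigned int max = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (samples[i] > max)
			max = samples[i];

	return max;	/* upper bound on the upcoming idle duration */
}
```

The point is that even an unconfident sample set rules out idle states whose target residency exceeds this bound, which is strictly better than ignoring the samples altogether.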
> >
> > The first two patches in the series are straightforward cleanups (in fact,
> > the first patch is kind of reversed by patch [4/5], but it is there because
> > it can be applied without the latter) and patch [3/5] is a cosmetic change
> > made in preparation for the subsequent ones.
> >
> > This series turns out to restore the SPECjbb critical-jOPS metric on affected
> > systems to the level from before commit 0611a640e60a and it also happens to
> > increase its max-jOPS metric by around 3%.
> >
> > For easier reference/testing it is present in the git branch at
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/menu
> >
> > based on the cpuidle material that went into 6.14-rc1.
> >
> > If possible, please let me know if it works for you.
> >
> > Thanks!
> >
> >
> > [1] Link: https://www.spec.org/jbb2015/
> >
> >
> >
>
> Hi Rafael,
>
> I ran some tests using a cpuidle microbenchmark that I have [1]:
>
> The test uses a uniform distribution of idle durations, which means the past eight idle intervals approximately match
> the sleep duration. So, as anticipated, this patch set does not impact this case, and the behavior remains mostly
> consistent after applying the patches.
>
> Pipe based wakeup - Above %
> ----------------------------------------------------------------------
> | Sleep time (s) | Menu No Patch(%) | Menu Patch(%)| Teo No Patch (%) |
> |---------------- |------------------|--------------|------------------|
> | 5 | 0.00% | 0.00% | 0.00% |
> | 10 | 0.00% | 0.00% | 0.00% |
> ------------------------------------------------------------------------
> | 30 | 99.97% | 99.96% | 0.00% |
> | 100 | 98.61% | 98.67% | 0.00% |
> | 120 | 97.88% | 98.42% | 1.03% | -> (*)
> ------------------------------------------------------------------------
> | 130 | 2.82% | 1.26% | 0.22% |
> | 150 | 1.68% | 1.43% | 0.32% |
> | 200 | 2.09% | 1.91% | 0.15% |
> | 300 | 1.22% | 1.22% | 0.19% |
> | 400 | 1.20% | 1.21% | 0.19% |
> | 500 | 1.16% | 1.12% | 0.12% |
> ------------------------------------------------------------------------
>
> Pipe based wakeup - Below %
> -----------------------------------------------------------------------
> | Sleep time (s) | Menu No Patch(%) | Menu Patch(%)| Teo No Patch (%) |
> |---------------- |------------------|--------------|------------------|
> | 5 | 0.00% | 0.00% | 0.00% |
> | 10 | 0.00% | 0.00% | 0.00% |
> | 30 | 0.00% | 0.00% | 0.00% |
> | 100 | 0.00% | 0.00% | 0.00% |
> | 120 | 2.76% | 1.14% | 0.93% |
> | 130 | 3.11% | 1.13% | 0.12% |
> | 150 | 1.34% | 1.16% | 0.18% |
> | 200 | 1.38% | 1.15% | 0.09% |
> | 300 | 1.33% | 1.21% | 0.11% |
> | 400 | 1.36% | 1.25% | 0.10% |
> | 500 | 1.25% | 1.22% | 0.10% |
> |---------------- |------------------|--------------|------------------|
>
> Time based wakeup - Above %
> -----------------------------------------------------------------------
> | Sleep time (s) | Menu No Patch(%) | Menu Patch(%)| Teo No Patch (%) |
> |---------------- |------------------|--------------|------------------|
> | 5 | 0.00% | 0.00% | 0.00% |
> | 10 | 0.00% | 0.00% | 0.00% |
> | 30 | 0.00% | 0.00% | 0.00% |
> | 100 | 0.01% | 0.00% | 0.00% |
> | 120 | 0.00% | 0.00% | 0.15% |
> | 130 | 15.84% | 0.14% | 0.23% |
> | 150 | 0.39% | 0.23% | 0.48% |
> | 200 | 0.95% | 0.87% | 0.10% |
> | 300 | 0.20% | 0.17% | 0.15% |
> | 400 | 0.14% | 0.12% | 0.17% |
> | 500 | 0.10% | 0.20% | 0.11% |
> |---------------- |------------------|--------------|------------------|
>
> Time based wakeup - Below %
> -----------------------------------------------------------------------
> | Sleep time (s) | Menu No Patch(%) | Menu Patch(%)| Teo No Patch (%) |
> |---------------- |------------------|--------------|------------------|
> | 5 | 0.00% | 0.00% | 0.00% |
> | 10 | 0.00% | 0.00% | 0.00% |
> | 30 | 0.00% | 0.00% | 0.00% |
> | 100 | 0.00% | 0.00% | 0.00% |
> | 120 | 1.85% | 1.66% | 2.67% |
> | 130 | 16.71% | 1.13% | 1.11% |
> | 150 | 1.36% | 1.16% | 1.13% |
> | 200 | 1.33% | 1.14% | 1.19% |
> | 300 | 1.44% | 1.20% | 1.17% |
> | 400 | 1.51% | 1.21% | 1.21% |
> | 500 | 1.42% | 1.24% | 1.25% |
> |---------------- |------------------|--------------|------------------|
>
>
> (*) - The above and below percentages are high even without the patch with the
>       menu governor; this issue still persists, as previously reported in [2].
>       I will investigate further and submit a revision to get additional feedback.
>
> I also carried out some benchmarks using pgbench:
>
> pgbench Results
>
> Without Patch:
> -----------------------------------------------------------------------------
> | Run | Transactions Processed | Latency Avg (ms) | TPS (without init time) |
> -----------------------------------------------------------------------------
> | 1 | 11,936,327 | 0.050 | 198,946.141025 |
> | 2 | 11,899,540 | 0.050 | 198,333.097547 |
> | 3 | 11,870,792 | 0.051 | 197,853.728614 |
> | 4 | 11,901,670 | 0.050 | 198,368.526139 |
> | 5 | 11,922,046 | 0.050 | 198,708.112243 |
> -----------------------------------------------------------------------------
> | Avg | 11,906,075 | 0.0502 | 198,441.921114 |
> -----------------------------------------------------------------------------
>
> With Patch:
> -----------------------------------------------------------------------------
> | Run | Transactions Processed | Latency Avg (ms) | TPS (without init time) |
> -----------------------------------------------------------------------------
> | 1 | 12,052,865 | 0.050 | 200,888.492771 |
> | 2 | 12,058,359 | 0.050 | 200,979.895325 |
> | 3 | 12,071,012 | 0.050 | 201,190.809734 |
> | 4 | 12,054,646 | 0.050 | 200,918.076736 |
> | 5 | 12,053,087 | 0.050 | 200,892.045581 |
> -----------------------------------------------------------------------------
> | Avg | 12,058,394 | 0.0500 | 200,973.464029 |
> ------------------------------------------------------------------------------
>
> Performance Improvement After Patch:
> --------------------------------------------------------------------------------------
> | Metric | Without Patch (Avg.)| With Patch (Avg.) | % Improvement |
> --------------------------------------------------------------------------------------
> | Transactions Processed | 11,906,075 | 12,058,394 | +1.28% |
> | Latency Avg (ms) | 0.0502 | 0.0500 | -0.40% |
> | TPS (without init time) | 198,441.921114 | 200,973.464029 | +1.28% |
> --------------------------------------------------------------------------------------
>
> So with the patch, pgbench's "Transactions Processed" improves by ~1.28%, and I did not observe
> any major variation in other benchmarks with the patch.
>
> So for the entire series:
>
> Tested-by: Aboorva Devarajan <aboorvad@...ux.ibm.com>
Thank you!
> I'm also running a minimal unit fuzz test against the pre- and post-patch versions of the
> get_typical_interval() function to understand this better; I will post the results soon.
>
> [1] - https://github.com/AboorvaDevarajan/linux-utils/tree/main/cpuidle/cpuidle_wakeup
> [2] - https://lore.kernel.org/all/20240809073120.250974-1-aboorvad@linux.ibm.com/