Message-ID: <d0c013d5d2a9251d5dc468446f2a08ae8a7a8953.camel@linux.ibm.com>
Date: Mon, 24 Feb 2025 11:57:22 +0530
From: Aboorva Devarajan <aboorvad@...ux.ibm.com>
To: "Rafael J. Wysocki" <rjw@...ysocki.net>, Linux PM <linux-pm@...r.kernel.org>
Cc: LKML <linux-kernel@...r.kernel.org>, Daniel Lezcano <daniel.lezcano@...aro.org>, Christian Loehle <christian.loehle@....com>, Artem Bityutskiy <artem.bityutskiy@...ux.intel.com>, aboorvad@...ux.ibm.com
Subject: Re: [RFT][PATCH v1 0/5] cpuidle: menu: Avoid discarding useful information when processing recent idle intervals
On Thu, 2025-02-06 at 15:21 +0100, Rafael J. Wysocki wrote:
> Hi Everyone,
>
> This work had been triggered by a report that commit 0611a640e60a ("eventpoll:
> prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of
> the SPECjbb 2015 benchmark [1] to drop by around 50% even though it generally
> reduced kernel overhead. Indeed, it was found during further investigation
> that the total interrupt rate while running the SPECjbb workload had fallen as
> a result of that commit by 55% and the local timer interrupt rate had fallen by
> almost 80%.
>
> That turned out to cause the menu cpuidle governor to select the deepest idle
> state supplied by the cpuidle driver (intel_idle) much more often which added
> significantly more idle state latency to the workload and that led to the
> decrease of the critical-jOPS score.
>
> Interestingly enough, this problem was not visible when the teo cpuidle
> governor was used instead of menu, so it appeared to be specific to the
> latter. CPU wakeup event statistics collected while running the workload
> indicated that the menu governor was effectively ignoring non-timer wakeup
> information and all of its idle state selection decisions appeared to be
> based on timer wakeups only. Thus, it appeared that the reduction of the
> local timer interrupt rate caused the governor to predict a long idle duration
> much more often while running the workload and the deepest idle state was
> selected significantly more often as a result of that.
>
> A subsequent inspection of the get_typical_interval() function in the menu
> governor indicated that it might return UINT_MAX too often which then caused
> the governor's decisions to be based entirely on information related to timers.
>
> Generally speaking, UINT_MAX is returned by get_typical_interval() if it
> cannot make a prediction based on the most recent idle intervals data with
> sufficiently high confidence, but at least in some cases this means that
> useful information is not taken into account at all which may lead to
> significant idle state selection mistakes. Moreover, this is not really
> unlikely to happen.
>
> One issue with get_typical_interval() is that, when it eliminates outliers from
> the sample set in an attempt to reduce the standard deviation (and so improve
> the prediction confidence), it does that by dropping high-end samples only,
> while samples at the low end of the set are retained. However, the samples
> at the low end very well may be the outliers and they should be eliminated
> from the sample set instead of the high-end samples. Accordingly, the
> likelihood of making a meaningful idle duration prediction can be improved
> by making it also eliminate low-end samples if they are farther from the
> average than high-end samples. This is done in patch [4/5].
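If I understand patch [4/5] correctly, the two-sided elimination can be
sketched roughly as follows (an illustrative toy version, not the actual
menu.c code; the 20 us small-average special case is omitted, and
eliminated slots are marked with UINT_MAX):

```c
#include <limits.h>

/*
 * Illustrative sketch of two-sided outlier elimination: compute the
 * average and variance of the surviving samples; if the standard
 * deviation is within 1/6 of the average, return the average as the
 * prediction.  Otherwise drop whichever extreme sample (min or max)
 * lies farther from the average, and give up once more than 1/4 of
 * the samples have been dropped.
 */
static unsigned int predict_interval(unsigned int samples[], int count)
{
    int live = count;

    while (live * 4 > count * 3) {
        unsigned long long sum = 0, sq = 0;
        unsigned int min = UINT_MAX, max = 0, avg;
        int i, imin = 0, imax = 0;

        for (i = 0; i < count; i++) {
            unsigned int v = samples[i];

            if (v == UINT_MAX)      /* already eliminated */
                continue;
            sum += v;
            if (v < min) { min = v; imin = i; }
            if (v > max) { max = v; imax = i; }
        }
        avg = sum / live;

        for (i = 0; i < count; i++) {
            long long d;

            if (samples[i] == UINT_MAX)
                continue;
            d = (long long)samples[i] - avg;
            sq += (unsigned long long)(d * d);
        }

        /* stddev <= avg / 6  <=>  36 * variance <= avg^2 */
        if (36 * (sq / live) <= (unsigned long long)avg * avg)
            return avg;

        /* drop the extreme that is farther from the average */
        if (avg - min > max - avg)
            samples[imin] = UINT_MAX;   /* low-end outlier */
        else
            samples[imax] = UINT_MAX;   /* high-end outlier */
        live--;
    }

    return UINT_MAX;    /* no confident prediction */
}
```

For example, with samples {100, 98, 102, 1, 99, 101, 100, 99} this drops
the low-end outlier (1) and still produces a prediction, whereas
high-end-only elimination would run out of samples to drop and give up.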
>
> Another issue is that get_typical_interval() gives up after eliminating 1/4
> of the samples if the standard deviation is still not as low as desired (within
> 1/6 of the average or within 20 us if the average is close to 0), but the
> remaining samples in the set still represent useful information at that point
> and discarding them altogether may lead to suboptimal idle state selection.
>
> For instance, the largest idle duration value in the get_typical_interval()
> data set is the maximum idle duration observed recently and it is likely that
> the upcoming idle duration will not exceed it. Therefore, in the absence of
> a better choice, this value can be used as an upper bound on the target
> residency of the idle state to select. Patch [5/5] works along these lines,
> but it takes the maximum data point remaining after the elimination of
> outliers.
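And the fallback in patch [5/5], as I understand it, amounts to something
like this on the give-up path (again just an illustrative sketch,
assuming eliminated slots are marked with UINT_MAX):

```c
#include <limits.h>

/*
 * Illustrative sketch of the fallback described above: when no
 * confident average can be computed, return the largest sample that
 * survived outlier elimination, to be used as an upper bound on the
 * target residency, instead of discarding all of the information by
 * returning UINT_MAX.
 */
static unsigned int max_survivor(const unsigned int samples[], int count)
{
    unsigned int max = 0;
    int i;

    for (i = 0; i < count; i++) {
        if (samples[i] == UINT_MAX)     /* eliminated outlier */
            continue;
        if (samples[i] > max)
            max = samples[i];
    }
    return max ? max : UINT_MAX;        /* UINT_MAX if nothing survived */
}
```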
>
> The first two patches in the series are straightforward cleanups (in fact,
> the first patch is kind of reversed by patch [4/5], but it is there because
> it can be applied without the latter) and patch [3/5] is a cosmetic change
> made in preparation for the subsequent ones.
>
> This series turns out to restore the SPECjbb critical-jOPS metric on affected
> systems to the level from before commit 0611a640e60a and it also happens to
> increase its max-jOPS metric by around 3%.
>
> For easier reference/testing it is present in the git branch at
>
> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/menu
>
> based on the cpuidle material that went into 6.14-rc1.
>
> If possible, please let me know if it works for you.
>
> Thanks!
>
>
> [1] Link: https://www.spec.org/jbb2015/
>
>
>
Hi Rafael,

I ran some tests using a cpuidle microbenchmark that I have [1].

The test uses a uniform distribution of idle durations, so the past
eight idle intervals approximately match the sleep duration. As
anticipated, this change set does not impact that case, and the behavior
remains mostly consistent even after applying the patches.
Pipe based wakeup - Above %
----------------------------------------------------------------------
| Sleep time (us) | Menu No Patch(%) | Menu Patch(%)| Teo No Patch (%) |
|---------------- |------------------|--------------|------------------|
| 5 | 0.00% | 0.00% | 0.00% |
| 10 | 0.00% | 0.00% | 0.00% |
------------------------------------------------------------------------
| 30 | 99.97% | 99.96% | 0.00% |
| 100 | 98.61% | 98.67% | 0.00% |
| 120 | 97.88% | 98.42% | 1.03% | -> (*)
------------------------------------------------------------------------
| 130 | 2.82% | 1.26% | 0.22% |
| 150 | 1.68% | 1.43% | 0.32% |
| 200 | 2.09% | 1.91% | 0.15% |
| 300 | 1.22% | 1.22% | 0.19% |
| 400 | 1.20% | 1.21% | 0.19% |
| 500 | 1.16% | 1.12% | 0.12% |
------------------------------------------------------------------------
Pipe based wakeup - Below %
-----------------------------------------------------------------------
| Sleep time (us) | Menu No Patch(%) | Menu Patch(%)| Teo No Patch (%) |
|---------------- |------------------|--------------|------------------|
| 5 | 0.00% | 0.00% | 0.00% |
| 10 | 0.00% | 0.00% | 0.00% |
| 30 | 0.00% | 0.00% | 0.00% |
| 100 | 0.00% | 0.00% | 0.00% |
| 120 | 2.76% | 1.14% | 0.93% |
| 130 | 3.11% | 1.13% | 0.12% |
| 150 | 1.34% | 1.16% | 0.18% |
| 200 | 1.38% | 1.15% | 0.09% |
| 300 | 1.33% | 1.21% | 0.11% |
| 400 | 1.36% | 1.25% | 0.10% |
| 500 | 1.25% | 1.22% | 0.10% |
|---------------- |------------------|--------------|------------------|
Time based wakeup - Above %
-----------------------------------------------------------------------
| Sleep time (us) | Menu No Patch(%) | Menu Patch(%)| Teo No Patch (%) |
|---------------- |------------------|--------------|------------------|
| 5 | 0.00% | 0.00% | 0.00% |
| 10 | 0.00% | 0.00% | 0.00% |
| 30 | 0.00% | 0.00% | 0.00% |
| 100 | 0.01% | 0.00% | 0.00% |
| 120 | 0.00% | 0.00% | 0.15% |
| 130 | 15.84% | 0.14% | 0.23% |
| 150 | 0.39% | 0.23% | 0.48% |
| 200 | 0.95% | 0.87% | 0.10% |
| 300 | 0.20% | 0.17% | 0.15% |
| 400 | 0.14% | 0.12% | 0.17% |
| 500 | 0.10% | 0.20% | 0.11% |
|---------------- |------------------|--------------|------------------|
Time based wakeup - Below %
-----------------------------------------------------------------------
| Sleep time (us) | Menu No Patch(%) | Menu Patch(%)| Teo No Patch (%) |
|---------------- |------------------|--------------|------------------|
| 5 | 0.00% | 0.00% | 0.00% |
| 10 | 0.00% | 0.00% | 0.00% |
| 30 | 0.00% | 0.00% | 0.00% |
| 100 | 0.00% | 0.00% | 0.00% |
| 120 | 1.85% | 1.66% | 2.67% |
| 130 | 16.71% | 1.13% | 1.11% |
| 150 | 1.36% | 1.16% | 1.13% |
| 200 | 1.33% | 1.14% | 1.19% |
| 300 | 1.44% | 1.20% | 1.17% |
| 400 | 1.51% | 1.21% | 1.21% |
| 500 | 1.42% | 1.24% | 1.25% |
|---------------- |------------------|--------------|------------------|
(*) - The above/below percentages are higher with the menu governor even
without the patch; this issue still persists, as previously reported in
[2]. I will investigate further and submit a revision to get additional
feedback.
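For reference, the Above/Below columns follow the usual cpuidle
accounting: a wakeup counts as "above" when the selected state was
deeper than the observed sleep justified, and "below" when an even
deeper state's target residency would still have been met. A minimal
sketch of that classification (the benchmark in [1] has the full logic):

```c
#include <limits.h>

/*
 * Sketch of the Above/Below classification used in the tables:
 * "above" means the chosen state was too deep (its target residency
 * exceeds the observed sleep length), "below" means it was too
 * shallow (the next deeper state's target residency would also have
 * been met).
 */
struct wake_stats {
    unsigned long above, below, total;
};

static void classify_wakeup(struct wake_stats *s,
                            unsigned int chosen_res_us,  /* chosen state */
                            unsigned int deeper_res_us,  /* UINT_MAX if none */
                            unsigned int slept_us)
{
    s->total++;
    if (chosen_res_us > slept_us)
        s->above++;                     /* state too deep */
    else if (deeper_res_us <= slept_us)
        s->below++;                     /* state too shallow */
}
```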
I also carried out some benchmarks using pgbench:
pgbench Results
Without Patch:
-----------------------------------------------------------------------------
| Run | Transactions Processed | Latency Avg (ms) | TPS (without init time) |
-----------------------------------------------------------------------------
| 1 | 11,936,327 | 0.050 | 198,946.141025 |
| 2 | 11,899,540 | 0.050 | 198,333.097547 |
| 3 | 11,870,792 | 0.051 | 197,853.728614 |
| 4 | 11,901,670 | 0.050 | 198,368.526139 |
| 5 | 11,922,046 | 0.050 | 198,708.112243 |
-----------------------------------------------------------------------------
| Avg | 11,906,075 | 0.0502 | 198,441.921114 |
-----------------------------------------------------------------------------
With Patch:
-----------------------------------------------------------------------------
| Run | Transactions Processed | Latency Avg (ms) | TPS (without init time) |
-----------------------------------------------------------------------------
| 1 | 12,052,865 | 0.050 | 200,888.492771 |
| 2 | 12,058,359 | 0.050 | 200,979.895325 |
| 3 | 12,071,012 | 0.050 | 201,190.809734 |
| 4 | 12,054,646 | 0.050 | 200,918.076736 |
| 5 | 12,053,087 | 0.050 | 200,892.045581 |
-----------------------------------------------------------------------------
| Avg | 12,058,394 | 0.0500 | 200,973.464029 |
------------------------------------------------------------------------------
Performance Improvement After Patch:
--------------------------------------------------------------------------------------
| Metric | Without Patch (Avg.)| With Patch (Avg.) | % Improvement |
--------------------------------------------------------------------------------------
| Transactions Processed | 11,906,075 | 12,058,394 | +1.28% |
| Latency Avg (ms) | 0.0502 | 0.0500 | -0.40% |
| TPS (without init time) | 198,441.921114 | 200,973.464029 | +1.28% |
--------------------------------------------------------------------------------------
So with the patch, pgbench's "Transactions Processed" metric improves by
~1.28%, and I did not observe any major variation in other benchmarks
with the patch.
So for the entire series:
Tested-by: Aboorva Devarajan <aboorvad@...ux.ibm.com>
I am also trying a minimal unit fuzz test with the pre- and post-patch
versions of the get_typical_interval() function to understand this
better; I will post the results soon.
[1] - https://github.com/AboorvaDevarajan/linux-utils/tree/main/cpuidle/cpuidle_wakeup
[2] - https://lore.kernel.org/all/20240809073120.250974-1-aboorvad@linux.ibm.com/
Thanks,
Aboorva