Message-ID: <CAJZ5v0gFbbw18poYfv77d9L0H2N0Q9exW96MNQMVganFkydN3Q@mail.gmail.com>
Date: Mon, 24 Feb 2025 13:35:25 +0100
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Aboorva Devarajan <aboorvad@...ux.ibm.com>
Cc: "Rafael J. Wysocki" <rjw@...ysocki.net>, Linux PM <linux-pm@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>, Daniel Lezcano <daniel.lezcano@...aro.org>,
Christian Loehle <christian.loehle@....com>,
Artem Bityutskiy <artem.bityutskiy@...ux.intel.com>
Subject: Re: [RFT][PATCH v1 0/5] cpuidle: menu: Avoid discarding useful
information when processing recent idle intervals
On Mon, Feb 24, 2025 at 7:27 AM Aboorva Devarajan
<aboorvad@...ux.ibm.com> wrote:
>
> On Thu, 2025-02-06 at 15:21 +0100, Rafael J. Wysocki wrote:
>
> > Hi Everyone,
> >
> > This work had been triggered by a report that commit 0611a640e60a ("eventpoll:
> > prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of
> > the SPECjbb 2015 benchmark [1] to drop by around 50% even though it generally
> > reduced kernel overhead. Indeed, it was found during further investigation
> > that the total interrupt rate while running the SPECjbb workload had fallen as
> > a result of that commit by 55% and the local timer interrupt rate had fallen by
> > almost 80%.
> >
> > That turned out to cause the menu cpuidle governor to select the deepest idle
> > state supplied by the cpuidle driver (intel_idle) much more often which added
> > significantly more idle state latency to the workload and that led to the
> > decrease of the critical-jOPS score.
> >
> > Interestingly enough, this problem was not visible when the teo cpuidle
> > governor was used instead of menu, so it appeared to be specific to the
> > latter. CPU wakeup event statistics collected while running the workload
> > indicated that the menu governor was effectively ignoring non-timer wakeup
> > information and all of its idle state selection decisions appeared to be
> > based on timer wakeups only. Thus, it appeared that the reduction of the
> > local timer interrupt rate caused the governor to predict a long idle
> > duration much more often while running the workload, and the deepest idle
> > state was selected significantly more often as a result of that.
> >
> > A subsequent inspection of the get_typical_interval() function in the menu
> > governor indicated that it might return UINT_MAX too often which then caused
> > the governor's decisions to be based entirely on information related to timers.
> >
> > Generally speaking, UINT_MAX is returned by get_typical_interval() if it
> > cannot make a prediction based on the most recent idle intervals data with
> > sufficiently high confidence, but at least in some cases this means that
> > useful information is not taken into account at all which may lead to
> > significant idle state selection mistakes. Moreover, this is not really
> > unlikely to happen.
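
The confidence test described above can be sketched roughly as follows. This is an illustrative C sketch of the stated thresholds (standard deviation within 1/6 of the average, or within 20 us when the average is near zero), not the actual menu.c code; the function name and interface are made up for illustration:

```c
#include <stdint.h>
#include <limits.h>

/*
 * Illustrative sketch only, not the kernel implementation: accept the
 * average of the recent idle intervals (in microseconds) as the
 * prediction only when the standard deviation is small enough;
 * otherwise return UINT_MAX to signal "no prediction".
 */
static unsigned int predict_or_give_up(uint64_t avg, uint64_t variance)
{
	/* stddev <= avg / 6  <=>  36 * variance <= avg * avg */
	if (avg * avg > variance * 36)
		return (unsigned int)avg;

	/* stddev <= 20 us  <=>  variance <= 400 us^2 */
	if (variance <= 400)
		return (unsigned int)avg;

	return UINT_MAX;	/* not confident enough */
}
```

When UINT_MAX comes back, the governor falls through to timer-based information only, which is the behavior the series addresses.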
> >
> > One issue with get_typical_interval() is that, when it eliminates outliers from
> > the sample set in an attempt to reduce the standard deviation (and so improve
> > the prediction confidence), it does that by dropping high-end samples only,
> > while samples at the low end of the set are retained. However, the samples
> > at the low end very well may be the outliers and they should be eliminated
> > from the sample set instead of the high-end samples. Accordingly, the
> > likelihood of making a meaningful idle duration prediction can be improved
> > by making it also eliminate low-end samples if they are farther from the
> > average than high-end samples. This is done in patch [4/5].
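
The two-sided elimination idea from patch [4/5] can be sketched as below. This is a rough illustration of the selection rule only (drop whichever end sample is farther from the average), with a hypothetical helper name, not the actual patch code:

```c
#include <stddef.h>

/*
 * Illustrative sketch, not the kernel code: instead of always dropping
 * the largest sample, drop the sample farthest from the average,
 * whichever end of the set it is on.  Returns the index of the sample
 * to eliminate.
 */
static size_t pick_outlier(const unsigned int *samples, size_t n)
{
	unsigned long long sum = 0;
	size_t i, imin = 0, imax = 0;

	for (i = 0; i < n; i++) {
		sum += samples[i];
		if (samples[i] < samples[imin])
			imin = i;
		if (samples[i] > samples[imax])
			imax = i;
	}

	unsigned int avg = (unsigned int)(sum / n);

	/* Eliminate the low-end sample if it is farther from the average. */
	return (avg - samples[imin] > samples[imax] - avg) ? imin : imax;
}
```

For example, with samples {1, 100, 110, 120} the low-end sample 1 is the outlier and gets dropped, whereas the old behavior would have discarded 120.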
> >
> > Another issue is that get_typical_interval() gives up after eliminating 1/4
> > of the samples if the standard deviation is still not as low as desired (within
> > 1/6 of the average or within 20 us if the average is close to 0), but the
> > remaining samples in the set still represent useful information at that point
> > and discarding them altogether may lead to suboptimal idle state selection.
> >
> > For instance, the largest idle duration value in the get_typical_interval()
> > data set is the maximum idle duration observed recently and it is likely that
> > the upcoming idle duration will not exceed it. Therefore, in the absence of
> > a better choice, this value can be used as an upper bound on the target
> > residency of the idle state to select. Patch [5/5] works along these lines,
> > but it takes the maximum data point remaining after the elimination of
> > outliers.
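
The fallback from patch [5/5] amounts to something like the following sketch. Again this is illustrative only, with an assumed helper name, not the patch itself:

```c
#include <stddef.h>

/*
 * Illustrative sketch, not the kernel implementation: when no confident
 * average can be computed, the largest sample remaining after outlier
 * elimination still bounds the likely idle duration from above, so it
 * can be used to cap the target residency of the idle state to select.
 */
static unsigned int idle_duration_upper_bound(const unsigned int *samples,
					      size_t n)
{
	unsigned int max = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (samples[i] > max)
			max = samples[i];

	return max;	/* upper bound on the upcoming idle duration */
}
```

The point is that even an unconfident sample set rules out idle states whose target residency exceeds this bound, which is strictly better than ignoring the samples altogether.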
> >
> > The first two patches in the series are straightforward cleanups (in fact,
> > the first patch is kind of reversed by patch [4/5], but it is there because
> > it can be applied without the latter) and patch [3/5] is a cosmetic change
> > made in preparation for the subsequent ones.
> >
> > This series turns out to restore the SPECjbb critical-jOPS metric on affected
> > systems to the level from before commit 0611a640e60a and it also happens to
> > increase its max-jOPS metric by around 3%.
> >
> > For easier reference/testing it is present in the git branch at
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/menu
> >
> > based on the cpuidle material that went into 6.14-rc1.
> >
> > If possible, please let me know if it works for you.
> >
> > Thanks!
> >
> >
> > [1] Link: https://www.spec.org/jbb2015/
> >
> >
> >
>
> Hi Rafael,
>
> I ran some tests using a cpuidle microbenchmark that I have [1]:
>
> The test uses a uniform distribution of idle durations, which means the past eight idle intervals approximately match
> the sleep duration. So, as anticipated, this patch set does not impact this case, and the behavior remains mostly
> consistent after applying the patches.
>
> Pipe based wakeup - Above %
> ----------------------------------------------------------------------
> | Sleep time (s) | Menu No Patch(%) | Menu Patch(%)| Teo No Patch (%) |
> |---------------- |------------------|--------------|------------------|
> | 5 | 0.00% | 0.00% | 0.00% |
> | 10 | 0.00% | 0.00% | 0.00% |
> ------------------------------------------------------------------------
> | 30 | 99.97% | 99.96% | 0.00% |
> | 100 | 98.61% | 98.67% | 0.00% |
> | 120 | 97.88% | 98.42% | 1.03% | -> (*)
> ------------------------------------------------------------------------
> | 130 | 2.82% | 1.26% | 0.22% |
> | 150 | 1.68% | 1.43% | 0.32% |
> | 200 | 2.09% | 1.91% | 0.15% |
> | 300 | 1.22% | 1.22% | 0.19% |
> | 400 | 1.20% | 1.21% | 0.19% |
> | 500 | 1.16% | 1.12% | 0.12% |
> ------------------------------------------------------------------------
>
> Pipe based wakeup - Below %
> -----------------------------------------------------------------------
> | Sleep time (s) | Menu No Patch(%) | Menu Patch(%)| Teo No Patch (%) |
> |---------------- |------------------|--------------|------------------|
> | 5 | 0.00% | 0.00% | 0.00% |
> | 10 | 0.00% | 0.00% | 0.00% |
> | 30 | 0.00% | 0.00% | 0.00% |
> | 100 | 0.00% | 0.00% | 0.00% |
> | 120 | 2.76% | 1.14% | 0.93% |
> | 130 | 3.11% | 1.13% | 0.12% |
> | 150 | 1.34% | 1.16% | 0.18% |
> | 200 | 1.38% | 1.15% | 0.09% |
> | 300 | 1.33% | 1.21% | 0.11% |
> | 400 | 1.36% | 1.25% | 0.10% |
> | 500 | 1.25% | 1.22% | 0.10% |
> |---------------- |------------------|--------------|------------------|
>
> Time based wakeup - Above %
> -----------------------------------------------------------------------
> | Sleep time (s) | Menu No Patch(%) | Menu Patch(%)| Teo No Patch (%) |
> |---------------- |------------------|--------------|------------------|
> | 5 | 0.00% | 0.00% | 0.00% |
> | 10 | 0.00% | 0.00% | 0.00% |
> | 30 | 0.00% | 0.00% | 0.00% |
> | 100 | 0.01% | 0.00% | 0.00% |
> | 120 | 0.00% | 0.00% | 0.15% |
> | 130 | 15.84% | 0.14% | 0.23% |
> | 150 | 0.39% | 0.23% | 0.48% |
> | 200 | 0.95% | 0.87% | 0.10% |
> | 300 | 0.20% | 0.17% | 0.15% |
> | 400 | 0.14% | 0.12% | 0.17% |
> | 500 | 0.10% | 0.20% | 0.11% |
> |---------------- |------------------|--------------|------------------|
>
> Time based wakeup - Below %
> -----------------------------------------------------------------------
> | Sleep time (s) | Menu No Patch(%) | Menu Patch(%)| Teo No Patch (%) |
> |---------------- |------------------|--------------|------------------|
> | 5 | 0.00% | 0.00% | 0.00% |
> | 10 | 0.00% | 0.00% | 0.00% |
> | 30 | 0.00% | 0.00% | 0.00% |
> | 100 | 0.00% | 0.00% | 0.00% |
> | 120 | 1.85% | 1.66% | 2.67% |
> | 130 | 16.71% | 1.13% | 1.11% |
> | 150 | 1.36% | 1.16% | 1.13% |
> | 200 | 1.33% | 1.14% | 1.19% |
> | 300 | 1.44% | 1.20% | 1.17% |
> | 400 | 1.51% | 1.21% | 1.21% |
> | 500 | 1.42% | 1.24% | 1.25% |
> |---------------- |------------------|--------------|------------------|
>
>
> (*) - The above and below percentages are high even without the patch with the
>       menu governor; this issue still persists, as previously reported in [2].
>       I will investigate further and submit a revision to get additional feedback.
>
> I also carried out some benchmarks using pgbench:
>
> pgbench Results
>
> Without Patch:
> -----------------------------------------------------------------------------
> | Run | Transactions Processed | Latency Avg (ms) | TPS (without init time) |
> -----------------------------------------------------------------------------
> | 1 | 11,936,327 | 0.050 | 198,946.141025 |
> | 2 | 11,899,540 | 0.050 | 198,333.097547 |
> | 3 | 11,870,792 | 0.051 | 197,853.728614 |
> | 4 | 11,901,670 | 0.050 | 198,368.526139 |
> | 5 | 11,922,046 | 0.050 | 198,708.112243 |
> -----------------------------------------------------------------------------
> | Avg | 11,906,075 | 0.0502 | 198,441.921114 |
> -----------------------------------------------------------------------------
>
> With Patch:
> -----------------------------------------------------------------------------
> | Run | Transactions Processed | Latency Avg (ms) | TPS (without init time) |
> -----------------------------------------------------------------------------
> | 1 | 12,052,865 | 0.050 | 200,888.492771 |
> | 2 | 12,058,359 | 0.050 | 200,979.895325 |
> | 3 | 12,071,012 | 0.050 | 201,190.809734 |
> | 4 | 12,054,646 | 0.050 | 200,918.076736 |
> | 5 | 12,053,087 | 0.050 | 200,892.045581 |
> -----------------------------------------------------------------------------
> | Avg | 12,058,394 | 0.0500 | 200,973.464029 |
> ------------------------------------------------------------------------------
>
> Performance Improvement After Patch:
> --------------------------------------------------------------------------------------
> | Metric | Without Patch (Avg.)| With Patch (Avg.) | % Improvement |
> --------------------------------------------------------------------------------------
> | Transactions Processed | 11,906,075 | 12,058,394 | +1.28% |
> | Latency Avg (ms) | 0.0502 | 0.0500 | -0.40% |
> | TPS (without init time) | 198,441.921114 | 200,973.464029 | +1.28% |
> --------------------------------------------------------------------------------------
>
> So with the patch, pgbench's "Transactions Processed" improves by ~1.28%, and I did not observe
> any major variation in other benchmarks with the patch.
>
> So for the entire series:
>
> Tested-by: Aboorva Devarajan <aboorvad@...ux.ibm.com>
Thank you!
> I'm also running a minimal unit fuzz test against the pre- and post-patch versions of the
> get_typical_interval() function to understand this better; I will post the results soon.
>
> [1] - https://github.com/AboorvaDevarajan/linux-utils/tree/main/cpuidle/cpuidle_wakeup
> [2] - https://lore.kernel.org/all/20240809073120.250974-1-aboorvad@linux.ibm.com/