[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJZ5v0hob-+gv5UjXCaS1fQ2kn5u_WU_x+kkMiEXiisnt91bZQ@mail.gmail.com>
Date: Wed, 19 Feb 2025 13:06:37 +0100
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Christian Loehle <christian.loehle@....com>
Cc: "Rafael J. Wysocki" <rjw@...ysocki.net>, Linux PM <linux-pm@...r.kernel.org>, dsmythies@...us.net,
LKML <linux-kernel@...r.kernel.org>, Daniel Lezcano <daniel.lezcano@...aro.org>,
Artem Bityutskiy <artem.bityutskiy@...ux.intel.com>,
Aboorva Devarajan <aboorvad@...ux.ibm.com>
Subject: Re: [RFT][PATCH v1 0/5] cpuidle: menu: Avoid discarding useful
information when processing recent idle intervals
On Tue, Feb 18, 2025 at 10:17 PM Christian Loehle
<christian.loehle@....com> wrote:
>
> On 2/10/25 14:15, Christian Loehle wrote:
> > On 2/6/25 14:21, Rafael J. Wysocki wrote:
> >> Hi Everyone,
> >>
> >> This work had been triggered by a report that commit 0611a640e60a ("eventpoll:
> >> prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of
> >> the SPECjbb 2015 benchmark [1] to drop by around 50% even though it generally
> >> reduced kernel overhead. Indeed, it was found during further investigation
> >> that the total interrupt rate while running the SPECjbb workload had fallen as
> >> a result of that commit by 55% and the local timer interrupt rate had fallen by
> >> almost 80%.
> >>
> >> That turned out to cause the menu cpuidle governor to select the deepest idle
> >> state supplied by the cpuidle driver (intel_idle) much more often which added
> >> significantly more idle state latency to the workload and that led to the
> >> decrease of the critical-jOPS score.
> >>
> >> Interestingly enough, this problem was not visible when the teo cpuidle
> >> governor was used instead of menu, so it appeared to be specific to the
> >> latter. CPU wakeup event statistics collected while running the workload
> >> indicated that the menu governor was effectively ignoring non-timer wakeup
> >> information and all of its idle state selection decisions appeared to be
> >> based on timer wakeups only. Thus, it appeared that the reduction of the
> >> local timer interrupt rate caused the governor to predict a idle duration
> >> much more often while running the workload and the deepest idle state was
> >> selected significantly more often as a result of that.
> >>
> >> A subsequent inspection of the get_typical_interval() function in the menu
> >> governor indicated that it might return UINT_MAX too often which then caused
> >> the governor's decisions to be based entirely on information related to timers.
> >>
> >> Generally speaking, UINT_MAX is returned by get_typical_interval() if it
> >> cannot make a prediction based on the most recent idle intervals data with
> >> sufficiently high confidence, but at least in some cases this means that
> >> useful information is not taken into account at all which may lead to
> >> significant idle state selection mistakes. Moreover, this is not really
> >> unlikely to happen.
> >>
> >> One issue with get_typical_interval() is that, when it eliminates outliers from
> >> the sample set in an attempt to reduce the standard deviation (and so improve
> >> the prediction confidence), it does that by dropping high-end samples only,
> >> while samples at the low end of the set are retained. However, the samples
> >> at the low end very well may be the outliers and they should be eliminated
> >> from the sample set instead of the high-end samples. Accordingly, the
> >> likelihood of making a meaningful idle duration prediction can be improved
> >> by making it also eliminate low-end samples if they are farther from the
> >> average than high-end samples. This is done in patch [4/5].
> >>
> >> Another issue is that get_typical_interval() gives up after eliminating 1/4
> >> of the samples if the standard deviation is still not as low as desired (within
> >> 1/6 of the average or within 20 us if the average is close to 0), but the
> >> remaining samples in the set still represent useful information at that point
> >> and discarding them altogether may lead to suboptimal idle state selection.
> >>
> >> For instance, the largest idle duration value in the get_typical_interval()
> >> data set is the maximum idle duration observed recently and it is likely that
> >> the upcoming idle duration will not exceed it. Therefore, in the absence of
> >> a better choice, this value can be used as an upper bound on the target
> >> residency of the idle state to select. Patch [5/5] works along these lines,
> >> but it takes the maximum data point remaining after the elimination of
> >> outliers.
> >>
> >> The first two patches in the series are straightforward cleanups (in fact,
> >> the first patch is kind of reversed by patch [4/5], but it is there because
> >> it can be applied without the latter) and patch [3/5] is a cosmetic change
> >> made in preparation for the subsequent ones.
> >>
> >> This series turns out to restore the SPECjbb critical-jOPS metric on affected
> >> systems to the level from before commit 0611a640e60a and it also happens to
> >> increase its max-jOPS metric by around 3%.
> >>
> >> For easier reference/testing it is present in the git branch at
> >>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/menu
> >>
> >> based on the cpuidle material that went into 6.14-rc1.
> >>
> >> If possible, please let me know if it works for you.
> >>
> >> Thanks!
> >>
> >>
> >> [1] Link: https://www.spec.org/jbb2015/
>
> Another dump for x86 idle this time, tldr: no worrying idle/power numbers
[cut]
> Significantly improving idle miss belows.
>
> I'll do the Android tests, but that is very unlikely to show something this
> doesn't (there's only one non-WFI idle state and most workloads are intercept
> heavy, so if anything menu-5 should improve the overall situation.)
> Feel free to already add:
> Tested-by: Christian Loehle <christian.loehle@....com>
Thank you!
Powered by blists - more mailing lists