linux-kernel - Re: [PATCH v1 1/4] cpuidle: teo: Add polling flag check to early return path

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <12587057.O9o76ZdvQC@rjwysocki.net>
Date: Fri, 10 Jan 2025 18:23:49 +0100
From: "Rafael J. Wysocki" <rjw@...ysocki.net>
To: "Rafael J. Wysocki" <rafael@...nel.org>,
 Christian Loehle <christian.loehle@....com>
Cc: Linux PM <linux-pm@...r.kernel.org>,
 Aboorva Devarajan <aboorvad@...ux.ibm.com>,
 LKML <linux-kernel@...r.kernel.org>,
 Daniel Lezcano <daniel.lezcano@...aro.org>,
 Artem Bityutskiy <artem.bityutskiy@...ux.intel.com>
Subject:
 Re: [PATCH v1 1/4] cpuidle: teo: Add polling flag check to early return path

On Friday, January 10, 2025 3:52:09 PM CET Christian Loehle wrote:
> On 1/10/25 13:34, Rafael J. Wysocki wrote:
> > On Fri, Jan 10, 2025 at 2:16 PM Christian Loehle
> > <christian.loehle@....com> wrote:
> >>
> >> On 1/10/25 12:53, Rafael J. Wysocki wrote:
> >>> From: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> >>>
> >>> After commit 6da8f9ba5a87 ("cpuidle: teo: Skip tick_nohz_get_sleep_length()
> >>> call in some cases") the teo governor behaves a bit differently on
> >>> systems where idle state 0 is a "polling" state (that is, it is not
> >>> really an idle state, but a loop continuously executed by the CPU).
> >>> Namely, on such systems it skips the tick_nohz_get_sleep_length() call
> >>> if the target residency of the current candidate idle state is small
> >>> enough.
> >>>
> >>> However, if state 0 itself was to be returned, it would be returned
> >>> right away without calling tick_nohz_get_sleep_length() even on systems
> >>> where it was not a "polling" state until commit 4b20b07ce72f ("cpuidle:
> >>> teo: Don't count non-existent intercepts") that attempted to fix this
> >>> problem.
> >>>
> >>> Unfortunately, commit 4b20b07ce72f has made the governor always call
> >>> tick_nohz_get_sleep_length() when about to return state 0 early, even
> >>> if that state is a "polling" one, which is inconsistent and defeats
> >>> the purpose of commit 6da8f9ba5a87 in that case.
> >>>
> >>> Address this by adding a CPUIDLE_FLAG_POLLING check to the path where
> >>> state 0 is returned early to prevent tick_nohz_get_sleep_length() from
> >>> being called if it is a "polling" state.
> >>>
> >>> Fixes: 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts")
> >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> >>> ---
> >>>  drivers/cpuidle/governors/teo.c |    3 ++-
> >>>  1 file changed, 2 insertions(+), 1 deletion(-)
> >>>
> >>> --- a/drivers/cpuidle/governors/teo.c
> >>> +++ b/drivers/cpuidle/governors/teo.c
> >>> @@ -422,7 +422,8 @@
> >>>                       first_suitable_idx = i;
> >>>               }
> >>>       }
> >>> -     if (!idx && prev_intercept_idx) {
> >>> +     if (!idx && prev_intercept_idx &&
> >>> +         !(drv->states[0].flags & CPUIDLE_FLAG_POLLING)) {
> >>>               /*
> >>>                * We have to query the sleep length here otherwise we don't
> >>>                * know after wakeup if our guess was correct.
> >>>
> >>>
> >>>
> >>
> >> But then you do run into the issue of intercepts not being detected if
> >> state0 is the right choice, don't you?
> > 
> > That's true, but then on systems with a "polling" state 0 you still
> > have this problem if the state returned early is not state 0.  Say C1
> > on x86.> 
> > The point here is that the behavior needs to be consistent, one way or another.
> 
> Yes, gotcha. Why not be consistent 'in the other way' then?
> 
> > 
> > The exact point of commit 6da8f9ba5a87 was to avoid calling
> > tick_nohz_get_sleep_length() in some cases when the state to be
> > returned is shallow enough and obviously that includes a "polling"
> > state 0, possibly at the cost of being somewhat inaccurate in
> > prediction.
> 
> Somewhat inaccurate meaning not making any prediction?
> cpu_data->sleep_length_ns = KTIME_MAX;
> 
> How much is the harm for calling tick_nohz_get_sleep_length() when
> polling anyway?
> I know tick_nohz_get_sleep_length() is the majority of the usual
> cpuidle entry path, but for many scenarios where state0 is appropriate
> that should be pretty fast, no?
> 
> > 
> > Then you're seeing this intercept accumulation for state 0 when there
> > are only 2 states in the table (or all of the other states are much
> > higher target residency than state 0).
> > 
> > Commit 4b20b07ce72f effectively caused tick_nohz_get_sleep_length() to
> > be called every time on systems without a "polling" state 0, which was
> > fair enough, but it also affected the other systems, which wasn't.
> > 
> >> This would then enable intercept-detection only for <50% of the time,
> >> another option is to not allow intercepts selecting a polling state, but
> >> there were recent complaints about this exact behavior from Aboorva (+TO).
> >> They don't have a low-latency non-polling state.
> >>
> >> https://lore.kernel.org/lkml/20240809073120.250974-1-aboorvad@linux.ibm.com/
> > 
> > If they don't have a "polling" state 0, they won't be affected by this
> > patch and after commit 4b20b07ce72f, they'll always call
> > tick_nohz_get_sleep_length(), so the current governor behavior is
> > generally unsuitable for them.
> 
> They do though.
> commit 5ddcc03a07ae ("powerpc/cpuidle: Set CPUIDLE_FLAG_POLLING for snooze state")
> So they have a polling 'snooze' and a relatively high latency (hundreds usecs)
> non-polling state and no deeper state.
> So if they don't query sleep length on snooze on a (1us)-interrupt-wakeup heavy
> workload they will get 50% state0 and 50% state1 (because intercepts recovered
> due to not querying sleep length).
> 
> > 
> > I have an idea how to change it to be more accurate in prediction, but
> > we'll see how it goes.  Stay tuned.
> 
> I will.

So with the appended change applied, there is no difference between "hits" and
"intercepts" any more, they all are just "events" now, so the sleep length value
should not matter unless it is needed to refine the state selection.

Of course, the whole governor can be simplified significantly on top of it.

---
 drivers/cpuidle/governors/teo.c |   98 +++++++++++++++++++---------------------
 1 file changed, 48 insertions(+), 50 deletions(-)

--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -34,12 +34,17 @@
  *
  * The computations carried out by this governor are based on using bins whose
  * boundaries are aligned with the target residency parameter values of the CPU
- * idle states provided by the %CPUIdle driver in the ascending order.  That is,
- * the first bin spans from 0 up to, but not including, the target residency of
- * the second idle state (idle state 1), the second bin spans from the target
- * residency of idle state 1 up to, but not including, the target residency of
- * idle state 2, the third bin spans from the target residency of idle state 2
- * up to, but not including, the target residency of idle state 3 and so on.
+ * idle states provided by the %CPUIdle driver in ascending order, and a given
+ * bin is used for counting events for which the measured idle duration is
+ * between its boundaries.  Specifically, the first bin spans from 0 up to, but
+ * not including, the target residency of the second idle state (idle state 1),
+ * so it counts events for which the measured idle duration is greater than or
+ * equal to 0 and less than the target residency of state 1.  The second bin
+ * spans from the target residency of idle state 1 up to, but not including, the
+ * target residency of idle state 2, the third bin spans from the target
+ * residency of idle state 2 up to, but not including, the target residency of
+ * idle state 3 and so on.
+ *
  * The last bin spans from the target residency of the deepest idle state
  * supplied by the driver to the scheduler tick period length or to infinity if
  * the tick period length is less than the targer residency of that state.  In
@@ -47,47 +52,31 @@
  * duration between the tick period length and the target residency of the
  * deepest idle state.
  *
- * Two metrics called "hits" and "intercepts" are associated with each bin.
- * They are updated every time before selecting an idle state for the given CPU
- * in accordance with what happened last time.
- *
- * The "hits" metric reflects the relative frequency of situations in which the
- * sleep length and the idle duration measured after CPU wakeup fall into the
- * same bin (that is, the CPU appears to wake up "on time" relative to the sleep
- * length).  In turn, the "intercepts" metric reflects the relative frequency of
- * non-timer wakeup events for which the measured idle duration falls into a bin
- * that corresponds to an idle state shallower than the one whose bin is fallen
- * into by the sleep length (these events are also referred to as "intercepts"
- * below).
- *
  * In order to select an idle state for a CPU, the governor takes the following
  * steps (modulo the possible latency constraint that must be taken into account
  * too):
  *
  * 1. Find the deepest enabled CPU idle state (the candidate idle state) and
- *    compute 2 sums as follows:
- *
- *    - The sum of the "hits" metric for all of the idle states shallower than
- *      the candidate one (it represents the cases in which the CPU was likely
- *      woken up by a timer).
- *
- *    - The sum of the "intercepts" metric for all of the idle states shallower
- *      than the candidate one (it represents the cases in which the CPU was
- *      likely woken up by a non-timer wakeup source).
- *
- * 2. If the second sum computed in step 1 is greater than a half of the sum of
- *    both mertics for the candidate state bin and all subsequent bins(if any),
- *    a shallower idle state is likely to be more suitable, so look for it.
+ *    compute a sum of events for all of the idle states shallower than the
+ *    candidate one (it represents the cases in which the candidate idle state
+ *    would have been to deep for the measured idle duration).
+ *
+ * 2. If the sum computed in step 1 is greater than 1/2 of the sum of all
+ *    events, the candidate state would have been too deep for the measured idle
+ *    duration in over 1/2 of all cases, so look for a shallower idle state to
+ *    replace it.
  *
  *    - Traverse the enabled idle states shallower than the candidate one in the
  *      descending order.
  *
- *    - For each of them compute the sum of the "intercepts" metrics over all
- *      of the idle states between it and the candidate one (including the
- *      former and excluding the latter).
- *
- *    - If this sum is greater than a half of the second sum computed in step 1,
- *      use the given idle state as the new candidate one.
+ *    - In each step, compute the sum of events for the given state and all of
+ *      the subsequent bins, including the events with the measured idle
+ *      duration between the tick period length and the target residency of the
+ *      deepest idle state (if the former is less than the latter).
+ *
+ *    - If this sum is greater than 1/2 of the sum of all events. the given
+ *      state would have been shallow enough for the measured idle duration in
+ *      over 1/2 of all cases, so it can be used as the new candidate one.
  *
  * 3. If the current candidate state is state 0 or its target residency is short
  *    enough, return it and prevent the scheduler tick from being stopped.
@@ -288,6 +277,7 @@
 	unsigned int tick_intercept_sum = 0;
 	unsigned int idx_intercept_sum = 0;
 	unsigned int intercept_sum = 0;
+	unsigned int idx_event_sum = 0;
 	unsigned int idx_hit_sum = 0;
 	unsigned int hit_sum = 0;
 	int constraint_idx = 0;
@@ -361,15 +351,18 @@
 
 	tick_intercept_sum = intercept_sum +
 			cpu_data->state_bins[drv->state_count-1].intercepts;
-
 	/*
-	 * If the sum of the intercepts metric for all of the idle states
-	 * shallower than the current candidate one (idx) is greater than the
-	 * sum of the intercepts and hits metrics for the candidate state and
-	 * all of the deeper states, a shallower idle state is likely to be a
-	 * better choice.
+	 * Compute the sum of both event metrics for all idle states shallower
+	 * than the candidate one (idx).
+	 */
+	idx_event_sum = idx_intercept_sum + idx_hit_sum;
+	/*
+	 * If this sum is greater than 1/2 of total events, the current
+	 * candndate state would have been to deep for the measured idle
+	 * duration in over 50% of recent cases, so look for a shallower one.
 	 */
-	if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) {
+	if (2 * idx_event_sum > cpu_data->total) {
+		unsigned int event_sum = cpu_data->total - idx_event_sum;
 		int first_suitable_idx = idx;
 
 		/*
@@ -380,14 +373,19 @@
 		 * Take the possible duration limitation present if the tick
 		 * has been stopped already into account.
 		 */
-		intercept_sum = 0;
-
 		for (i = idx - 1; i >= 0; i--) {
 			struct teo_bin *bin = &cpu_data->state_bins[i];
 
-			intercept_sum += bin->intercepts;
-
-			if (2 * intercept_sum > idx_intercept_sum) {
+			/*
+			 * Add both metrics for this state to the total sum of
+			 * metrics for all of the subsequent bins.
+			 */
+			event_sum += bin->intercepts + bin->hits;
+			/*
+			 * If this sum is greater than 1/2 of total events,
+			 * replace the previous candidate state with this one.
+			 */
+			if (2 * event_sum > cpu_data->total) {
 				/*
 				 * Use the current state unless it is too
 				 * shallow or disabled, in which case take the