linux-kernel - Re: [PATCH v1 7/9] cpuidle: teo: Skip getting the sleep length is wakeups are very frequent

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f1f6fe0e-6d6f-442b-87d2-dd15d2e022bc@arm.com>
Date: Mon, 20 Jan 2025 12:08:10 +0000
From: Christian Loehle <christian.loehle@....com>
To: "Rafael J. Wysocki" <rjw@...ysocki.net>,
 Linux PM <linux-pm@...r.kernel.org>
Cc: LKML <linux-kernel@...r.kernel.org>,
 Daniel Lezcano <daniel.lezcano@...aro.org>,
 Artem Bityutskiy <artem.bityutskiy@...ux.intel.com>
Subject: Re: [PATCH v1 7/9] cpuidle: teo: Skip getting the sleep length is
 wakeups are very frequent

On 1/13/25 18:48, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> 

Title has a typo: s/is/if

> Commit 6da8f9ba5a87 ("cpuidle: teo: Skip tick_nohz_get_sleep_length()
> call in some cases") attempted to reduce the governor overhead in some
> cases by making it avoid obtaining the sleep length (the time till the
> next timer event) which may be costly.
> 
> Among other things, after the above commit, tick_nohz_get_sleep_length()
> was not called any more when idle state 0 was to be returned, which
> turned out to be problematic and the previous behavior in that respect
> was restored by commit 4b20b07ce72f ("cpuidle: teo: Don't count non-
> existent intercepts").
> 
> However, commit 6da8f9ba5a87 also caused the governor to avoid calling
> tick_nohz_get_sleep_length() on systems where idle state 0 is a "polling"
> one (that is, it is not really an idle state, but a loop continuously
> executed by the CPU) when the target residency of the idle state to be
> returned was low enough, so there was no practical need to refine the
> idle state selection in any way.  This change was not removed by the
> other commit, so now on systems where idle state 0 is a "polling" one,
> tick_nohz_get_sleep_length() is called when idle state 0 is to be
> returned, but it is not called when a deeper idle state with
> sufficiently low target residency is to be returned.  That is arguably
> confusing and inconsistent.
> 
> Moreover, there is no specific reason why the behavior in question
> should depend on whether or not idle state 0 is a "polling" one.
> 
> One way to address this would be to make the governor always call
> tick_nohz_get_sleep_length() to obtain the sleep length, but that would
> effectively mean reverting commit 6da8f9ba5a87 and restoring the latency
> issue that was the reason for doing it.  This approach is thus not
> particularly attractive.
> 
> To address it differently, notice that if a CPU is woken up very often,
> this is not likely to be caused by timers in the first place (user space
> has a default timer slack of 50 us and there are relatively few timers
> with a deadline shorter than several microseconds in the kernel) and
> even if it were the case, the potential benefit from using a deep idle
> state would then be questionable for latency reasons.  Therefore, if the
> majority of CPU wakeups occur within several microseconds, it can be
> assumed that all wakeups in that range are non-timer and the sleep
> length need not be determined.
> 
> Accordingly, introduce a new metric for counting wakeups with the
> measured idle duration below RESIDENCY_THRESHOLD_NS and modify the idle
> state selection to skip the tick_nohz_get_sleep_length() invocation if
> idle state 0 has been selected or the target residency of the candidate
> idle state is below RESIDENCY_THRESHOLD_NS and the value of the new
> metric is at least 1/2 of the total event count.
> 
> Since the above requires the measured idle duration to be determined
> every time, except for the cases when one of the safety nets has
> triggered in which the wakeup is counted as a hit in the deepest
> idle state idle residency range, update the handling of those cases
> to avoid skipping the idle duration computation when the CPU wakeup
> is "genuine".
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> ---
>  drivers/cpuidle/governors/teo.c |   58 ++++++++++++++++++++++++----------------
>  1 file changed, 36 insertions(+), 22 deletions(-)
> 
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -129,6 +129,7 @@
>   * @state_bins: Idle state data bins for this CPU.
>   * @total: Grand total of the "intercepts" and "hits" metrics for all bins.
>   * @tick_intercepts: "Intercepts" before TICK_NSEC.
> + * @short_idle: Wakeups after short idle periods.
>   */
>  struct teo_cpu {
>  	s64 time_span_ns;
> @@ -136,6 +137,7 @@
>  	struct teo_bin state_bins[CPUIDLE_STATE_MAX];
>  	unsigned int total;
>  	unsigned int tick_intercepts;
> +	unsigned int short_idle;

Maybe call these short_idles?

>  };
>  
>  static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
> @@ -152,12 +154,12 @@
>  	s64 target_residency_ns;
>  	u64 measured_ns;
>  
> -	if (cpu_data->time_span_ns >= cpu_data->sleep_length_ns) {
> +	cpu_data->short_idle -= cpu_data->short_idle >> DECAY_SHIFT;
> +
> +	if (cpu_data->time_span_ns < 0) {
>  		/*
> -		 * This causes the wakeup to be counted as a hit regardless of
> -		 * regardless of the real idle duration which doesn't need to be
> -		 * computed because the wakeup has been close enough to an
> -		 * anticipated timer.
> +		 * If one of the safety nets has triggered, assume that this
> +		 * might have been a long sleep.
>  		 */
>  		measured_ns = U64_MAX;
>  	} else {
> @@ -177,10 +179,14 @@
>  		 * time, so take 1/2 of the exit latency as a very rough
>  		 * approximation of the average of it.
>  		 */
> -		if (measured_ns >= lat_ns)
> +		if (measured_ns >= lat_ns) {
>  			measured_ns -= lat_ns / 2;
> -		else
> +			if (measured_ns < RESIDENCY_THRESHOLD_NS)
> +				cpu_data->short_idle += PULSE;
> +		} else {
>  			measured_ns /= 2;
> +			cpu_data->short_idle += PULSE;
> +		}
>  	}
>  
>  	cpu_data->total = 0;
> @@ -419,27 +425,35 @@
>  	if (idx > constraint_idx)
>  		idx = constraint_idx;
>  
> -	if (!idx) {
> -		/*
> -		 * Query the sleep length to be able to count the wakeup as a
> -		 * hit if it is caused by a timer.
> -		 */
> -		cpu_data->sleep_length_ns = tick_nohz_get_sleep_length(&delta_tick);
> -		goto out_tick;
> -	}
> -
>  	/*
> -	 * If state 0 is a polling one, check if the target residency of
> -	 * the current candidate state is low enough and skip the timers
> -	 * check in that case too.
> +	 * If either the candidate state is state 0 or its target residency is
> +	 * low enough, there is basically nothing more to do, but if the sleep
> +	 * length is not updated, the subsequent wakeup will be counted as an
> +	 * "intercept" which may be problematic in the cases when timer wakeups
> +	 * are dominant.  Namely, it may effectively prevent deeper idle states
> +	 * from being selected at one point even if no imminent timers are
> +	 * scheduled.
> +	 *
> +	 * However, frequent timers in the RESIDENCY_THRESHOLD_NS range on one
> +	 * CPU are unlikely (user space has a default 50 us slack value for
> +	 * hrtimers and there are relatively few timers with a lower deadline
> +	 * value in the kernel), and even if they did happene, the potential

s/happene/happen

> +	 * benefit from using a deep idle state in that case would be
> +	 * questionable anyway for latency reasons.  Thus if the measured idle
> +	 * duration falls into that range in the majority of cases, assume
> +	 * non-timer wakeups to be dominant and skip updating the sleep length
> +	 * to reduce latency.
>  	 */
> -	if ((drv->states[0].flags & CPUIDLE_FLAG_POLLING) &&
> -	    drv->states[idx].target_residency_ns < RESIDENCY_THRESHOLD_NS)
> +	if ((!idx || drv->states[idx].target_residency_ns < RESIDENCY_THRESHOLD_NS) &&
> +	    2 * cpu_data->short_idle >= cpu_data->total)
>  		goto out_tick;
>  
>  	duration_ns = tick_nohz_get_sleep_length(&delta_tick);
>  	cpu_data->sleep_length_ns = duration_ns;
>  
> +	if (!idx)
> +		goto out_tick;
> +
>  	/*
>  	 * If the closest expected timer is before the target residency of the
>  	 * candidate state, a shallower one needs to be found.
> @@ -501,7 +515,7 @@
>  	if (dev->poll_time_limit ||
>  	    (tick_nohz_idle_got_tick() && cpu_data->sleep_length_ns > TICK_NSEC)) {
>  		dev->poll_time_limit = false;
> -		cpu_data->time_span_ns = cpu_data->sleep_length_ns;
> +		cpu_data->time_span_ns = KTIME_MIN;
>  	} else {
>  		cpu_data->time_span_ns = local_clock() - cpu_data->time_span_ns;
>  	}
> 

Thanks, I like this approach.
Reviewed-by: Christian Loehle <christian.loehle@....com>