linux-kernel - Re: [PATCH v2] Kconfig.hz: Change default HZ to 1000

Message-ID: <Z77CP01ZYdZ9rcZN@gpd3>
Date: Wed, 26 Feb 2025 08:26:55 +0100
From: Andrea Righi <arighi@...dia.com>
To: Qais Yousef <qyousef@...alina.io>
Cc: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Juri Lelli <juri.lelli@...hat.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	John Stultz <jstultz@...gle.com>,
	Saravana Kannan <saravanak@...gle.com>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Frederic Weisbecker <frederic@...nel.org>,
	Joel Fernandes <joelagnelf@...dia.com>,
	David Laight <david.laight.linux@...il.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] Kconfig.hz: Change default HZ to 1000

On Wed, Feb 26, 2025 at 12:08:09AM +0000, Qais Yousef wrote:
> The frequency at which TICK happens is very important from a scheduler
> perspective. There's a responsiveness trade-off, and for interactive
> systems the current default is set too low.
> 
> Historically it was set to 250 to address throughput and power concerns.
> But these issues should no longer be true. Throughput is more sensitive
> to base_slice which can be controlled with task sched_attr::runtime. And
> current state of NOHZ and RCU_LAZY should keep frequent TICKS from
> stopping CPUs staying in deep idle states to save power when the
> system doesn't have any activity.
> 
> Joel indicated that ChromeOS has seen power gains on x86 with HZ=1000.
> Andrea has done analysis at Ubuntu [1] which confirms that power is the
> same or better on x86 with no significant impact on performance.
> Phoronix has also conducted an experiment that shows performance is
> better in a number of use cases and slightly lower in others with no
> significant power impact [2]. Testing on an Android environment shows that
> the UI pipeline can have 54% and 13% fewer missed frames at 6.67% power
> cost due to increased responsiveness of the util signal as explained below.
> 
> Generally having a slow TICK frequency can lead to the following
> shortcomings in scheduler decisions:
> 
> 1. Imprecise time slice
> -----------------------
> 
> Preemption checks occur when a new task wakes up, on return from
> interrupt or at TICK. If we have N tasks running on the same CPU then as
> a worst case scenario these tasks will time slice every TICK regardless
> of their actual slice size.
> 
> By default base_slice ends up being 3ms on many systems. But due to TICK
> being 4ms by default, tasks will end up slicing every 4ms instead in
> busy scenarios.  It also makes reducing the base_slice to a lower value
> like 2ms or 1ms largely pointless. Reducing it will still allow newly
> waking tasks to preempt sooner, but the 4ms TICK will prevent timely
> cycling of tasks in busy scenarios, which are important and frequent.
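
The slice arithmetic above can be sketched in a few lines. This is only a
simplified model, not kernel code: it assumes a busy CPU where slice expiry
is noticed only at the next TICK and that a slice starts on a TICK boundary.
The function name effective_slice_ms is made up for illustration.

```python
import math

def effective_slice_ms(base_slice_ms: float, hz: int) -> float:
    """Worst-case slice length when expiry is only checked at TICK.

    Simplified model (assumption): the slice starts on a tick boundary
    and nothing else (wakeups, interrupts) triggers a preemption check.
    """
    tick_ms = 1000.0 / hz
    # The slice is rounded up to a whole number of ticks.
    ticks = math.ceil(base_slice_ms / tick_ms)
    return ticks * tick_ms

for hz in (250, 1000):
    for base in (3.0, 2.0, 1.0):
        print(f"HZ={hz} base_slice={base}ms -> "
              f"{effective_slice_ms(base, hz)}ms")
```

Under this model every base_slice of 4ms or less collapses to 4ms at
HZ=250, while at HZ=1000 the 3ms, 2ms and 1ms values are all honoured.
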
> 
> 2. Delayed load_balance()
> -------------------------
> 
> Scheduler task placement decision at wake up can easily become stale as
> more tasks wake up. load_balance() is the correction point to ensure the
> system is loaded optimally. And in the case of HMP systems tasks are
> migrated to a bigger CPU to meet their compute demand.
> 
> Newidle balance can help alleviate the problem. But the worst case
> scenario is for the TICK to trigger the load_balance().
> 
> 3. Delayed stats update
> -----------------------
> 
> And subsequently delayed cpufreq updates and misfit detection (the need
> to move a task from little CPU to a big CPU in HMP systems).
> 
> When a task is busy then as a worst case scenario the util signal will
> update every TICK. Since util signal is the main driver for our
> preferred governor - schedutil - and what drives EAS to decide if
> a task fits a CPU or needs to migrate to a bigger CPU, these delays can
> be detrimental to system responsiveness.
> 
> ------------------------------------------------------------------------
> 
> Note that the worst case scenario is an important and defining
> characteristic for interactive systems. It's all about the P90 and P95.
> Responsiveness IMHO is no longer a characteristic of desktop systems only.
> Modern hardware and workloads are interactive generally and need better
> latencies. To my knowledge even servers run mixed workloads and serve
> a lot of users interactively.
> 
> On Android and desktop systems, 120Hz is a common screen
> configuration. This gives tasks an 8ms deadline to do their work. 4ms is
> half this time, which puts more pressure than necessary on making
> exactly the right decision at wake up. And it makes utilizing the system
> effectively to maintain best perf/watt harder. As an example [3] tries
> to fix our definition of DVFS headroom to be a function of TICK as it
> defines our worst case scenario of updating stats. The larger TICK means
> we have to be overly aggressive in going into higher frequencies if we
> want to ensure perf is not impacted. But if the task didn't consume all
> of its slice, we lost an opportunity to use a lower frequency and save
> power. Lower TICK value allows us to be smarter about our resource
> allocation to balance perf and power.
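
To put numbers on the frame-deadline argument, the share of one display
frame that a single TICK period takes up can be computed directly. This is
a rough sketch of the arithmetic only; tick_share_of_frame is a
hypothetical helper, and the exact 120Hz frame time is about 8.33ms rather
than the rounded 8ms used above.

```python
def tick_share_of_frame(refresh_hz: int, hz: int) -> float:
    """Fraction of one display frame covered by one TICK period."""
    frame_ms = 1000.0 / refresh_hz   # time budget per frame
    tick_ms = 1000.0 / hz            # worst-case gap between stats updates
    return tick_ms / frame_ms

for hz in (250, 1000):
    share = tick_share_of_frame(120, hz)
    print(f"HZ={hz}: one TICK is {share:.0%} of a 120Hz frame")
```

With HZ=250 a single TICK is close to half of the frame budget, while with
HZ=1000 it is roughly an eighth, which leaves more room to correct a stale
frequency or placement decision within the same frame.
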
> 
> Generally workloads working with ever smaller deadlines is not unique to
> UI pipeline. Everything is expected to finish work sooner and be more
> responsive.
> 
> As pointed out to me by Saravana though, the longer TICK did indirectly
> help by delaying timer triggers, which means it could hide issues with
> drivers/tasks asking for frequent timers that prevent entry to deeper idle
> states (4ms is long enough to allow entry to deeper idle states on many
> systems). But one can argue this is a problem with these drivers/tasks.
> And if the delayed trigger behavior is desired we can make it
> intentional rather than accidental.
> 
> The faster TICK might still result in higher power, but not due to TICK
> activities. The impact is more prominent with schedutil governor. The system
> is more responsive (as intended) and it is expected the residencies in higher
> freqs would be higher as they were accidentally being stuck at lower freqs. The
> series in [3] attempts to improve scheduler handling of responsiveness and give
> users/apps a way to better provide/get their needs.
> 
> Since the default behavior might end up on many unwary users, ensure it
> matches what modern systems and workloads expect given that our NOHZ has
> moved a long way to keep TICKS tamed in idle scenarios.
> 
> Noteworthy that some folks reported that PREEMPT_LAZY helps undo the
> slight throughput loss in some benchmarks.
> 
> [1] https://discourse.ubuntu.com/t/enable-low-latency-features-in-the-generic-ubuntu-kernel-for-24-04/42255
> [2] https://www.phoronix.com/news/Linux-250Hz-1000Hz-Kernel-2025
> [3] https://lore.kernel.org/lkml/20240820163512.1096301-6-qyousef@layalina.io/
> 
> Acked-by: Joel Fernandes <joelagnelf@...dia.com>
> Acked-by: Vincent Guittot <vincent.guittot@...aro.org>
> Signed-off-by: Qais Yousef <qyousef@...alina.io>

FWIW, since I proposed the same in the Ubuntu generic kernel:

Acked-by: Andrea Righi <arighi@...dia.com>

Thanks,
-Andrea

> ---
> 
> Changes in v2:
> 	* Update commit message to include some data
> 	* Add Acked-bys
> 
>  kernel/Kconfig.hz | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz
> index 38ef6d06888e..c742c9298af3 100644
> --- a/kernel/Kconfig.hz
> +++ b/kernel/Kconfig.hz
> @@ -5,7 +5,7 @@
>  
>  choice
>  	prompt "Timer frequency"
> -	default HZ_250
> +	default HZ_1000
>  	help
>  	 Allows the configuration of the timer frequency. It is customary
>  	 to have the timer interrupt run at 1000 Hz but 100 Hz may be more
> -- 
> 2.34.1
>
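
The patch only changes the Kconfig default, so existing configs keep
whatever tick rate they already select. A minimal sketch for checking which
rate a given config uses, assuming the config text has already been read
into a string (on many distributions it lives in /boot/config-<release>, or
in /proc/config.gz when CONFIG_IKCONFIG_PROC is enabled). The helper name
configured_hz and the sample text are illustrative only.

```python
import re

def configured_hz(config_text: str):
    """Return the CONFIG_HZ value from kernel config text, or None."""
    # Anchor on the whole line so CONFIG_HZ_250=y etc. are not matched.
    m = re.search(r"^CONFIG_HZ=(\d+)$", config_text, flags=re.MULTILINE)
    return int(m.group(1)) if m else None

# Illustrative sample of what a config built with the old default contains.
sample = """\
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
"""

print(configured_hz(sample))
```
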