Message-ID: <Z77CP01ZYdZ9rcZN@gpd3>
Date: Wed, 26 Feb 2025 08:26:55 +0100
From: Andrea Righi <arighi@...dia.com>
To: Qais Yousef <qyousef@...alina.io>
Cc: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
John Stultz <jstultz@...gle.com>,
Saravana Kannan <saravanak@...gle.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Frederic Weisbecker <frederic@...nel.org>,
Joel Fernandes <joelagnelf@...dia.com>,
David Laight <david.laight.linux@...il.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andrew Morton <akpm@...ux-foundation.org>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] Kconfig.hz: Change default HZ to 1000

On Wed, Feb 26, 2025 at 12:08:09AM +0000, Qais Yousef wrote:
> The frequency at which the TICK fires matters a great deal from the
> scheduler's perspective. There is a responsiveness trade-off, and for
> interactive systems the current default is set too low.
>
> Historically it was set to 250 to address throughput and power concerns.
> But these concerns should no longer apply. Throughput is more sensitive
> to base_slice, which can be controlled with task sched_attr::runtime. And
> the current state of NOHZ and RCU_LAZY should keep frequent TICKS from
> preventing CPUs from staying in deep idle states to save power when the
> system doesn't have any activity.
>
> Joel indicated that ChromeOS has seen power gains on x86 with HZ=1000.
> Andrea has done analysis at Ubuntu [1] which confirms that power is the
> same or better on x86 with no significant impact on performance.
> Phoronix has also conducted an experiment that shows performance is
> better in a number of use cases and slightly lower in others with no
> significant power impact [2]. Testing in an Android environment shows
> that the UI pipeline can have 54% and 13% fewer missed frames at a 6.67%
> power cost due to increased responsiveness of the util signal, as
> explained below.
>
> Generally having a slow TICK frequency can lead to the following
> shortcomings in scheduler decisions:
>
> 1. Imprecise time slice
> -----------------------
>
> Preemption checks occur when a new task wakes up, on return from
> interrupt or at TICK. If we have N tasks running on the same CPU then as
> a worst case scenario these tasks will time slice every TICK regardless
> of their actual slice size.
>
> By default base_slice ends up being 3ms on many systems. But due to TICK
> being 4ms by default, tasks will end up slicing every 4ms instead in
> busy scenarios. It also makes reducing base_slice to a lower value like
> 2ms or 1ms largely pointless: newly woken tasks can preempt sooner, but
> the 4ms TICK still prevents timely cycling of tasks in busy scenarios,
> which are important and frequent.
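
To put rough numbers on the point above, here is a minimal sketch. It
assumes a busy CPU where the TICK is the only preemption point and every
slice starts on a tick boundary; both are simplifications for
illustration. effective_slice_us() is a hypothetical helper, not a kernel
function. It simply rounds the requested slice up to the next tick:

```python
def tick_period_us(hz: int) -> int:
    """Length of one TICK in microseconds for a given CONFIG_HZ value."""
    return 1_000_000 // hz


def effective_slice_us(base_slice_us: int, hz: int) -> int:
    """Worst-case slice when preemption is only checked at TICK.

    Assumption (not from the patch): the slice starts on a tick boundary
    and nothing else (wakeup, interrupt return) triggers a preemption
    check, so the slice is rounded up to a whole number of ticks.
    """
    tick = tick_period_us(hz)
    ticks = -(-base_slice_us // tick)  # ceiling division on integers
    return ticks * tick


for hz in (250, 1000):
    for slice_ms in (3, 2, 1):
        eff = effective_slice_us(slice_ms * 1000, hz)
        print(f"HZ={hz}: base_slice={slice_ms}ms -> effective {eff / 1000:g}ms")
```

Under these assumptions every base_slice value up to 4ms collapses to the
same 4ms at HZ=250, while at HZ=1000 a 3ms, 2ms or 1ms slice is honoured
as requested.
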
>
> 2. Delayed load_balance()
> -------------------------
>
> Scheduler task placement decision at wake up can easily become stale as
> more tasks wake up. load_balance() is the correction point to ensure the
> system is loaded optimally. And in the case of HMP systems tasks are
> migrated to a bigger CPU to meet their compute demand.
>
> Newidle balance can help alleviate the problem. But the worst case
> scenario is for the TICK to trigger the load_balance().
>
> 3. Delayed stats update
> -----------------------
>
> And subsequently delayed cpufreq updates and misfit detection (the need
> to move a task from little CPU to a big CPU in HMP systems).
>
> When a task is busy then as a worst case scenario the util signal will
> update every TICK. Since util signal is the main driver for our
> preferred governor - schedutil - and what drives EAS to decide if
> a task fits a CPU or needs to migrate to a bigger CPU, these delays can
> be detrimental to system responsiveness.
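
A back-of-the-envelope sketch of how stale the util signal can get
between updates. It is not the kernel's PELT code: it assumes a task that
starts from zero utilization and runs continuously, a 1024 scale, and a
32ms half-life for the geometric ramp-up. The function names are
hypothetical and exist only for this illustration:

```python
PELT_HALF_LIFE_MS = 32.0   # assumed half-life of the utilization signal
UTIL_SCALE = 1024.0        # assumed full-scale utilization value


def util_after_ms(run_ms: float) -> float:
    """Approximate util of a task that has run continuously for run_ms."""
    return UTIL_SCALE * (1.0 - 0.5 ** (run_ms / PELT_HALF_LIFE_MS))


def stale_gap(run_ms: float, tick_ms: float) -> float:
    """How far util lags if the last update happened one TICK ago."""
    return util_after_ms(run_ms) - util_after_ms(run_ms - tick_ms)


for tick_ms in (4.0, 1.0):
    gap = stale_gap(8.0, tick_ms)
    print(f"tick={tick_ms:g}ms: util seen after 8ms may lag by ~{gap:.0f}")
```

With these assumptions the lag at a 4ms TICK is several times larger than
at a 1ms TICK, which is the delay that cpufreq updates and misfit
detection inherit in the worst case.
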
>
> ------------------------------------------------------------------------
>
> Note that the worst case scenario is an important and defining
> characteristic for interactive systems. It's all about the P90 and P95.
> Responsiveness IMHO is no longer a characteristic of desktop systems
> only. Modern hardware and workloads are generally interactive and need
> better latencies. To my knowledge even servers run mixed workloads and
> serve a lot of users interactively.
>
> On Android, desktop systems, etc., 120Hz is a common screen
> configuration. This gives tasks an 8ms deadline to do their work. 4ms is
> half this time, which puts more pressure than necessary on making
> exactly the right decision at wake up. And it makes utilizing the system
> effectively to maintain best perf/watt harder. As an example [3] tries
> to fix our definition of DVFS headroom to be a function of TICK as it
> defines our worst case scenario of updating stats. The larger TICK means
> we have to be overly aggressive in going into higher frequencies if we
> want to ensure perf is not impacted. But if the task didn't consume all
> of its slice, we lost an opportunity to use a lower frequency and save
> power. Lower TICK value allows us to be smarter about our resource
> allocation to balance perf and power.
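
The frame-budget arithmetic above can be sketched as follows. This is
plain arithmetic under the assumption that the whole refresh period is
available to the task; the helper names are hypothetical:

```python
def frame_period_ms(refresh_hz: int) -> float:
    """Time between two screen refreshes, in milliseconds."""
    return 1000.0 / refresh_hz


def tick_share_of_frame(hz: int, refresh_hz: int) -> float:
    """Fraction of one frame period covered by a single TICK."""
    return refresh_hz / hz


for hz in (250, 1000):
    share = tick_share_of_frame(hz, 120)
    print(f"HZ={hz}: frame={frame_period_ms(120):.2f}ms, "
          f"one TICK is {share:.0%} of the frame")
```

A 120Hz frame lasts about 8.33ms, so a single 4ms TICK covers roughly
half of it, while a 1ms TICK covers about an eighth. That leaves the
scheduler several more correction points within each frame.
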
>
> Generally, workloads working with ever smaller deadlines are not unique
> to the UI pipeline. Everything is expected to finish work sooner and be
> more responsive.
>
> As pointed out to me by Saravana though, the longer TICK did indirectly
> delay timer triggers, which means it could hide issues with
> drivers/tasks asking for frequent timers that prevent entry to deeper
> idle states (4ms is a high value to allow entry to deeper idle state for
> many systems). But one can argue this is a problem with these
> drivers/tasks. And if the delayed trigger behavior is desired we can make
> it intentional rather than accidental.
>
> The faster TICK might still result in higher power, but not due to TICK
> activities. The impact is more prominent with the schedutil governor. The
> system is more responsive (as intended), so residency in higher freqs is
> expected to rise, as the system was accidentally stuck at lower freqs
> before. The series in [3] attempts to improve scheduler handling of
> responsiveness and give users/apps a way to better provide/get their
> needs.
>
> Since the default behavior ends up in use by many unwary users, ensure
> it matches what modern systems and workloads expect, given that our NOHZ
> has come a long way in keeping TICKS tamed in idle scenarios.
>
> It is noteworthy that some folks reported that PREEMPT_LAZY helps undo
> the slight throughput loss in some benchmarks.
>
> [1] https://discourse.ubuntu.com/t/enable-low-latency-features-in-the-generic-ubuntu-kernel-for-24-04/42255
> [2] https://www.phoronix.com/news/Linux-250Hz-1000Hz-Kernel-2025
> [3] https://lore.kernel.org/lkml/20240820163512.1096301-6-qyousef@layalina.io/
>
> Acked-by: Joel Fernandes <joelagnelf@...dia.com>
> Acked-by: Vincent Guittot <vincent.guittot@...aro.org>
> Signed-off-by: Qais Yousef <qyousef@...alina.io>

FWIW, since I proposed the same in the Ubuntu generic kernel:

Acked-by: Andrea Righi <arighi@...dia.com>

Thanks,
-Andrea

> ---
>
> Changes in v2:
> * Update commit message to include some data
> * Add Acked-bys
>
> kernel/Kconfig.hz | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz
> index 38ef6d06888e..c742c9298af3 100644
> --- a/kernel/Kconfig.hz
> +++ b/kernel/Kconfig.hz
> @@ -5,7 +5,7 @@
>
>  choice
>  	prompt "Timer frequency"
> -	default HZ_250
> +	default HZ_1000
>  	help
>  	 Allows the configuration of the timer frequency. It is customary
>  	 to have the timer interrupt run at 1000 Hz but 100 Hz may be more
> --
> 2.34.1
>