Message-ID: <20250212145054.GA1965539@joelnvbox>
Date: Wed, 12 Feb 2025 09:50:54 -0500
From: Joel Fernandes <joelagnelf@...dia.com>
To: Qais Yousef <qyousef@...alina.io>
Cc: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
John Stultz <jstultz@...gle.com>,
Saravana Kannan <saravanak@...gle.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Frederic Weisbecker <frederic@...nel.org>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] Kconfig.hz: Change default HZ to 1000
On Mon, Feb 10, 2025 at 12:19:15AM +0000, Qais Yousef wrote:
> The frequency at which the TICK happens is very important from the
> scheduler's perspective. There's a responsiveness trade-off, and for
> interactive systems the current default is set too low.
Another thing that screws up pretty badly, at least with pre-EEVDF CFS, is
the extra lag that gets added to high nice value tasks: the coarser tick
causes low nice value tasks to get an even longer time slice. I caught this
when tracing Android a few years ago. ISTR this was pretty bad, almost to
the point of defeating fairness. Not sure if that still shows with EEVDF
though.
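
To put rough numbers on what I saw (userspace toy arithmetic, not kernel
code; IIRC 1024/15 are the nice 0 / nice 19 weights from
sched_prio_to_weight[], the 6ms period is made up):

/*
 * Toy model: pre-EEVDF CFS aims for slices proportional to weight
 * within a period, but a CPU-bound task is only preempted once a
 * tick notices it overran its ideal runtime. Round the nice-0
 * task's slice up to the tick and see how much extra the nice-19
 * task waits.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
	const double period_ms = 6.0;			/* assumed sched period */
	const double w0 = 1024.0, w19 = 15.0;		/* nice 0 vs nice 19 */
	const double ticks_ms[] = { 1.0, 4.0 };		/* HZ=1000 vs HZ=250 */
	double ideal0 = period_ms * w0 / (w0 + w19);

	for (int i = 0; i < 2; i++) {
		double tick = ticks_ms[i];
		double actual0 = ceil(ideal0 / tick) * tick;

		printf("tick=%.0fms: nice0 ideal=%.2fms actual=%.2fms, extra lag for nice19=%.2fms\n",
		       tick, ideal0, actual0, actual0 - ideal0);
	}
	return 0;
}

With a 4ms tick the nice 19 task waits an extra ~2ms on top of its already
tiny share; with a 1ms tick it's ~0.1ms.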
>
> Having a slow TICK frequency can lead to the following shortcomings in
> scheduler decisions:
>
> 1. Imprecise time slice
> -----------------------
>
> Preemption checks occur when a new task wakes up, on return from
> interrupt, or at the TICK. If we have N tasks running on the same CPU,
> then as a worst case scenario these tasks will time slice every TICK
> regardless of their actual slice size.
>
> By default base_slice ends up being 3ms on many systems. But with the
> TICK being 4ms by default, tasks will end up slicing every 4ms instead
> in busy scenarios. It also makes reducing base_slice to a lower value
> like 2ms or 1ms pointless: it will allow newly waking tasks to preempt
> sooner, but the coarse TICK will still prevent timely cycling of tasks
> when the CPU is busy, which is an important and frequent scenario.
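
Just to put numbers on the rounding effect described here (userspace toy,
not kernel code; it only rounds the requested slice up to the next tick,
which is the worst case you describe):

#include <math.h>
#include <stdio.h>

/* Worst case: a CPU-bound task keeps running until the first tick
 * after its slice expired, so the effective slice is the requested
 * slice rounded up to the tick. */
static double effective_slice_ms(double base_slice_ms, double tick_ms)
{
	return ceil(base_slice_ms / tick_ms) * tick_ms;
}

int main(void)
{
	const double hz[] = { 100.0, 250.0, 1000.0 };

	for (int i = 0; i < 3; i++) {
		double tick = 1000.0 / hz[i];

		printf("HZ=%4.0f (tick %4.1fms): base_slice 3ms -> %4.1fms, 1ms -> %4.1fms\n",
		       hz[i], tick,
		       effective_slice_ms(3.0, tick),
		       effective_slice_ms(1.0, tick));
	}
	return 0;
}

i.e. at HZ=250, lowering base_slice below the 4ms tick buys nothing for the
busy case, which matches your point.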
>
> 2. Delayed load_balance()
> -------------------------
>
> The scheduler's task placement decision at wake up can easily become
> stale as more tasks wake up. load_balance() is the correction point to
> ensure the system is loaded optimally, and in the case of HMP systems
> that tasks are migrated to a bigger CPU to meet their compute demand.
>
> Newidle balance can help alleviate the problem. But the worst case
> scenario is for the TICK to trigger the load_balance().
>
> 3. Delayed stats update
> -----------------------
>
> And subsequently delayed cpufreq updates and misfit detection (the need
> to move a task from a little CPU to a big CPU in HMP systems).
>
> When a task is busy then, as a worst case scenario, the util signal
> will only update every TICK. Since the util signal is the main driver
> for our preferred governor - schedutil - and is what drives EAS to
> decide whether a task fits a CPU or needs to migrate to a bigger CPU,
> these delays can be detrimental to system responsiveness.
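
FWIW, a rough feel for how stale util can get between tick-driven updates
(userspace sketch; I'm treating the ~32ms PELT half-life as a continuous
approximation, and the starting util of 300 is made up):

#include <math.h>
#include <stdio.h>

/* Continuous approximation of PELT for a task that runs flat out:
 * util(t) ~= 1024 * (1 - y^t) + util(0) * y^t, with y = 0.5^(1/32)
 * per ms (32ms half-life). The delta is how much growth the
 * governor doesn't see until the next update. */
int main(void)
{
	const double u0 = 300.0;		/* assumed starting util */
	const double update_ms[] = { 1.0, 4.0, 10.0 };

	for (int i = 0; i < 3; i++) {
		double t = update_ms[i];
		double y_t = pow(0.5, t / 32.0);
		double u = 1024.0 * (1.0 - y_t) + u0 * y_t;

		printf("update every %4.1fms: util %.0f -> %.0f (unseen growth %.0f)\n",
		       t, u0, u, u - u0);
	}
	return 0;
}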
>
> ------------------------------------------------------------------------
>
> Note that the worst case scenario is an important and defining
> characteristic for interactive systems. It's all about the P90 and P95.
> Responsiveness IMHO is no longer a characteristic of desktop systems
> only. Modern hardware and workloads are generally interactive and need
> better latencies. To my knowledge even servers run mixed workloads and
> serve a lot of users interactively.
>
> On Android and desktop systems etc. 120Hz is a common screen
> configuration. This gives tasks roughly an 8ms deadline to do their
> work. 4ms is half of this time, which puts more burden than necessary
> on making a very correct decision at wake up, and makes utilizing the
> system effectively to maintain the best perf/watt harder. As an
> example, [1] tries to fix our definition of DVFS headroom to be a
> function of the TICK, as the TICK defines our worst case for updating
> stats. The larger TICK means we have to be overly aggressive in going
> to higher frequencies if we want to ensure perf is not impacted. But if
> the task didn't consume all of its slice, we lost an opportunity to use
> a lower frequency and save power. A lower TICK value allows us to be
> smarter about our resource allocation to balance perf and power.
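
And a sketch of the headroom point as I read it (not the actual formula
from your series in [1]; the fixed ~25% margin is what mainline schedutil
applies, the tick-based margin below is just the concept, and the 2 GHz
fmax / util of 512 are made up):

#include <math.h>
#include <stdio.h>

/* util a flat-out task would reach after t ms (32ms PELT half-life,
 * continuous approximation) */
static double util_after(double util, double t_ms)
{
	double y_t = pow(0.5, t_ms / 32.0);

	return 1024.0 * (1.0 - y_t) + util * y_t;
}

int main(void)
{
	const double fmax = 2000.0;	/* MHz, made-up CPU */
	const double util = 512.0;

	/* mainline-style fixed ~25% headroom */
	printf("fixed 1.25 headroom: %.0f MHz\n",
	       1.25 * fmax * util / 1024.0);
	/* tick-aware headroom: cover the growth until the next update */
	printf("4ms tick headroom:   %.0f MHz\n",
	       fmax * util_after(util, 4.0) / 1024.0);
	printf("1ms tick headroom:   %.0f MHz\n",
	       fmax * util_after(util, 1.0) / 1024.0);
	return 0;
}

i.e. the smaller the tick, the less we have to over-request to stay safe,
which matches the perf/watt argument.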
>
> Generally, working with ever smaller deadlines is not unique to the UI
> pipeline. Everything is expected to finish its work sooner and be more
> responsive.
>
> I believe HZ_250 was the default as a trade-off for battery powered
> devices that might not be happy with frequent TICKs potentially
> draining the battery unnecessarily. But to my understanding the current
> state of
Actually, on x86, Steve and I did some debugging on Chromebooks and we
found that HZ_250 actually increased power versus a higher HZ. This was
because the cpuidle governor changes C-states on the tick, and by making
the tick less frequent, the CPU could stay in a shallow C-state for longer.
> NOHZ should be good enough to alleviate these concerns. And the recent
> addition of RCU_LAZY further helps with keeping the TICK quiet in idle
> scenarios.
>
> As pointed out to me by Saravana though, the longer TICK did indirectly
> help with timer coalescing, which means it could hide issues with
> drivers/tasks asking for frequent timers and preventing entry to deeper
> idle states (4ms is a long enough interval to allow entry to a deeper
> idle state on many systems). But one can argue this is a problem with
> those drivers/tasks. And if the coalescing behavior is desired, we can
> make it intentional rather than accidental.
I am not sure how much coalescing of timer-wheel events matters. My
impression is that coalescing matters only for hrtimers since those can be
more granular.
>
> The faster TICK might still result in higher power, but not due to TICK
> activities themselves. The system is more responsive (as intended) and
> it is expected that residencies at higher frequencies would be higher,
> as tasks were previously accidentally getting stuck at lower
> frequencies. The series in [1] attempts to improve the scheduler's
> handling of responsiveness and give users/apps a way to better express
> their needs, including opting out of getting an adequate response
> (rampup_multiplier being 0 in the mentioned series).
>
> Since the default behavior might end up being used by many unwary
> users, ensure it matches what modern systems and workloads expect,
> given that NOHZ has come a long way in keeping TICKs tamed in idle
> scenarios.
>
> [1] https://lore.kernel.org/lkml/20240820163512.1096301-6-qyousef@layalina.io/
>
> Signed-off-by: Qais Yousef <qyousef@...alina.io>
> ---
> kernel/Kconfig.hz | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz
> index 38ef6d06888e..c742c9298af3 100644
> --- a/kernel/Kconfig.hz
> +++ b/kernel/Kconfig.hz
> @@ -5,7 +5,7 @@
>
> choice
> prompt "Timer frequency"
> - default HZ_250
> + default HZ_1000
It's fine with me, but I wonder who else cares about the HZ_250 default. I
certainly don't. And if someone really wants it for an odd reason, they can
just adjust the config for themselves.
Acked-by: Joel Fernandes <joelagnelf@...dia.com>
thanks,
- Joel
> help
> Allows the configuration of the timer frequency. It is customary
> to have the timer interrupt run at 1000 Hz but 100 Hz may be more
> --
> 2.34.1
>