linux-kernel - Re: [PATCH] Kconfig.hz: Change default HZ to 1000

Message-ID: <CAGETcx9T-Fz-AN0GgOCmT+xZ3JMehmz-cDf5wEm7a1QuBHWUxA@mail.gmail.com>
Date: Thu, 13 Feb 2025 00:24:58 -0800
From: Saravana Kannan <saravanak@...gle.com>
To: Qais Yousef <qyousef@...alina.io>
Cc: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...nel.org>, 
	Peter Zijlstra <peterz@...radead.org>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Juri Lelli <juri.lelli@...hat.com>, Steven Rostedt <rostedt@...dmis.org>, 
	John Stultz <jstultz@...gle.com>, Dietmar Eggemann <dietmar.eggemann@....com>, 
	Frederic Weisbecker <frederic@...nel.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] Kconfig.hz: Change default HZ to 1000

On Sun, Feb 9, 2025 at 4:19 PM Qais Yousef <qyousef@...alina.io> wrote:
>
> The frequency at which TICK happens is very important from the scheduler's
> perspective. There's a responsiveness trade-off, and for interactive
> systems the current default is set too low.
>
> Having a slow TICK frequency can lead to the following shortcomings in
> scheduler decisions:
>
> 1. Imprecise time slice
> -----------------------
>
> Preemption checks occur when a new task wakes up, on return from
> interrupt or at TICK. If we have N tasks running on the same CPU then as
> a worst case scenario these tasks will time slice every TICK regardless
> of their actual slice size.
>
> By default base_slice ends up being 3ms on many systems. But due to TICK
> being 4ms by default, tasks will end up slicing every 4ms instead in
> busy scenarios.  It also makes the effectiveness of reducing the
> base_slice to a lower value like 2ms or 1ms pointless. It will allow new
> waking tasks to preempt sooner.  But it will prevent timely cycling of
> tasks in busy scenarios. Which is an important and frequent scenario.
>
> 2. Delayed load_balance()
> -------------------------
>
> Scheduler task placement decision at wake up can easily become stale as
> more tasks wake up. load_balance() is the correction point to ensure the
> system is loaded optimally. And in the case of HMP systems tasks are
> migrated to a bigger CPU to meet their compute demand.
>
> Newidle balance can help alleviate the problem. But the worst case
> scenario is for the TICK to trigger the load_balance().
>
> 3. Delayed stats update
> -----------------------
>
> And subsequently delayed cpufreq updates and misfit detection (the need
> to move a task from little CPU to a big CPU in HMP systems).
>
> When a task is busy then as a worst case scenario the util signal will
> update every TICK. Since util signal is the main driver for our
> preferred governor - schedutil - and what drives EAS to decide if
> a task fits a CPU or needs to migrate to a bigger CPU, these delays can
> be detrimental to system responsiveness.
>
> ------------------------------------------------------------------------
>
> Note that the worst case scenario is an important and defining
> characteristic for interactive systems. It's all about the P90 and P95.
> Responsiveness IMHO is no longer a characteristic of a desktop system.
> Modern hardware and workloads are interactive generally and need better
> latencies. To my knowledge even servers run mixed workloads and serve
> a lot of users interactively.
>
> On Android and Desktop systems etc 120Hz is a common screen
> configuration. This gives tasks an 8ms deadline to do their work. 4ms is
> half this time, which puts more pressure than necessary on making the
> right decision at wake up. And it makes utilizing the system
> effectively to maintain best perf/watt harder. As an example [1] tries
> to fix our definition of DVFS headroom to be a function of TICK as it
> defines our worst case scenario of updating stats. The larger TICK means
> we have to be overly aggressive in going into higher frequencies if we
> want to ensure perf is not impacted. But if the task didn't consume all
> of its slice, we lost an opportunity to use a lower frequency and save
> power. Lower TICK value allows us to be smarter about our resource
> allocation to balance perf and power.
>
> Generally, workloads working with ever smaller deadlines are not unique
> to the UI pipeline. Everything is expected to finish work sooner and be
> more responsive.
>
> I believe HZ_250 was the default as a trade-off for battery powered
> devices that might not be happy with frequent TICKS potentially draining
> the battery unnecessarily. But to my understanding the current state of
> NOHZ should be good enough to alleviate these concerns. And recent
> addition of RCU_LAZY further helps with keeping TICK quiet in idle
> scenarios.
>
> As pointed out to me by Saravana though, the longer TICK did indirectly
> help with timer coalescing which means it could hide issues with
> drivers/tasks asking for frequent timers preventing entry to deeper idle
> states (4ms is a high value to allow entry to deeper idle state for many
> systems). But one can argue this is a problem with these drivers/tasks.
> And if the coalescing behavior is desired we can make it intentional
> rather than accidental.
>
> The faster TICK might still result in higher power, but not due to TICK
> activities. The system is more responsive (as intended) and it is
> expected the residencies in higher freqs would be higher as they were
> accidentally being stuck at lower freqs. The series in [1] attempts to
> improve scheduler handling of responsiveness and give users/apps a way
> to better provide their needs, including opting out of getting adequate
> response (rampup_multiplier being 0 in the mentioned series).
>
> Since the default behavior might end up on many unwary users, ensure it
> matches what modern systems and workloads expect given that our NOHZ has
> moved a long way to keep TICKS tamed in idle scenarios.
>
> [1] https://lore.kernel.org/lkml/20240820163512.1096301-6-qyousef@layalina.io/
>
> Signed-off-by: Qais Yousef <qyousef@...alina.io>
> ---
>  kernel/Kconfig.hz | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz
> index 38ef6d06888e..c742c9298af3 100644
> --- a/kernel/Kconfig.hz
> +++ b/kernel/Kconfig.hz
> @@ -5,7 +5,7 @@
>
>  choice
>         prompt "Timer frequency"
> -       default HZ_250
> +       default HZ_1000

This is going to mess up power for tons of IoT and low power devices.
I think we should leave the default alone and set the config in the
device specific defconfig. Even on Android, for some use cases, this
causes ~7% CPU power increase. This also causes more CPU wakeups
because jiffy based timers that are set for t + 1ms, t + 2ms, t + 3ms,
t + 4ms would all get grouped into a t + 4ms HZ wakeup, but with a 1000
HZ timer, it'd cause 4 separate wakeups.
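
A rough sketch of that coalescing arithmetic, assuming t is tick-aligned
and that each jiffy based timer simply fires at the first tick boundary
at or after its expiry. This is a simplified model, not kernel code: the
real timer wheel has additional granularity and rounding rules that it
ignores, and the name tick_wakeups is made up for illustration.

```python
def tick_wakeups(expiries_us, hz):
    """Distinct tick times (in us after t) at which the given timers fire.

    Simplified model (an assumption, not the kernel's timer wheel):
    a timer fires at the first tick boundary at or after its expiry.
    """
    tick_us = 1_000_000 // hz  # tick period; exact for HZ=250 and HZ=1000
    # Round each expiry up to the next tick boundary (integer ceiling),
    # then collapse timers that land on the same tick into one wakeup.
    fire_times = {-(-e // tick_us) * tick_us for e in expiries_us}
    return sorted(fire_times)


timers_us = [1000, 2000, 3000, 4000]  # t + 1ms .. t + 4ms

for hz in (250, 1000):
    wakeups = tick_wakeups(timers_us, hz)
    print(f"HZ={hz}: {len(wakeups)} wakeup(s) at {wakeups} us")
```

Under these assumptions the four timers collapse into a single wakeup at
t + 4ms with HZ=250, and stay as four separate wakeups with HZ=1000.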

I'd like to Nack this.

-Saravana

>         help
>          Allows the configuration of the timer frequency. It is customary
>          to have the timer interrupt run at 1000 Hz but 100 Hz may be more
> --
> 2.34.1
>