Message-Id: <20250210001915.123424-1-qyousef@layalina.io>
Date: Mon, 10 Feb 2025 00:19:15 +0000
From: Qais Yousef <qyousef@...alina.io>
To: Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Vincent Guittot <vincent.guittot@...aro.org>
Cc: Juri Lelli <juri.lelli@...hat.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	John Stultz <jstultz@...gle.com>,
	Saravana Kannan <saravanak@...gle.com>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Frederic Weisbecker <frederic@...nel.org>,
	linux-kernel@...r.kernel.org,
	Qais Yousef <qyousef@...alina.io>
Subject: [PATCH] Kconfig.hz: Change default HZ to 1000

The frequency at which TICK happens is very important from the
scheduler's perspective. There is a responsiveness trade-off, and for
interactive systems the current default is set too low.

Having a slow TICK frequency can lead to the following shortcomings in
scheduler decisions:

1. Imprecise time slice
-----------------------

Preemption checks occur when a new task wakes up, on return from
interrupt or at TICK. If we have N tasks running on the same CPU then as
a worst case scenario these tasks will time slice every TICK regardless
of their actual slice size.

By default base_slice ends up being 3ms on many systems. But due to TICK
being 4ms by default, tasks will end up slicing every 4ms instead in
busy scenarios. It also makes reducing the base_slice to a lower value
like 2ms or 1ms largely pointless. It will allow newly waking tasks to
preempt sooner, but it will not make already running tasks cycle any
faster in busy scenarios, which are important and frequent.
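
The rounding effect described above can be illustrated with a small
user space sketch. This is not kernel code. The function names are
hypothetical, and it assumes a simplified model where a task starts
running on a TICK boundary and TICK is the only preemption check that
fires while it runs.

```python
def tick_period_us(hz: int) -> int:
    """Length of one TICK in microseconds (integer approximation)."""
    return 1_000_000 // hz


def worst_case_slice_us(base_slice_us: int, hz: int) -> int:
    """Round the requested slice up to the next TICK boundary.

    Assumption (not taken from the kernel): the task starts on a TICK
    boundary and no wake up or interrupt triggers an earlier check.
    """
    tick = tick_period_us(hz)
    ticks_needed = -(-base_slice_us // tick)  # ceiling division
    return ticks_needed * tick


if __name__ == "__main__":
    for hz in (250, 1000):
        for base_ms in (3, 2, 1):
            got = worst_case_slice_us(base_ms * 1000, hz)
            print(f"HZ={hz} base_slice={base_ms}ms -> {got // 1000}ms")
```

Under these assumptions HZ=250 yields a 4ms slice for all of 3ms, 2ms
and 1ms, while HZ=1000 yields 3ms, 2ms and 1ms respectively.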

2. Delayed load_balance()
-------------------------

The scheduler's task placement decision at wake up can easily become
stale as more tasks wake up. load_balance() is the correction point to
ensure the system is loaded optimally. And in the case of HMP systems,
tasks are migrated to a bigger CPU to meet their compute demand.

Newidle balance can help alleviate the problem. But the worst case
scenario is for the TICK to trigger the load_balance().

3. Delayed stats update
-----------------------

And subsequently delayed cpufreq updates and misfit detection (the need
to move a task from little CPU to a big CPU in HMP systems).

When a task is busy then as a worst case scenario the util signal will
update every TICK. Since util signal is the main driver for our
preferred governor - schedutil - and what drives EAS to decide if
a task fits a CPU or needs to migrate to a bigger CPU, these delays can
be detrimental to system responsiveness.

------------------------------------------------------------------------

Note that the worst case scenario is an important and defining
characteristic for interactive systems. It's all about the P90 and P95.
Responsiveness IMHO is no longer a characteristic of desktop systems
alone. Modern hardware and workloads are generally interactive and need
better latencies. To my knowledge even servers run mixed workloads and
serve a lot of users interactively.

On Android, desktop systems, etc. 120Hz is a common screen
configuration. This gives tasks an 8ms deadline to do their work. A 4ms
TICK is half this time, which puts more pressure than necessary on
making the right decision at wake up. And it makes utilizing the system
effectively to maintain best perf/watt harder. As an example [1] tries
to fix our definition of DVFS headroom to be a function of TICK as it
defines our worst case scenario of updating stats. The larger TICK means
we have to be overly aggressive in going into higher frequencies if we
want to ensure perf is not impacted. But if the task didn't consume all
of its slice, we lost an opportunity to use a lower frequency and save
power. Lower TICK value allows us to be smarter about our resource
allocation to balance perf and power.
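
As a rough illustration of the arithmetic above, the sketch below
computes the frame budget for a given refresh rate and the share of it
that a single TICK occupies. The function names are hypothetical and
the model is an assumption for illustration only: it treats one TICK
as the worst case delay before stats are updated, as described in this
message.

```python
def frame_budget_ms(refresh_hz: int) -> float:
    """Time available per frame at the given display refresh rate."""
    return 1000.0 / refresh_hz


def tick_share_of_frame(tick_hz: int, refresh_hz: int) -> float:
    """Fraction of one frame budget consumed by a single TICK period."""
    tick_ms = 1000.0 / tick_hz
    return tick_ms / frame_budget_ms(refresh_hz)


if __name__ == "__main__":
    budget = frame_budget_ms(120)
    print(f"120Hz frame budget: {budget:.2f}ms")
    for tick_hz in (250, 1000):
        share = tick_share_of_frame(tick_hz, 120)
        print(f"HZ={tick_hz}: one TICK is {share:.0%} of the frame")
```

With a 120Hz display the budget is about 8.33ms, so one TICK is roughly
48% of a frame at HZ=250 and roughly 12% at HZ=1000.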

Generally, workloads working with ever smaller deadlines are not unique
to the UI pipeline. Everything is expected to finish work sooner and be
more responsive.

I believe HZ_250 was the default as a trade-off for battery powered
devices that might not be happy with frequent TICKs potentially draining
the battery unnecessarily. But to my understanding the current state of
NOHZ should be good enough to alleviate these concerns. And the recent
addition of RCU_LAZY further helps with keeping TICK quiet in idle
scenarios.

As pointed out to me by Saravana though, the longer TICK did indirectly
help with timer coalescing which means it could hide issues with
drivers/tasks asking for frequent timers preventing entry to deeper idle
states (4ms is a high value to allow entry to deeper idle state for many
systems). But one can argue this is a problem with these drivers/tasks.
And if the coalescing behavior is desired we can make it intentional
rather than accidental.

The faster TICK might still result in higher power, but not due to TICK
activities. The system is more responsive (as intended) and it is
expected the residencies in higher freqs would be higher as they were
accidentally being stuck at lower freqs. The series in [1] attempts to
improve scheduler handling of responsiveness and give users/apps a way
to better express their needs, including opting out of getting adequate
response (rampup_multiplier being 0 in the mentioned series).

Since the default behavior might end up being used by many unwary users,
ensure it matches what modern systems and workloads expect, given that
our NOHZ has come a long way to keep TICKs tamed in idle scenarios.

[1] https://lore.kernel.org/lkml/20240820163512.1096301-6-qyousef@layalina.io/

Signed-off-by: Qais Yousef <qyousef@...alina.io>
---
 kernel/Kconfig.hz | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz
index 38ef6d06888e..c742c9298af3 100644
--- a/kernel/Kconfig.hz
+++ b/kernel/Kconfig.hz
@@ -5,7 +5,7 @@
 
 choice
 	prompt "Timer frequency"
-	default HZ_250
+	default HZ_1000
 	help
 	 Allows the configuration of the timer frequency. It is customary
 	 to have the timer interrupt run at 1000 Hz but 100 Hz may be more
-- 
2.34.1

