Message-ID: <20250128063301.3879317-1-jstultz@google.com>
Date: Mon, 27 Jan 2025 22:32:52 -0800
From: John Stultz <jstultz@...gle.com>
To: LKML <linux-kernel@...r.kernel.org>
Cc: John Stultz <jstultz@...gle.com>, Anna-Maria Behnsen <anna-maria@...utronix.de>, 
	Frederic Weisbecker <frederic@...nel.org>, Ingo Molnar <mingo@...nel.org>, 
	Thomas Gleixner <tglx@...utronix.de>, Peter Zijlstra <peterz@...radead.org>, 
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, 
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Valentin Schneider <vschneid@...hat.com>, Stephen Boyd <sboyd@...nel.org>, 
	Yury Norov <yury.norov@...il.com>, Bitao Hu <yaoma@...ux.alibaba.com>, 
	Andrew Morton <akpm@...ux-foundation.org>, kernel-team@...roid.com
Subject: [RFC][PATCH 0/3] DynamicHZ: Configuring the timer tick rate at boot time

The HZ value has long been a compile time constant. This is
really useful, as a lot of hotpath code uses HZ when setting
timers (converting nanoseconds to ticks), and that division is
much faster with a compile time constant divisor.
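
To illustrate the cost being described, here is a minimal userspace
sketch (not kernel code and not taken from the patches). The function
names are hypothetical. With a constant divisor the compiler can
typically turn the division into a multiply and shift, while a divisor
only known at boot generally needs a real divide instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: HZ fixed at build time. */
#define HZ 1000
#define NSEC_PER_SEC 1000000000ULL

/* Constant divisor: usually compiled to a multiply and shift. */
static inline uint64_t ns_to_ticks_const(uint64_t ns)
{
	return ns / (NSEC_PER_SEC / HZ);
}

/* Hypothetical runtime variant: the divisor is a variable, so the
 * compiler generally has to emit an actual division. */
static inline uint64_t ns_to_ticks_runtime(uint64_t ns, uint64_t hz)
{
	assert(hz > 0 && hz <= NSEC_PER_SEC);
	return ns / (NSEC_PER_SEC / hz);
}
```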

However, having to select the system HZ value at build time is
somewhat limiting. Distros have to make choices for their users
as to what the best HZ value would be balancing latency and
power usage.

With Android, this is a major issue, as we have one GKI binary
that runs across a wide array of devices, from top of the line
flagship phones to watches. Balancing the choice for HZ is
difficult: we currently have HZ=250, but some devices would love
to have HZ=1000, while other devices aren’t willing to pay the
power cost of 4x the timer slots, resulting in shorter idle
times.

(As an aside, some suggested RCU_LAZY would avoid the cost
of bumping to HZ=1000, and it does indeed help a good bit, but
we still see higher power usage compared with HZ=250)

So I’ve been thinking and talking about an idea for a while to
try to address this: DynamicHZ.

The idea is we just set HZ=1000, with 1ms granular timers.
However, using a boot time argument, we can optionally program
the clockevent that generates the timer tick to fire at a lower
frequency. Effectively this is just like the lost ticks handling
done for SMIs. The jiffies accounting is driven by the elapsed
time the clocksource reports, so time progresses properly, and
timers for all the ticks we skip will fire when the next timer
interrupt occurs. So with dyn_hz=250, the interrupt fires every
fourth tick and with dyn_hz=100, every tenth.
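
As a rough model of the behavior described above (a userspace sketch
under stated assumptions, not the code in the patches): the interrupt
period is derived from dyn_hz, and each interrupt accounts however many
whole HZ=1000 ticks have elapsed according to the clocksource. The
struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define HZ 1000                          /* build-time rate, 1 ms jiffies */
#define NSEC_PER_SEC 1000000000ULL
#define TICK_NSEC (NSEC_PER_SEC / HZ)

struct tick_state {
	uint64_t jiffies;        /* ticks accounted so far */
	uint64_t last_update_ns; /* time up to which ticks are accounted */
};

/* Interval between timer interrupts for a given dyn_hz. */
static uint64_t dyn_hz_period_ns(unsigned int dyn_hz)
{
	assert(dyn_hz > 0 && dyn_hz <= HZ);
	return NSEC_PER_SEC / dyn_hz;
}

/* On each interrupt, account every whole tick that has elapsed since
 * the last update and return how many ticks that was. */
static uint64_t tick_interrupt(struct tick_state *ts, uint64_t now_ns)
{
	uint64_t ticks = (now_ns - ts->last_update_ns) / TICK_NSEC;

	ts->jiffies += ticks;
	ts->last_update_ns += ticks * TICK_NSEC;
	return ticks;
}
```

With dyn_hz=250 the period is 4 ms, so each call to the hypothetical
tick_interrupt() advances jiffies by 4, the same way a late interrupt
after lost ticks would.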

So far, this approach has seemed to work ok.

One area that needed adjustments was the cputime accounting, as
it assumes we only account one tick per interrupt, so I’ve
reworked some of that logic to pipe through the actual tick
count.
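
A simplified sketch of what piping the tick count through might look
like follows. It is a userspace model with hypothetical names and does
not reflect the actual signatures changed in kernel/sched/cputime.c.

```c
#include <assert.h>
#include <stdint.h>

#define HZ 1000
#define TICK_NSEC (1000000000ULL / HZ)

/* Hypothetical per-task CPU time counters. */
struct cpu_times {
	uint64_t user_ns;
	uint64_t system_ns;
};

/* Old assumption: every interrupt represents exactly one tick. */
static void account_tick_single(struct cpu_times *ct, int user_mode)
{
	if (user_mode)
		ct->user_ns += TICK_NSEC;
	else
		ct->system_ns += TICK_NSEC;
}

/* Reworked idea: the caller passes the number of ticks the interrupt
 * actually covers, so skipped ticks are still charged. */
static void account_ticks(struct cpu_times *ct, int user_mode,
			  unsigned int ticks)
{
	assert(ticks > 0);
	if (user_mode)
		ct->user_ns += (uint64_t)ticks * TICK_NSEC;
	else
		ct->system_ns += (uint64_t)ticks * TICK_NSEC;
}
```

With dyn_hz=250, charging a single tick per interrupt would account
only a quarter of the elapsed time, which is why the count has to be
passed down.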

Once that was addressed, in testing with HZ=1000, dyn_hz=250,
all of the time/timer tests I've tried seem to be working
properly. I don’t see any performance regressions on tests like
Geekbench (compared to HZ=250), nor have I observed any power
regressions.

Though as the tick and scheduler code are intertwined and
neither subsystem is particularly simple, I expect I may still
be missing or forgetting things. So I wanted to send this out
for review and feedback.

I’d love to hear your thoughts or concerns!

Also, I've not yet gotten this to work for the fixed
periodic-tick paths (you need a oneshot capable clockevent),
mostly because in that case we always just increment by a single
tick. For dyn_hz=250 or dyn_hz=100, calculating the periodic
tick count is pretty simple (4 ticks, 10 ticks). But for
dyn_hz=300, or other possible values, HZ doesn’t divide evenly,
so we would have to do a 3,3,4,3,3,4 style interval to stay on
time, and I’ve not yet thought through how to do the remainder
handling efficiently.
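
One possible way to get the 3,3,4 pattern is a Bresenham-style
remainder accumulator. This is a sketch of that general technique, not
something from the patches, and all names are hypothetical. It does one
add and one compare per interrupt, with no division outside of setup.

```c
#include <assert.h>

#define HZ 1000

/* Hypothetical state for a periodic tick at a non-dividing dyn_hz. */
struct dyn_tick {
	unsigned int dyn_hz;
	unsigned int base; /* HZ / dyn_hz: whole ticks per interrupt */
	unsigned int rem;  /* HZ % dyn_hz: leftover per interrupt */
	unsigned int err;  /* accumulated leftover, always < dyn_hz */
};

static void dyn_tick_init(struct dyn_tick *dt, unsigned int dyn_hz)
{
	assert(dyn_hz > 0 && dyn_hz <= HZ);
	dt->dyn_hz = dyn_hz;
	dt->base = HZ / dyn_hz;
	dt->rem = HZ % dyn_hz;
	dt->err = 0;
}

/* Number of ticks to account for the next periodic interrupt. */
static unsigned int dyn_tick_next(struct dyn_tick *dt)
{
	unsigned int ticks = dt->base;

	dt->err += dt->rem;
	if (dt->err >= dt->dyn_hz) {
		/* A whole extra tick has accumulated; account it now. */
		dt->err -= dt->dyn_hz;
		ticks++;
	}
	return ticks;
}
```

For dyn_hz=300 this gives base=3 and rem=100, so every third interrupt
accounts an extra tick, and 300 interrupts add up to exactly 1000
ticks. Whether this is cheap enough for the tick path is an open
question.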

thanks
-john

Cc: Anna-Maria Behnsen <anna-maria@...utronix.de>
Cc: Frederic Weisbecker <frederic@...nel.org>
Cc: Ingo Molnar <mingo@...nel.org>
Cc: Thomas Gleixner <tglx@...utronix.de>
Cc: Peter Zijlstra <peterz@...radead.org>
Cc: Juri Lelli <juri.lelli@...hat.com>
Cc: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: Steven Rostedt <rostedt@...dmis.org>
Cc: Ben Segall <bsegall@...gle.com>
Cc: Mel Gorman <mgorman@...e.de>
Cc: Valentin Schneider <vschneid@...hat.com>
Cc: Stephen Boyd <sboyd@...nel.org>
Cc: Yury Norov <yury.norov@...il.com>
Cc: Bitao Hu <yaoma@...ux.alibaba.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>
Cc: kernel-team@...roid.com

John Stultz (3):
  time/tick: Pipe tick count down through cputime accounting
  time/tick: Introduce a dyn_hz boot option
  Kconfig: Add CONFIG_DYN_HZ_DEFAULT to specify the default dynhz= boot
    option value

 include/linux/kernel_stat.h |  4 ++--
 include/linux/tick.h        | 11 +++++++++--
 kernel/Kconfig.hz           | 19 +++++++++++++++++++
 kernel/sched/cputime.c      |  6 +++---
 kernel/time/tick-common.c   | 32 +++++++++++++++++++++++++++++++-
 kernel/time/tick-legacy.c   |  2 +-
 kernel/time/tick-sched.c    | 33 ++++++++++++++++++---------------
 kernel/time/timekeeping.h   |  2 +-
 kernel/time/timer.c         |  4 ++--
 9 files changed, 86 insertions(+), 27 deletions(-)

-- 
2.48.1.262.g85cc9f2d1e-goog

