Message-ID: <87cyg67up9.ffs@tglx>
Date: Tue, 28 Jan 2025 17:46:10 +0100
From: Thomas Gleixner <tglx@...utronix.de>
To: John Stultz <jstultz@...gle.com>, LKML <linux-kernel@...r.kernel.org>
Cc: John Stultz <jstultz@...gle.com>, Anna-Maria Behnsen
<anna-maria@...utronix.de>, Frederic Weisbecker <frederic@...nel.org>,
Ingo Molnar <mingo@...nel.org>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel
Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Stephen Boyd <sboyd@...nel.org>, Yury Norov <yury.norov@...il.com>, Bitao
Hu <yaoma@...ux.alibaba.com>, Andrew Morton <akpm@...ux-foundation.org>,
kernel-team@...roid.com
Subject: Re: [RFC][PATCH 0/3] DynamicHZ: Configuring the timer tick rate at
boot time
John!
On Mon, Jan 27 2025 at 22:32, John Stultz wrote:
> The HZ value has long been a compile time constant. This is
> really useful as there is a lot of hotpath code that uses HZ
> when setting timers (needing to convert nanoseconds to ticks),
> thus the division is much faster with a compile time constant
> divisor.
To some extent, yes. Though the meaning of the 'HZ tick' has become
pretty blurry over time.
If you actually look at the timer wheel timer usage, then the vast
majority is converting SI units based timeouts/delays to ticks and they
do not care about the actual tick frequency at all. Just grep for
'_to_jiffies()' aside from the tons of 'HZ * $N' places which are
sprinkled across the code base.
Code which relies on accurate wakeups is mostly using hrtimers anyway.
The only two places which are truly tick bound are the scheduler and the
timer wheel itself, where the latter is not really about it.
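
To illustrate the point about SI unit based timeouts, here is a minimal userspace sketch of a millisecond to tick conversion where the tick rate is a runtime variable instead of the compile-time HZ. The names tick_hz and ms_to_ticks() are made up for this example and are not kernel interfaces; rounding up is an assumed policy so that a timeout never fires early.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical boot-time tick rate, standing in for the compile-time HZ. */
static unsigned int tick_hz = 250;

/*
 * Convert a millisecond timeout to ticks, rounding up.
 * Illustrative only; this is not the kernel's msecs_to_jiffies().
 */
static uint64_t ms_to_ticks(uint64_t ms)
{
    assert(tick_hz > 0);
    return (ms * tick_hz + 999) / 1000;
}
```

With tick_hz = 250 a 10ms timeout becomes 3 ticks, with tick_hz = 1000 it becomes 10. The caller only ever deals in milliseconds.
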
> However, having to select the system HZ value at build time is
> somewhat limiting. Distros have to make choices for their users
> as to what the best HZ value would be balancing latency and
> power usage.
>
> With Android, this is a major issue, as we have one GKI binary
> that runs across a wide array of devices from top of the line
> flagship phones to watches. Balancing the choice for HZ is
> difficult, we currently have HZ=250, but some devices would love
> to have HZ=1000, while other devices aren’t willing to pay the
> power cost of 4x the timer slots, resulting in shorter idle
> times.
The shorter idle times are because timer wheel timers wake up more
accurately with HZ=1000 and not because the scheduler is more aggressive?
> One area that needed adjustments was the cputime accounting, as
> it assumes we only account one tick per interrupt, so I’ve
> reworked some of that logic to pipe through the actual tick
> count.
And you got that patently wrong...
> Also, I've not yet gotten this to work for the fixed
> periodic-tick paths (you need a oneshot capable clockevent).
Which is not a given on the museum pieces we insist on supporting just
because we can. But with periodic timers it should be easy enough to
make clockevents::set_state_periodic() take a tick frequency argument
and convert the ~70 callbacks to handle it.
> Mostly because in that case we always just increment by a single
> tick. While for dyn_hz=250 or dyn_hz=100 calculating the
> periodic tick count is pretty simple (4 ticks, 10 ticks). But
> for dyn_hz=300, or other possible values, it doesn’t evenly
> divide, so we would have to do a 3,3,4,3,3,4 style interval to
> stay on time and I’ve not yet thought through how to do
> remainder handling efficiently yet.
I doubt you need that. Programming it to the next closest value is good
enough and there is no reason to overengineer it for a marginal benefit
of "accuracy". But that's obviously not really working with your chosen
minimalistic approach.
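
For reference, the 3,3,4 pattern from the quoted text falls out of carrying the division remainder from one interrupt to the next. A rough sketch, assuming a fixed internal rate of 1000 ticks per second as the quoted examples imply; BASE_HZ, struct tick_acc and ticks_for_interrupt() are hypothetical names and are not taken from the patches:

```c
#include <assert.h>

#define BASE_HZ 1000u   /* assumed fixed internal tick rate */

struct tick_acc {
    unsigned int dyn_hz;   /* boot-selected interrupt rate */
    unsigned int rem;      /* carried remainder, always < dyn_hz */
};

/* Return how many internal ticks to advance for this interrupt. */
static unsigned int ticks_for_interrupt(struct tick_acc *acc)
{
    assert(acc->dyn_hz > 0 && acc->dyn_hz <= BASE_HZ);

    unsigned int total = BASE_HZ + acc->rem;

    acc->rem = total % acc->dyn_hz;   /* keep what did not divide evenly */
    return total / acc->dyn_hz;
}
```

For dyn_hz=300 this yields 3, 3, 4 repeating, which adds up to exactly 1000 internal ticks per 300 interrupts. For dyn_hz=250 the remainder stays zero and every interrupt advances 4 ticks.
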
Aside from that, using random HZ values is a pretty academic exercise and
HZ=300 had been introduced for multimedia to cater for 30FPS. But that
was long ago when high resolution timers, NOHZ and modern graphic
devices did not exist.
I seriously doubt that HZ=300 has any actual advantage on modern
systems. Sure, I know that SteamOS uses HZ=300, but AFAICT from public
discussions this just caters to the HZ=300 myth and is not backed by any
factual evidence that HZ=300 is so superior. Quite the contrary there
are enough people who actually want HZ=1000 for better responsiveness.
But let me come back to your proposed hack, which is admittedly cute.
Though I'm not really convinced that it is more than a bandaid, which
papers over the most obvious places to make it "work".
Let's take a step back and look at the usage of 'HZ':
1) Jiffies and related timer wheel interfaces
jiffies should just go away completely and be replaced by a simple
millisecond counter, which is accessible in the same way as
jiffies today.
That removes the bulk of HZ usage all over the place and makes the
usage sites simpler as the interfaces just use SI units and the
gazillions (~4500 to jiffies and ~1000 from jiffies) back and
forth conversions just go away.
We obviously need to keep the time_before/after/*() interfaces for
32-bit, unless we decide to limit the uptime for 32-bit machines to
~49 days and force reboot them before the counter can overflow :)
On the timer wheel side that means that the base granularity is
always 1ms, which only affects the maximum timeout. The timer
expiry is just batched on the actual tick frequency and should not
have any other side effects except for slightly moving the
granularity boundaries depending on the tick frequency. But that's
not any different from the hard coded HZ values.
The other minor change is to make the next timer interrupt
retrieval for NOHZ round up the next event to the tick boundary,
but that's trivial enough.
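
The two mechanical pieces of 1) can be sketched in userspace C: a wrap-safe comparison on a 32-bit millisecond counter, in the style of time_after(), and rounding an expiry up to the next tick boundary. ms_after(), round_up_to_tick() and the tick_hz parameter are hypothetical names for this example, and the rounding helper assumes the tick period is a whole number of milliseconds:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Wrap-safe "a is after b" on a 32-bit millisecond counter.
 * Only valid while the two values are less than 2^31 ms apart.
 * Relies on the usual two's complement conversion to int32_t.
 */
static int ms_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/*
 * Round a millisecond expiry up to the next tick boundary.
 * Assumes 1000 is evenly divisible by tick_hz.
 */
static uint64_t round_up_to_tick(uint64_t expiry_ms, unsigned int tick_hz)
{
    assert(tick_hz > 0 && 1000 % tick_hz == 0);

    uint64_t period_ms = 1000 / tick_hz;

    return (expiry_ms + period_ms - 1) / period_ms * period_ms;
}
```

At a 250 Hz tick an expiry of 1001ms is rounded up to 1004ms, while an expiry that already sits on a boundary is left alone.
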
2) Clock events
Periodic mode is trivial to fix with a tick frequency argument to
the set_state_periodic() callback.
Oneshot mode just works as it programs the hardware to the next
closest event. Not much different from the current situation with a
hard-coded HZ value.
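
A toy model of what a tick frequency argument for periodic mode could look like. This is purely illustrative: struct toy_clockevent and toy_set_periodic() are invented for this sketch and do not match the real clockevents structures or callback signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ull

/* Toy stand-in for a clockevent device; not the kernel's struct. */
struct toy_clockevent {
    uint64_t period_ns;   /* programmed interrupt period */
    int (*set_state_periodic)(struct toy_clockevent *dev,
                              unsigned int tick_hz);
};

/* Derive the period from the requested rate instead of a fixed HZ. */
static int toy_set_periodic(struct toy_clockevent *dev, unsigned int tick_hz)
{
    assert(dev != NULL);

    if (tick_hz == 0)
        return -1;

    dev->period_ns = NSEC_PER_SEC / tick_hz;
    return 0;
}
```
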
3) Accounting
The accounting has to be separated from the jiffies advancement and
it has to feed the delta to the last tick in nanoseconds into the
accounting path, which internally operates in nanoseconds already
today.
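
A minimal sketch of the idea in 3), assuming a per-CPU record that remembers the timestamp of the previous tick. struct cpu_acct and account_tick() are hypothetical names and this is not the kernel's cputime code; it only shows that feeding a nanosecond delta makes the accounting independent of how many nominal ticks elapsed.

```c
#include <assert.h>
#include <stdint.h>

struct cpu_acct {
    uint64_t last_tick_ns;   /* timestamp of the previous tick */
    uint64_t accounted_ns;   /* total time accounted so far */
};

/* Account the time since the last tick, whatever the tick rate is. */
static uint64_t account_tick(struct cpu_acct *acct, uint64_t now_ns)
{
    assert(now_ns >= acct->last_tick_ns);

    uint64_t delta = now_ns - acct->last_tick_ns;

    acct->last_tick_ns = now_ns;
    acct->accounted_ns += delta;
    return delta;
}
```
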
4) Scheduler
I leave that part to Peter as he definitely has a better overview
of what needs to be done than me.
Thanks,
tglx