Message-ID: <87cyg67up9.ffs@tglx>
Date: Tue, 28 Jan 2025 17:46:10 +0100
From: Thomas Gleixner <tglx@...utronix.de>
To: John Stultz <jstultz@...gle.com>, LKML <linux-kernel@...r.kernel.org>
Cc: John Stultz <jstultz@...gle.com>, Anna-Maria Behnsen
 <anna-maria@...utronix.de>, Frederic Weisbecker <frederic@...nel.org>,
 Ingo Molnar <mingo@...nel.org>, Peter Zijlstra <peterz@...radead.org>,
 Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
 <vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
 Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel
 Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
 Stephen Boyd <sboyd@...nel.org>, Yury Norov <yury.norov@...il.com>, Bitao
 Hu <yaoma@...ux.alibaba.com>, Andrew Morton <akpm@...ux-foundation.org>,
 kernel-team@...roid.com
Subject: Re: [RFC][PATCH 0/3] DynamicHZ: Configuring the timer tick rate at
 boot time

John!

On Mon, Jan 27 2025 at 22:32, John Stultz wrote:

> The HZ value has long been a compile time constant. This is
> really useful as there is a lot of hotpath code that uses HZ
> when setting timers (needing to convert nanoseconds to ticks),
> thus the division is much faster with a compile time constant
> divisor.

To some extent, yes. Though the meaning of the 'HZ tick' has become
pretty blurry over time.

If you actually look at the timer wheel timer usage, then the vast
majority is converting SI units based timeouts/delays to ticks and they
do not care about the actual tick frequency at all. Just grep for
'_to_jiffies()', aside from the tons of 'HZ * $N' places which are
sprinkled across the code base.
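
To illustrate why such call sites do not depend on the tick frequency, here
is a minimal user-space sketch. The names model_msecs_to_ticks() and
model_secs_to_ticks() are hypothetical and are not kernel interfaces, and
rounding up (so that a timeout never expires early) is an assumption of
this sketch. The caller only states a duration in SI units; the tick rate
just changes the resulting count.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical user-space model, not a kernel API: convert a timeout in
 * milliseconds to a tick count for a given tick frequency, rounding up.
 */
uint64_t model_msecs_to_ticks(uint64_t ms, uint32_t hz)
{
	assert(hz > 0);
	/* ceil(ms * hz / 1000) using integer arithmetic */
	return (ms * hz + 999) / 1000;
}

/* Models the 'HZ * $N' idiom: N seconds expressed in ticks. */
uint64_t model_secs_to_ticks(uint64_t secs, uint32_t hz)
{
	assert(hz > 0);
	return secs * hz;
}
```

A 100ms timeout becomes 25 ticks at 250Hz and 100 ticks at 1000Hz; the
caller asked for 100ms in both cases.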

Code which relies on accurate wakeups is mostly using hrtimers anyway.

The only two places which are truly tick bound are the scheduler and the
timer wheel itself, although even the latter does not strictly depend on it.

> One area that needed adjustments was the cputime accounting, as
> it assumes we only account one tick per interrupt, so I’ve
> reworked some of that logic to pipe through the actual tick
> count.

And you got that patently wrong...

> However, having to select the system HZ value at build time is
> somewhat limiting. Distros have to make choices for their users
> as to what the best HZ value would be balancing latency and
> power usage.
>
> With Android, this is a major issue, as we have one GKI binary
> that runs across a wide array of devices from top of the line
> flagship phones to watches. Balancing the choice for HZ is
> difficult, we currently have HZ=250, but some devices would love
> to have HZ=1000, while other devices aren’t willing to pay the
> power cost of 4x the timer slots, resulting in shorter idle
> times.

The shorter idle times are because timer wheel timers wake up more
accurately with HZ=1000 and not because the scheduler is more aggressive?

> Also, I've not yet gotten this to work for the fixed
> periodic-tick paths (you need a oneshot capable clockevent).

Which is not a given on the museum pieces we insist on supporting just
because we can. But with periodic timers it should be easy enough to
make clockevents::set_state_periodic() take a tick frequency argument
and convert the ~70 callbacks to handle it.

> Mostly because in that case we always just increment by a single
> tick. While for dyn_hz=250 or dyn_hz=1000 calculating the
> periodic tick count is pretty simple (4 ticks, 10 ticks). But
> for dyn_hz=300, or other possible values, it doesn’t evenly
> divide, so we would have to do a 3,3,4,3,3,4 style interval to
> stay on time and I’ve not yet thought through how to do
> remainder handling efficiently yet.

I doubt you need that. Programming it to the next closest value is good
enough and there is no reason to overengineer it for a marginal benefit
of "accuracy". But that's obviously not really working with your chosen
minimalistic approach.
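
The arithmetic behind the two options can be sketched as follows. This is a
user-space model under stated assumptions: a 1000Hz base tick (BASE_HZ),
and hypothetical names (struct tick_divider, next_interval_exact(),
interval_closest()) which are not taken from the patch series.
interval_closest() is one possible reading of "next closest value": round
the ideal period to the nearest whole number of base ticks and accept the
resulting rate error.

```c
#include <assert.h>
#include <stdint.h>

#define BASE_HZ 1000u	/* assumed internal base tick rate */

/* Hypothetical state for the remainder-carrying variant. */
struct tick_divider {
	uint32_t dyn_hz;	/* requested tick rate */
	uint32_t rem;		/* carried remainder, scaled by dyn_hz */
};

/*
 * Exact on average: carry the division remainder forward so that the
 * intervals sum to BASE_HZ base ticks per dyn_hz interrupts.
 */
uint32_t next_interval_exact(struct tick_divider *d)
{
	uint32_t n;

	assert(d->dyn_hz > 0 && d->dyn_hz <= BASE_HZ);
	d->rem += BASE_HZ;
	n = d->rem / d->dyn_hz;
	d->rem -= n * d->dyn_hz;
	return n;
}

/* Simple variant: one fixed interval, rounded to the nearest base tick. */
uint32_t interval_closest(uint32_t dyn_hz)
{
	assert(dyn_hz > 0 && dyn_hz <= BASE_HZ);
	return (BASE_HZ + dyn_hz / 2) / dyn_hz;
}
```

For dyn_hz=300 the exact variant yields the 3,3,4 pattern, while the closest
variant always returns 3 base ticks, i.e. an effective rate of about 333Hz
in this model. For dyn_hz=250 both return 4.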

Aside from that, using random HZ values is a pretty academic exercise and
HZ=300 was introduced for multimedia to cater for 30FPS. But that
was long ago when high resolution timers, NOHZ and modern graphics
devices did not exist.

I seriously doubt that HZ=300 has any actual advantage on modern
systems. Sure, I know that SteamOS uses HZ=300, but AFAICT from public
discussions this just caters to the HZ=300 myth and is not backed by any
factual evidence that HZ=300 is so superior. Quite the contrary, there
are enough people who actually want HZ=1000 for better responsiveness.

But let me come back to your proposed hack, which is admittedly cute,
though I'm not really convinced that it is more than a band-aid which
papers over the most obvious places to make it "work".

Let's take a step back and look at the usage of 'HZ':

  1) Jiffies and related timer wheel interfaces

     jiffies should just go away completely and be replaced by a simple
     millisecond counter, which is accessible in the same way as
     jiffies today.

     That removes the bulk of HZ usage all over the place and makes the
     usage sites simpler as the interfaces just use SI units and the
     gazillions (~4500 to jiffies and ~1000 from jiffies) back and
     forth conversions just go away.

     We obviously need to keep the time_before/after/*() interfaces for
     32-bit, unless we decide to limit the uptime for 32-bit machines to
     ~8 years and force reboot them before the counter can overflow :)

     On the timer wheel side that means that the base granularity is
     always 1ms, which only affects the maximum timeout. The timer
     expiry is just batched on the actual tick frequency and should not
     have any other side effects except for slightly moving the
     granularity boundaries depending on the tick frequency. But that's
     not any different from the hard coded HZ values.

     The other minor change is to make the next timer interrupt
     retrieval for NOHZ round up the next event to the tick boundary,
     but that's trivial enough.

  2) Clock events

     Periodic mode is trivial to fix with a tick frequency argument to
     the set_state_periodic() callback.

     Oneshot mode just works as it programs the hardware to the next
     closest event. Not much different from the current situation with a
     hard-coded HZ value.

  3) Accounting

     The accounting has to be separated from the jiffies advancement and
     it has to feed the delta to the last tick in nanoseconds into the
     accounting path, which internally operates in nanoseconds already
     today.

  4) Scheduler

     I leave that part to Peter as he definitely has a better overview
     of what needs to be done than me.
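
Points 1) and 3) above can be sketched in user space as follows. All names
here (ms_after(), round_up_to_tick(), struct acct_state, account_tick()) are
hypothetical illustrations, not existing kernel interfaces. ms_after()
follows the signed-difference idiom used by the time_after() family,
round_up_to_tick() assumes tick boundaries sit at multiples of the tick
period and that the tick rate divides 1000 evenly, and account_tick() simply
feeds the nanosecond delta since the previous tick into an accumulator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Wrap-safe "a is after b" for a 32-bit millisecond counter. The signed
 * cast relies on two's complement conversion, as the kernel idiom does.
 */
bool ms_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/*
 * Round a millisecond expiry up to the next tick boundary, so that the
 * timer is batched on the actual tick frequency.
 */
uint64_t round_up_to_tick(uint64_t expires_ms, uint32_t tick_hz)
{
	uint32_t period_ms;

	assert(tick_hz > 0 && 1000 % tick_hz == 0);
	period_ms = 1000 / tick_hz;
	return (expires_ms + period_ms - 1) / period_ms * period_ms;
}

/* Accounting decoupled from counter advancement: nanosecond deltas only. */
struct acct_state {
	uint64_t last_tick_ns;	/* timestamp of the previous tick */
	uint64_t accounted_ns;	/* total time accounted so far */
};

void account_tick(struct acct_state *s, uint64_t now_ns)
{
	uint64_t delta_ns = now_ns - s->last_tick_ns;

	s->accounted_ns += delta_ns;
	s->last_tick_ns = now_ns;
}
```

In this model a timer set to expire at 1001ms fires at 1004ms with a 250Hz
tick and at 1010ms with a 100Hz tick; the millisecond value stored by the
caller is the same in both cases, only the batching differs.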

Thanks,

        tglx
