[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANDhNCon7mFR5b6hjFNH36ARJ+LLoNkw-=hF_ovsy83UUbRqXw@mail.gmail.com>
Date: Tue, 28 Jan 2025 22:10:05 -0800
From: John Stultz <jstultz@...gle.com>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: LKML <linux-kernel@...r.kernel.org>,
Anna-Maria Behnsen <anna-maria@...utronix.de>, Frederic Weisbecker <frederic@...nel.org>,
Ingo Molnar <mingo@...nel.org>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>, Stephen Boyd <sboyd@...nel.org>,
Yury Norov <yury.norov@...il.com>, Bitao Hu <yaoma@...ux.alibaba.com>,
Andrew Morton <akpm@...ux-foundation.org>, kernel-team@...roid.com,
Saravana Kannan <saravanak@...gle.com>, Samuel Wu <wusamuel@...gle.com>,
Qais Yousef <qyousef@...gle.com>
Subject: Re: [RFC][PATCH 0/3] DynamicHZ: Configuring the timer tick rate at
boot time
On Tue, Jan 28, 2025 at 8:46 AM Thomas Gleixner <tglx@...utronix.de> wrote:
> On Mon, Jan 27 2025 at 22:32, John Stultz wrote:
> > With Android, this is a major issue, as we have one GKI binary
> > that runs across a wide array of devices from top of the line
> > flagship phones to watches. Balancing the choice for HZ is
> > difficult, we currently have HZ=250, but some devices would love
> > to have HZ=1000, while other devices aren’t willing to pay the
> > power cost of 4x the timer slots, resulting in shorter idle
> > times.
>
> The shorter idle times are because timer wheel timers wake up more
> accurately with HZ=1000 and not because the scheduler is more agressive?
Saravana and Samuel gave a talk on this at plumbers that has more details
https://lpc.events/event/18/contributions/1842/
Specifically, starting on slide 9 here:
https://lpc.events/event/18/contributions/1842/attachments/1476/3126/LPC%202024%20-%20Wattson.pdf
Where the load is a fairly light one. I'm regenerating some data on
that now to look a little more closely at to make sure it wasn't
faster up-migration to costlier cpus (but I'd not expect that given
how light the load is).
But even if it is the faster up-migration, the example showed HZ=1000
to have added power costs for workloads that were performing
acceptably at HZ=250. Which makes it a hard sell for folks who are
focused on avoiding power regressions. Potentially folks could tune
their way out of that problem, but it's a larger ask. Thus the desire
to thread the needle somehow to allow HZ=1000 as an option without
disrupting folks happy with current HZ=250 behavior.
> > Also, I've not yet gotten this to work for the fixed
> > periodic-tick paths (you need a oneshot capable clockevent).
>
> Which is not a given on the museum pieces we insist to support just
> because we can. But with periodic timers it should be easy enough to
> make clockevents::set_state_periodic() take a tick frequency argument
> and convert the ~70 callbacks to handle it.
I'll take a look at this. Mostly I was just highlighting a deficiency
in the patch where DynHZ wouldn't impact things, as
periodic-clockevents didn't seem the most critical for the benefit of
the feature.
> > Mostly because in that case we always just increment by a single
> > tick. While for dyn_hz=250 or dyn_hz=1000 calculating the
> > periodic tick count is pretty simple (4 ticks, 10 ticks). But
> > for dyn_hz=300, or other possible values, it doesn’t evenly
> > divide, so we would have to do a 3,3,4,3,3,4 style interval to
> > stay on time and I’ve not yet thought through how to do
> > remainder handling efficiently yet.
>
> I doubt you need that. Programming it to the next closest value is good
> enough and there is no reason to overengineer it for a marginal benefit
> of "accuracy". But that's obviously not really working with your chosen
> minimalistic approach.
>
> Aside of that, using random HZ values is a pretty academic exercise and
> HZ=300 had been introduced for multimedia to cater for 30FPS. But that
> was long ago when high resolution timers, NOHZ and modern graphic
> devices did not exist.
>
> I seriously doubt that HZ=300 has any actual advantage on modern
> systems. Sure, I know that SteamOS uses HZ=300, but AFAICT from public
> discussions this just caters to the HZ=300 myth and is not backed by any
> factual evidence that HZ=300 is so superior. Quite the contrary there
> are enough people who actually want HZ=1000 for better responsiveness.
Yeah. Initially I was shooting for dyn_hz to be possibly even more
flexible then HZ (500! 10! why not?) since it initially seemed it
would be simple, but I obviously underestimated the accounting
complexity, so I'm fine if it's relegated to 250 or 100. :)
> But let me come back to your proposed hack, which is admittedly cute.
> Though I'm not really convinced that it is more than a bandaid, which
> papers over the most obvious places to make it "work".
>
> Let's take a step back and look at the usage of 'HZ':
>
> 1) Jiffies and related timer wheel interfaces
>
> jiffies should just go away completely and be replaced by a simple
> millisecond counter, which is accessible in the same way as
> jiffies today.
>
> That removes the bulk of HZ usage all over the place and makes the
> usage sites simpler as the interfaces just use SI units and the
> gazillions (~4500 to jiffies and ~1000 from jiffies) back and
> forth conversions just go away.
Yeah, this was basically where I was hoping this would allow us to go.
I was imagining once dyn_hz was possible, we could basically fix HZ to
1000 internally, leaving jiffies as that 1ms counter, and let the
actual interrupt rate be set via the dynhz default config value. Then
iterating through all the code switching HZ usage to MSEC_PER_SEC, etc
would be possible.
> We obviously need to keep the time_before/after/*() interfaces for
> 32bit, unless we decide to limit the uptime for 32-bit machines to
> ~8 years and force reboot them before the counter can overflow :)
>
> On the timer wheel side that means that the base granularity is
> always 1ms, which only affects the maximum timeout. The timer
> expiry is just batched on the actual tick frequency and should not
> have any other side effects except for slightly moving the
> granularity boundaries depending on the tick frequency. But that's
> not any different from the hard coded HZ values.
>
> The other minor change is to make the next timer interrupt
> retrieval for NOHZ round up the next event to the tick boundary,
> but that's trivial enough.
>
> 2) Clock events
>
> Periodic mode is trivial to fix with a tick frequency argument to
> the set_state_periodic() callback.
>
> Oneshot mode just works as it programs the hardware to the next
> closest event. Not much different from the current situation with a
> hard-coded HZ value.
Ack.
> 3) Accounting
>
> The accounting has to be seperated from the jiffies advancement and
> it has to feed the delta to the last tick in nanoseconds into the
> accounting path, which internally operates in nanoseconds already
> today.
I'll take a look at this next, as the accounting has clearly been my
biggest fumbling point. Keeping the per-cpu last tick times is
probably the key change (my main concern is the extra cost of reading
the ns time, but deriving it from 1ms jiffies counter could be an
option), but yeah, I think this sounds like a nice direction. Thank
you for the suggestion!
> 4) Scheduler
>
> I leave that part to Peter as he definitely has a better overview
> of what needs to be done than me.
Yeah, I'll look more into his concerns around the sched/events code.
Thanks again for all the feedback! Much appreciated!
-john
Powered by blists - more mailing lists