Message-ID: <8857d035-1c1a-27dd-35cf-7ff68bbf3119@linutronix.de>
Date: Tue, 25 Jul 2023 15:07:05 +0200 (CEST)
From: Anna-Maria Behnsen <anna-maria@...utronix.de>
To: "Rafael J. Wysocki" <rafael@...nel.org>
cc: Frederic Weisbecker <frederic@...nel.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Peter Zijlstra <peterz@...radead.org>,
linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
"Gautham R. Shenoy" <gautham.shenoy@....com>,
Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Valentin Schneider <vschneid@...hat.com>
Subject: Re: Stopping the tick on a fully loaded system
Hi,
On Mon, 24 Jul 2023, Rafael J. Wysocki wrote:
> On Sun, Jul 23, 2023 at 11:21 PM Frederic Weisbecker
> <frederic@...nel.org> wrote:
> >
> From the governor's perspective, tick_nohz_get_sleep_length() is
> supposed to return a deterministic upper bound on the upcoming idle
> duration (or at least it is used in the governors this way). IOW, it
> is expected that the CPU will not be idle longer than the
> tick_nohz_get_sleep_length() return value.
>
> There are other factors that may cause the governor to predict a
> shorter idle duration and the information coming from the scheduler
> may be regarded as one of them, but they are not deterministic.
Ok. Thanks for this explanation of why two separate checks are required.
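So if I understand it correctly, the pattern is roughly the following -
just a simplified sketch loosely modeled on the menu governor, where
governor_heuristics_ns() is a hypothetical placeholder for the history
based prediction:

	static s64 predict_idle_duration_ns(struct cpuidle_device *dev)
	{
		ktime_t delta_tick;
		/* Deterministic upper bound: the next timer event */
		s64 sleep_length_ns = tick_nohz_get_sleep_length(&delta_tick);
		/* Non-deterministic input: history, scheduler hints, ... */
		s64 predicted_ns = governor_heuristics_ns(dev);

		/* Heuristics may only shorten the prediction, never extend it */
		return min_t(s64, predicted_ns, sleep_length_ns);
	}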
> > > Sure, teo takes scheduler utilization into account directly in the
> > > governor. But for me it is not comprehensible why the CPU utilization
> > > check is done after asking for the possible sleep length where timers are
> > > taken into account. If the CPU is busy anyway, the information generated by
> > > tick_nohz_next_event() is irrelevant.
>
> Why isn't it?
>
> The CPU is idle at that point and it has gone idle for a reason.
> Surely, there was nothing to run on it at the time.
>
> The scheduler only knows that the CPU has been busy recently, which
> may not imply anything on whether or not and when it is going to be
> busy again.
>
I went back one step and simply had a look at current upstream to
understand in more detail what happens when going idle under load. I wanted
to see the distribution of idle period lengths while the tick is stopped. I
used dbench tests with a script provided by Gautham to generate three
different loads: 100% load, 50% load and 25% load. The kernel is configured
with HZ=250. The system has 2 sockets, 64 cores per socket, 2 threads each
-> 256 CPUs. I collected three minutes of trace data, so idle periods
longer than three minutes are not tracked. The governor is teo.
I added tracepoints at the point where the tick is stopped and at the point
where it is started again. I calculated the delta between the timestamps of
those two tracepoints and looked at the distribution:
                      100% load           50% load            25% load
                   (top: ~2% idle)    (top: ~49% idle)    (top: ~74% idle;
                                                           33 CPUs are
                                                           completely idle)
                   ----------------   -----------------   -----------------
Idle Total          1658703  100%      3150522  100%       2377035  100%
x >= 4ms               2504  0.15%           2  0.00%           53  0.00%
4ms > x >= 2ms          390  0.02%           0  0.00%         4563  0.19%
2ms > x >= 1ms           62  0.00%           1  0.00%           54  0.00%
1ms > x >= 500us         67  0.00%           6  0.00%            2  0.00%
500us > x >= 250us       93  0.01%          39  0.00%           11  0.00%
250us > x >= 100us      280  0.02%        1145  0.04%          633  0.03%
100us > x >= 50us       942  0.06%       30722  0.98%        13347  0.56%
50us > x >= 25us      26728  1.61%      310932  9.87%       106083  4.46%
25us > x >= 10us     825920  49.79%    2320683  73.66%     1722505  72.46%
10us > x > 5us       795197  47.94%     442991  14.06%      506008  21.29%
5us > x                6520  0.39%       43994  1.40%        23645  0.99%
In 99% of the tick stops, the idle period is shorter than 50us (with HZ=250
a tick is 4ms, so 50us is 1.25% of a tick length).
This is also the reason for my opinion that the return value of
tick_nohz_next_event() is completely irrelevant in a (fully) loaded case:
the return value is in the range of ticks, and a tick is ~100 times longer
than the duration of the majority of the idle periods.
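For completeness, a minimal sketch of the kind of instrumentation behind
the numbers above - illustrative only, trace_printk() stands in here for
the dedicated tracepoints:

	/* in tick_nohz_stop_tick(struct tick_sched *ts, int cpu): */
	trace_printk("tick stop: cpu=%d now=%lld\n", cpu, ktime_get());

	/* in tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now): */
	trace_printk("tick start: cpu=%d now=%lld\n", smp_processor_id(), now);

The per-CPU delta between a 'tick stop' event and the next 'tick start'
event is the idle period length used for the table above.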
> > > And when the CPU is not busy, then it
> > > makes sense to ask for the sleep length also from a timer perspective.
> > >
> > > When this CPU utilization check is implemented directly inside the
> > > governor, every governor has to implement it on its own. So wouldn't it
> > > make sense to implement a "how utilized is the CPU out of a scheduler
>
> Yes, it would make sense to do that, but I thought that PELT was that
> thing. Wasn't it?
>
> > > perspective" in one place and use this as the first check in
> > > tick_nohz_get_sleep_length()/tick_nohz_next_event()?
[...]
> > As such, calling tmigr_cpu_deactivate() on next tick _evaluation_ time instead of
> > tick _stop_ time is always going to be problematic.
> >
> > Can we fix that and call tmigr_cpu_deactivate() from tick_nohz_stop_tick()
> > instead? This will change the locking scenario a bit because
> > tick_nohz_stop_tick() doesn't hold the base lock. Is it a problem though?
> > In the worst case a remote tick happens and handles the earliest timer
> > for the current CPU while it's between tick_nohz_next_event() and
> > tick_nohz_stop_tick(), but then the current CPU would just propagate
> > an earlier deadline than needed. No big deal.
>
> FWIW, this sounds reasonable to me.
>
The worst case scenario will not happen, because remote timer expiry only
happens when the CPU is not active in the hierarchy - and with your
proposal the CPU only becomes inactive in the hierarchy after
tick_nohz_stop_tick().
Nevertheless, I see some problems with this. But it also depends on whether
the current idle behavior needs to be changed at all. Right now, these are
my concerns:
- The determinism of tick_nohz_next_event() will break: its return value
  will not take into account whether this is the last CPU going idle, which
  then has to take care of remote timers. So the first timer of the CPU
  (regardless of whether it is global or local) has to be handed back even
  if it could be handled by the hierarchy.
- When tmigr_cpu_deactivate() is moved into tick_nohz_stop_tick() and its
  return value is before ts->next_tick, the expiry programmed in
  tick_nohz_stop_tick() has to be modified accordingly (see the rough
  sketch below this list).
- The load is simply moved to a later place - tick_nohz_stop_tick() is
  never called without a preceding tick_nohz_next_event() call. Yes, under
  load tick_nohz_next_event() is called ~8% more often than
  tick_nohz_stop_tick(), but the 'quality' of the return value of
  tick_nohz_next_event() gets worse.
- The timer migration hierarchy is not a standalone timer infrastructure.
  It only makes sense in combination with the existing timer wheel: when
  the timer base is idle, the timer migration hierarchy with its migrators
  does the job for global timers. So I'm not sure about the impact of the
  changed locking - but I'm pretty sure changing it increases the
  probability of ugly races hidden somewhere between the lines.
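To illustrate the second concern, a rough sketch (not a proposal, and
assuming tmigr_cpu_deactivate() keeps its current shape of taking the
CPU's next event and handing back the first event the CPU has to take
care of itself, both as absolute expiries in ns):

	/* in tick_nohz_stop_tick(), before (re)programming the next event: */
	u64 expires = ts->timer_expires;

	/*
	 * Deactivate the CPU in the hierarchy only when the tick is really
	 * stopped. The hierarchy might hand back a first global event which
	 * is earlier than what tick_nohz_next_event() computed ...
	 */
	expires = min_t(u64, expires, tmigr_cpu_deactivate(expires));

	/* ... and then this earlier expiry has to be used for programming. */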
Thanks,
Anna-Maria