Date:   Tue, 25 Jul 2023 15:07:05 +0200 (CEST)
From:   Anna-Maria Behnsen <anna-maria@...utronix.de>
To:     "Rafael J. Wysocki" <rafael@...nel.org>
cc:     Frederic Weisbecker <frederic@...nel.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Peter Zijlstra <peterz@...radead.org>,
        linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
        "Gautham R. Shenoy" <gautham.shenoy@....com>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>
Subject: Re: Stopping the tick on a fully loaded system

Hi,

On Mon, 24 Jul 2023, Rafael J. Wysocki wrote:

> On Sun, Jul 23, 2023 at 11:21 PM Frederic Weisbecker
> <frederic@...nel.org> wrote:
> >
> From the governor's perspective, tick_nohz_get_sleep_length() is
> supposed to return a deterministic upper bound on the upcoming idle
> duration (or at least it is used in the governors this way).  IOW, it
> is expected that the CPU will not be idle longer than the
> tick_nohz_get_sleep_length() return value.
> 
> There are other factors that may cause the governor to predict a
> shorter idle duration and the information coming from the scheduler
> may be regarded as one of them, but they are not deterministic.

Ok. Thanks for this explanation of why two separate checks are required.

> > > Sure, teo takes scheduler utilization into account directly in the
> > > governor. But for me it is not comprehensible why the CPU utilization
> > > check is done after asking for the possible sleep length, where timers
> > > are taken into account. If the CPU is busy anyway, the information
> > > generated by tick_nohz_next_event() is irrelevant.
> 
> Why isn't it?
> 
> The CPU is idle at that point and it has gone idle for a reason.
> Surely, there was nothing to run on it.
> 
> The scheduler only knows that the CPU has been busy recently, which
> may not imply anything about whether or when it is going to be
> busy again.
>

I went back one step and simply had a look at current upstream to
understand the behavior when going idle under load in more detail. I wanted
to see the distribution of idle time durations when the tick is stopped. I
used dbench tests with a script provided by Gautham to generate three
different loads: 100% load, 50% load and 25% load. The kernel is configured
with HZ=250. The system has 2 sockets, 64 cores per socket, 2 threads each
-> 256 CPUs. I collected three minutes of trace data, so idle periods
longer than three minutes are not tracked. The governor is teo.

I added tracepoints at the point where the tick is stopped and at the point
where it is started again. I calculated the delta between the timestamps of
those two trace points and had a look at the distribution:


			100% load		50% load		25% load
			(top: ~2% idle)		(top: ~49% idle)	(top: ~74% idle;
									33 CPUs are completely idle)
			---------------		----------------	----------------------------
Idle Total		1658703	100%		3150522	100%		2377035 100%
x >= 4ms		2504	0.15%		2	0.00%		53	0.00%
4ms> x >= 2ms		390	0.02%		0	0.00%		4563	0.19%
2ms > x >= 1ms		62	0.00%		1	0.00%		54	0.00%
1ms > x >= 500us	67	0.00%		6	0.00%		2	0.00%
500us > x >= 250us	93	0.01%		39	0.00%		11	0.00%
250us > x >=100us	280	0.02%		1145	0.04%		633	0.03%
100us > x >= 50us	942	0.06%		30722	0.98%		13347	0.56%
50us > x >= 25us	26728	1.61%		310932	9.87%		106083	4.46%
25us > x >= 10us	825920	49.79%		2320683	73.66%		1722505	72.46%
10us > x > 5us		795197	47.94%		442991	14.06%		506008	21.29%
5us > x			6520	0.39%		43994	1.40%		23645	0.99%


In 99% of the tick stops the idle period is shorter than 50us (with HZ=250
a tick is 4ms, so 50us is only 1.25% of a tick length).

This is also the reason for my opinion that the return value of
tick_nohz_next_event() is completely irrelevant in a (fully) loaded case:
it is in the range of ticks, and a tick is ~100 times longer than the
duration of the majority of the idle periods.
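
Just to make clear what I mean by "delta and distribution": the
post-processing is conceptually a simple delta-and-bucket pass over the two
trace events. Below is a minimal standalone sketch of that idea; the input
format (one "tick-stop timestamp, tick-start timestamp" pair per line, in
ns) is made up for illustration and it is not the script I actually used.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* Bucket upper bounds in ns, matching the table above */
	static const uint64_t bound[] = {
		5000, 10000, 25000, 50000, 100000, 250000,
		500000, 1000000, 2000000, 4000000
	};
	static const char *label[] = {
		"<5us", "<10us", "<25us", "<50us", "<100us", "<250us",
		"<500us", "<1ms", "<2ms", "<4ms", ">=4ms"
	};
	uint64_t count[11] = { 0 }, total = 0;
	unsigned long long stop, start;

	/* One line per idle period: timestamp of the tick stop and
	 * timestamp of the following tick start, both in ns */
	while (scanf("%llu %llu", &stop, &start) == 2) {
		uint64_t delta = start - stop;
		int i = 0;

		while (i < 10 && delta >= bound[i])
			i++;
		count[i]++;
		total++;
	}

	for (int i = 0; i < 11; i++)
		printf("%-7s %10llu %6.2f%%\n", label[i],
		       (unsigned long long)count[i],
		       total ? 100.0 * count[i] / total : 0.0);

	return 0;
}

Fed with the trace deltas, this produces the kind of percentages shown in
the table above.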


> > > And when the CPU is not busy, then it
> > > makes sense to ask for the sleep length also from a timer perspective.
> > >
> > > When this CPU utilization check is implemented directly inside the
> > > governor, every governor has to implement it on its own. So wouldn't it
> > > make sense to implement a "how utilized is the CPU out of a scheduler
> 
> Yes, it would make sense to do that, but I thought that PELT was that
> thing.  Wasn't it?
> 
> > > perspective" in one place and use this as the first check in
> > > tick_nohz_get_sleep_length()/tick_nohz_next_event()?

[...]

> > As such, calling tmigr_cpu_deactivate() on next tick _evaluation_ time instead of
> > tick _stop_ time is always going to be problematic.
> >
> > Can we fix that and call tmigr_cpu_deactivate() from tick_nohz_stop_tick()
> > instead? This will change the locking scenario a bit because
> > tick_nohz_stop_tick() doesn't hold the base lock. Is it a problem though?
> > In the worst case a remote tick happens and handles the earliest timer
> > for the current CPU while it's between tick_nohz_next_event() and
> > tick_nohz_stop_tick(), but then the current CPU would just propagate
> > an earlier deadline than needed. No big deal.
> 
> FWIW, this sounds reasonable to me.
> 

The worst case scenario will not happen, because remote timer expiry only
happens when the CPU is not active in the hierarchy, and with your proposal
this is only the case after tick_nohz_stop_tick().

Nevertheless, I see some problems with this. But this also depends on
whether there is a need to change the current idle behavior or not. Right
now, these are my concerns:

- The determinism of tick_nohz_next_event() breaks: the return value of
  tick_nohz_next_event() will no longer take into account whether this is
  the last CPU going idle, which then has to take care of remote timers. So
  the first timer of the CPU (regardless of whether it is global or local)
  has to be handed back even if it could be handled by the hierarchy.

- When tmigr_cpu_deactivate() is moved to tick_nohz_stop_tick() and its
  return value is before ts->next_tick, the expiry has to be modified in
  tick_nohz_stop_tick() (a toy sketch of what I mean follows after this
  list).

- The load is simply moved to a later place - tick_nohz_stop_tick() is
  never called without a preceding tick_nohz_next_event() call. Yes, under
  load tick_nohz_next_event() is called ~8% more often than
  tick_nohz_stop_tick(), but the 'quality' of the return value of
  tick_nohz_next_event() gets worse.

- The timer migration hierarchy is not a standalone timer infrastructure.
  It only makes sense in combination with the existing timer wheel: when
  the timer base is idle, the timer migration hierarchy with its migrators
  does the job for the global timers. So I'm not sure about the impact of
  the changed locking - but I'm pretty sure changing it increases the
  probability of ugly races hidden somewhere between the lines.
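
To illustrate the second point above, here is a toy userspace model of the
expiry adjustment I mean. All toy_* names and values are made up; this is
neither the actual tick code nor the timer migration series, just the shape
of the min() style fixup which would have to live in tick_nohz_stop_tick().

#include <stdio.h>
#include <stdint.h>

typedef uint64_t u64;

/* Stand-in for the expiry which tick_nohz_next_event() computed earlier
 * (one tick at HZ=250, in ns) */
static u64 toy_timer_expires = 4000000;

/* Stand-in for tmigr_cpu_deactivate(): pretend the hierarchy hands back a
 * global timer which is earlier than the first local timer and which this
 * CPU now has to take care of itself */
static u64 toy_tmigr_cpu_deactivate(u64 nextexp)
{
	u64 first_global = 1500000;

	return first_global < nextexp ? first_global : nextexp;
}

/* Stand-in for tick_nohz_stop_tick(): when the deactivation happens here,
 * the already computed expiry might have to be pulled in */
static void toy_tick_nohz_stop_tick(void)
{
	u64 expires = toy_timer_expires;
	u64 hier = toy_tmigr_cpu_deactivate(expires);

	if (hier < expires)
		expires = hier;

	printf("program the next event for %llu ns\n",
	       (unsigned long long)expires);
}

int main(void)
{
	toy_tick_nohz_stop_tick();

	return 0;
}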

Thanks,

	Anna-Maria
