linux-kernel - Re: Stopping the tick on a fully loaded system

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJZ5v0hLprrBhfqMRUhStvmm3D_xaSLxmNOYB4sfhLSzLYeR-w@mail.gmail.com>
Date:   Wed, 26 Jul 2023 20:30:01 +0200
From:   "Rafael J. Wysocki" <rafael@...nel.org>
To:     Anna-Maria Behnsen <anna-maria@...utronix.de>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        "Rafael J. Wysocki" <rafael@...nel.org>,
        Frederic Weisbecker <frederic@...nel.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
        "Gautham R. Shenoy" <gautham.shenoy@....com>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>
Subject: Re: Stopping the tick on a fully loaded system

On Wed, Jul 26, 2023 at 6:40 PM Anna-Maria Behnsen
<anna-maria@...utronix.de> wrote:
>
> Hi,
>
> On Wed, 26 Jul 2023, Peter Zijlstra wrote:
>
> > On Tue, Jul 25, 2023 at 04:27:56PM +0200, Rafael J. Wysocki wrote:
> > > On Tue, Jul 25, 2023 at 3:07 PM Anna-Maria Behnsen
> >
> > > >                         100% load               50% load                25% load
> > > >                         (top: ~2% idle)         (top: ~49% idle)        (top: ~74% idle;
> > > >                                                                         33 CPUs are completely idle)
> > > >                         ---------------         ----------------        ----------------------------
> > > > Idle Total              1658703 100%            3150522 100%            2377035 100%
> > > > x >= 4ms                2504    0.15%           2       0.00%           53      0.00%
> > > > 4ms> x >= 2ms           390     0.02%           0       0.00%           4563    0.19%
> > > > 2ms > x >= 1ms          62      0.00%           1       0.00%           54      0.00%
> > > > 1ms > x >= 500us        67      0.00%           6       0.00%           2       0.00%
> > > > 500us > x >= 250us      93      0.01%           39      0.00%           11      0.00%
> > > > 250us > x >=100us       280     0.02%           1145    0.04%           633     0.03%
> > > > 100us > x >= 50us       942     0.06%           30722   0.98%           13347   0.56%
> > > > 50us > x >= 25us        26728   1.61%           310932  9.87%           106083  4.46%
> > > > 25us > x >= 10us        825920  49.79%          2320683 73.66%          1722505 72.46%
> > > > 10us > x > 5us          795197  47.94%          442991  14.06%          506008  21.29%
> > > > 5us > x                 6520    0.39%           43994   1.40%           23645   0.99%
> > > >
> > > >
> > > > 99% of the tick stops only have an idle period shorter than 50us (50us is
> > > > 1,25% of a tick length).
> > >
> > > Well, this just means that the governor predicts overly long idle
> > > durations quite often under this workload.
> > >
> > > The governor's decision on whether or not to stop the tick is based on
> > > its idle duration prediction.  If it overshoots, that's how it goes.
> >
> > This is abysmal; IIRC TEO tracks a density function in C state buckets
> > and if it finds it's more likely to be shorter than 'predicted' by the
> > timer it should pick something shallower.
> >
> > Given we have this density function, picking something that's <1% likely
> > is insane. In fact, it seems to suggest the whole pick-alternative thing
> > is utterly broken.
> >
>
> When I tried to understand the cstates, I noticed cstates have been
> disabled on the zen3 machine I used for testing - I'm sorry, pilot error.
>
> So the numbers above are caused by calling tick_nohz_idle_stop_tick()
> unconditionally in cpuidle_idle_call() when cpuidle_not_available() is
> true.
>
> The regression Gautham observed was then caused by tons of
> tick_nohz_next_event() calls, which are more expensive with the current
> implementation of the timer migration hierarchy (if he tested with cstates
> enabled...).
>
> Nevertheless, I rerun the tests on current upstream with cstates enabled on
> zen3 machine and on a SKL-X with governor teo, menu and ladder and
> generated the following numbers (100% load):
>
> Zen3:
>                         teo                     menu                    ladder
>                         ------------------      ------------------      ------------------
> Idle Total              2533    100.00%         5123    100.00%         1333746 100.00%
> x >= 4ms                1458    57.56%          2764    53.95%          2304    0.17%
> 4ms> x >= 2ms           91      3.59%           95      1.85%           98      0.01%
> 2ms > x >= 1ms          56      2.21%           66      1.29%           57      0.00%
> 1ms > x >= 500us        64      2.53%           74      1.44%           61      0.00%
> 500us > x >= 250us      73      2.88%           39      0.76%           69      0.01%
> 250us > x >=100us       76      3.00%           88      1.72%           502     0.04%
> 100us > x >= 50us       33      1.30%           104     2.03%           3976    0.30%
> 50us > x >= 25us        39      1.54%           289     5.64%           64463   4.83%
> 25us > x >= 10us        199     7.86%           830     16.20%          1245946 93.42%
> 10us > x > 5us          156     6.16%           231     4.51%           9452    0.71%
> 5us > x                 288     11.37%          543     10.60%          6818    0.51%
>
> tick_nohz_next_event()
> total count             8839790                 2113357                 1363896
>
>
>
> SKL-X:
>                         teo                     menu                    ladder
>                         ------------------      ------------------      ------------------
> Idle Total              2388    100.00%         2047    100.00%         693514  100.00%
> x >= 4ms                2047    85.72%          1347    65.80%          1141    0.16%
> 4ms> x >= 2ms           29      1.21%           47      2.30%           18      0.00%
> 2ms > x >= 1ms          20      0.84%           9       0.44%           10      0.00%
> 1ms > x >= 500us        21      0.88%           17      0.83%           10      0.00%
> 500us > x >= 250us      15      0.63%           26      1.27%           9       0.00%
> 250us > x >=100us       67      2.81%           39      1.91%           24      0.00%
> 100us > x >= 50us       18      0.75%           26      1.27%           17      0.00%
> 50us > x >= 25us        15      0.63%           28      1.37%           2141    0.31%
> 25us > x >= 10us        31      1.30%           61      2.98%           108208  15.60%
> 10us > x > 5us          37      1.55%           195     9.53%           242809  35.01%
> 5us > x                 88      3.69%           252     12.31%          339127  48.90%
>
> tick_nohz_next_event()
> total count             2317973                 2481724                 701069

This looks better indeed.  What's the HZ value?

> With this (and hopefully without another pilot error), I see the following
> 'open points' where improvement or more thoughts might be good:
>
> - Without cstates enabled, it is possible to change the cpuidle governors
>   even if they do not have an impact on idle behavior but at the first
>   glance it looks like cpuidle governors are used. Is this behavior
>   intended?

Not really intended and it can be improved.  I guess it's this way,
because there were no complaints. :-)

> - When there is no cpuidle driver, tick_nohz_idle_stop_tick() is called
>   unconditionally - is there the possibility to make an easy check whether
>   the CPU is loaded?

PELT could be asked like in the teo governor.  Or it may be a better
idea to simply never stop the tick in that case (the default idle
state is going to be shallow anyway, so running the tick shouldn't
matter from the energy POV).

> - The governors teo and menu do the tick_nohz_next_event() check even if
>   the CPU is fully loaded and but the check is not for free.

Let me have a loot at teo in that respect.

The problem is when tick_nohz_get_sleep_length() should not be called.
The easy case is when the governor would select the shallowest idle
state without taking it into account, but what about the deeper ones?
I guess this depends on the exit latency of the current candidate idle
state, but what exit latency would be low enough?  I guess 2 us would
be fine, but what about 10 us, or even 20 us for that matter?

> - timer bases are marked idle in tick_nohz_next_event() when the next
>   expiry is more than a tick away. But when the tick can not be stopped,
>   because CPU is loaded and timer base is alreay marked idle, a remote
>   timer enqueue before clearing timer base idle information will lead to a
>   IPI which is also expensive.
>
>   It might be worth a try to do only a (maybe leaner) check for the next
>   timer in tick_nohz_next_event() and do the actual idle dance in
>   tick_nohz_stop_tick(). When a timer is enqueued remote between
>   tick_nohz_next_event() and tick_nohz_stop_tick() call, there is no need
>   for an IPI - CPU might be prevented from stopping the tick. This is also
>   the case at the moment and only solved by an IPI after tick is already
>   stopped.
>
>   With regard to the timer migration hierarchy, there might be the
>   possibility to do also a quick check in tick_nohz_next_event() and do the
>   final tmigr_cpu_deactivate() thing when stopping the tick and marking the
>   timer bases idle. So no lock ordering change would be required here...
>
> - Side note: When testing 'ladder' on SKL-X machine there was a strange
>   pattern: All CPUs on the second socket, stopped the tick quite often
>   (~12000) and all of the idle durations were below 50us. All CPUs on the
>   first socket stopped the tick max ~100 times and most of the idle
>   durations were longer than 4ms (HZ=250).

I don't really have a quick explanation for that.  It's probably due
to the work distribution pattern.

Not that I think that ladder is interesting any more, TBH.