linux-kernel - Re: [PATCH 1/2] clocksource: Avoid accidental unstable marking of clocksources

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20211112054417.GA29845@shbuild999.sh.intel.com>
Date:   Fri, 12 Nov 2021 13:44:17 +0800
From:   Feng Tang <feng.tang@...el.com>
To:     "Paul E. McKenney" <paulmck@...nel.org>
Cc:     Waiman Long <longman@...hat.com>,
        John Stultz <john.stultz@...aro.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Stephen Boyd <sboyd@...nel.org>, linux-kernel@...r.kernel.org,
        Peter Zijlstra <peterz@...radead.org>,
        Cassio Neri <cassio.neri@...il.com>,
        Linus Walleij <linus.walleij@...aro.org>,
        Colin Ian King <colin.king@...onical.com>,
        Frederic Weisbecker <frederic@...nel.org>
Subject: Re: [PATCH 1/2] clocksource: Avoid accidental unstable marking of
 clocksources

On Thu, Nov 11, 2021 at 06:43:11AM -0800, Paul E. McKenney wrote:
> On Thu, Nov 11, 2021 at 12:57:03PM +0800, Feng Tang wrote:
> > On Wed, Nov 10, 2021 at 05:17:31PM -0500, Waiman Long wrote:
> > > Since commit db3a34e17433 ("clocksource: Retry clock read if long delays
> > > detected") and commit 2e27e793e280 ("clocksource: Reduce clocksource-skew
> > > threshold"), it is found that tsc clocksource fallback to hpet can
> > > sometimes happen on both Intel and AMD systems especially when they are
> > > running stressful benchmarking workloads. Of the 23 systems tested with
> > > a v5.14 kernel, 10 of them have switched to hpet clock source during
> > > the test run.
> > > 
> > > The result of falling back to hpet is a drastic reduction of performance
> > > when running benchmarks. For example, the fio performance tests can
> > > drop up to 70% whereas the iperf3 performance can drop up to 80%.
> > > 
> > > 4 hpet fallbacks happened during bootup. They were:
> > > 
> > >   [    8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable
> > >   [   12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable
> > >   [   17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable
> > >   [   17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable
> > > 
> > > Other fallbacks happen when the systems were running stressful
> > > benchmarks. For example:
> > > 
> > >   [ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable
> > >   [46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable
> > > 
> > > Commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"),
> > > changed the skew margin from 100us to 50us. I think this is too small
> > > and can easily be exceeded when running some stressful workloads on
> > > a thermally stressed system.  So it is switched back to 100us. On
> > > the other hand, it doesn't look like we need to increase the minimum
> > > uncertainty margin. So it is kept the same at 100us too.
> > > 
> > > Even a maximum skew margin of 100us may be too small in for some systems
> > > when booting up especially if those systems are under thermal stress. To
> > > eliminate the case that the large skew is due to the system being too
> > > busy slowing down the reading of both the watchdog and the clocksource,
> > > a final check is done by reading watchdog time again and comparing the
> > > consecutive watchdog timing read delay against WATCHDOG_MAX_SKEW/2. If
> > > that delay exceeds the limit, we assume that the system is just too
> > > busy. A warning will be printed to the console and the watchdog check
> > > is then skipped for this round. For example:
> > > 
> > >   [    8.789316] clocksource: timekeeping watchdog on CPU13: hpet consecutive read-back delay of 174541ns, system too busy
> >  
> > 
> > I think it may be better to add more details about the root cause, other
> > than that it looks good to me, as we tested similar patch on our test
> > platforms.
> > 
> > Reviewed-by: Feng Tang <feng.tang@...el.com>
> 
> Thank you both!
> 
> I agree on the bit about root cause.  Would it make sense to compare the
> difference between HPET reads 1 and 2 (containing the read of the TSC)
> and the difference between HPET reads 2 and 3?  If the 2-1 difference was
> no more than (say) 8/7ths of the 3-2 difference, or the 2-1 difference
> was no more than (say) 20 microseconds more than the 3-2 difference,
> this could be considered a good-as-it-gets read, ending the retry loop.
> Then if the 3-1 difference was greater than the default (100 microseconds
> in current -rcu), that difference could be substituted for that particular
> clocksource watchdog check.  With a console message noting the unusually
> high overhead (but not a splat).
> 
> So if it took 75 microseconds for each HPET read and 1 microsecond for
> the TSC read, then 226 microseconds would be substituted for the default
> of 100 microseconds for that cycle's skew cutoff.  Unless the previous
> skew cutoff was larger, in which case the previous cutoff should be
> used instead.  Either way, the current cutoff is recorded for comparison
> for the next clocksource watchdog check.
> 
> If the 3-1 difference was greater than 62.5 milliseconds, a warning should
> probably be emitted anyway.
 
I can test the patch with our cases that could reproduce the problem.

> Or did you have something else in mind?

I'm not sure the detail in  Waiman's cases, and in our cases (stress-ng)
the delay between watchdog's (HPET here) read were not linear, that
from debug data, sometimes the 3-2 difference could be bigger or much
bigger than the 2-1 difference.

The reason could be the gap between 2 reads depends hugely on the system
pressure at that time that 3 HPET read happens. On our test box (a
2-Socket Cascade Lake AP server), the 2-1 and 3-2 difference are stably
about 2.5 us,  while under the stress it could be bumped to from 6 us
to 2800 us.

So I think checking the 3-2 difference plus increasing the max retries
to 10 may be a simple way, if the watchdog read is found to be
abnormally long, we skip this round of check.

Thanks,
Feng


> 
> 							Thanx, Paul