linux-kernel - Re: [LKP] Re: [clocksource] 6c52b5f3cf: stress-ng.opcode.ops_per

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210422142454.GD975577@paulmck-ThinkPad-P17-Gen-1>
Date:   Thu, 22 Apr 2021 07:24:54 -0700
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Feng Tang <feng.tang@...el.com>
Cc:     Xing Zhengjun <zhengjun.xing@...ux.intel.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        John Stultz <john.stultz@...aro.org>,
        Stephen Boyd <sboyd@...nel.org>,
        Jonathan Corbet <corbet@....net>,
        Mark Rutland <Mark.Rutland@....com>,
        Marc Zyngier <maz@...nel.org>, Andi Kleen <ak@...ux.intel.com>,
        Chris Mason <clm@...com>, LKML <linux-kernel@...r.kernel.org>,
        lkp@...ts.01.org, lkp@...el.com
Subject: Re: [LKP] Re: [clocksource] 6c52b5f3cf: stress-ng.opcode.ops_per_sec
 -14.4% regression

On Thu, Apr 22, 2021 at 03:41:26PM +0800, Feng Tang wrote:
> Hi Paul,
> 
> On Thu, Apr 22, 2021 at 02:58:27PM +0800, Xing Zhengjun wrote:
> > 
> > 
> > On 4/21/2021 9:42 PM, Paul E. McKenney wrote:
> > >On Wed, Apr 21, 2021 at 02:07:19PM +0800, Xing, Zhengjun wrote:
> > >>
> > >>On 4/20/2021 10:05 PM, Paul E. McKenney wrote:
> > >>>On Tue, Apr 20, 2021 at 06:43:31AM -0700, Paul E. McKenney wrote:
> > >>>>On Tue, Apr 20, 2021 at 02:49:34PM +0800, kernel test robot wrote:
> > >>>>>Greeting,
> > >>>>>
> > >>>>>FYI, we noticed a -14.4% regression of stress-ng.opcode.ops_per_sec due to commit:
> > >>>>>
> > >>>>>
> > >>>>>commit: 6c52b5f3cfefd6e429efc4413fd25e3c394e959f ("clocksource: Reduce WATCHDOG_THRESHOLD")
> > >>>>>https://git.kernel.org/cgit/linux/kernel/git/paulmck/linux-rcu.git dev.2021.04.13a
> > >>>>>
> > >>>>>
> > >>>>>in testcase: stress-ng
> > >>>>>on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 192G memory
> > >>>>>with following parameters:
> > >>>>>
> > >>>>>	nr_threads: 10%
> > >>>>>	disk: 1HDD
> > >>>>>	testtime: 60s
> > >>>>>	fs: ext4
> > >>>>>	class: os
> > >>>>>	test: opcode
> > >>>>>	cpufreq_governor: performance
> > >>>>>	ucode: 0x5003006
> > >>>>Hmmm...  I will try a less-aggressive reduction.  Thank you for testing!
> > >>>But wait...  This code is only running twice per second.  It is very
> > >>>hard to believe that a clock-read retry twice per second is worth 2% of
> > >>>performance, let alone 14.4%.
> > >>>
> > >>>Is something else perhaps going on here?
> > >>>
> > >>>For example, did this run enable any of the new diagnositic clocksource.*
> > >>>kernel parameters?
> > >>>
> > >>>								Thanx, Paul
> > >>I attached the kernel log, the following logs are related with the
> > >>clocksource.
> > >>[    3.453206] clocksource: timekeeping watchdog on CPU1: Marking
> > >>clocksource 'tsc-early' as unstable because the skew is too large:
> > >>[    3.455197] clocksource:                       'hpet' wd_now: 288fcc0
> > >>wd_last: 1a8b333 mask: ffffffff
> > >>[    3.455199] clocksource:                       'tsc-early' cs_now:
> > >>1def309ebfdee cs_last: 1def2bd70d92c mask: ffffffffffffffff
> > >>[    3.455201] clocksource:                       No current clocksource.
> > >>[    3.457197] tsc: Marking TSC unstable due to clocksource watchdog
> > >>
> > >>6c52b5f3cf reduced WATCHDOG_THRESHOLD, then in clocksource_watchdog, the
> > >>warning logs are print, the TSC is marked as unstable.
> > >>/* Check the deviation from the watchdog clocksource. */
> > >Aha, so this system really does have an unstable TSC!  Which means that
> > >the patch is operating as designed.
> > >
> > >Or are you saying that this is a false positive?
> > >
> > >							Thanx, Paul
> > 
> > It happened during boot and before TSC calibration
> > (tsc_refine_calibration_work()), so on some machines "abs(cs_nsec - wd_nsec)
> > > WATCHDOG_THRESHOLD", WATCHDOG_THRESHOLD is set too small at that time.
> > After TSC calibrated, abs(cs_nsec - wd_nsec) should be very small,
> > WATCHDOG_THRESHOLD for here is ok. So I suggest increasing the
> > WATCHDOG_THRESHOLD before TSC calibration, for example, the clocks be skewed
> > by more than 1% to be marked unstable.

This is common code, so we do need an architecture-independent way to
handle this.

> As Zhengjun measuered, this is a Cascade Lake platform, and it has 2
> times calibration of tsc, the first one of early quick calibration gives
> 2100 MHz, while the later accurate calibration gives 2095 MHz, so there
> is about 2.5/1000 deviation for the first number, which just exceeds the
> 1/1000 threshold you set :)

Even my 2/1000 initial try would have caused this, then.  ;-)

But even 1/1000 deviation would cause any number of applications some
severe heartburn, so I am not at all happy with the thought of globally
increasing to (say) 3/1000.

> Following is the tsc freq info from kernel log
> 
> [    0.000000] DMI: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
> [    0.000000] tsc: Detected 2100.000 MHz processor
> ...
> [   13.859982] tsc: Refined TSC clocksource calibration: 2095.077 MHz

So what are our options?

1.	Clear CLOCK_SOURCE_MUST_VERIFY from tsc-early.

2.	#1, but add tsc-early into the watchdog list and set
	CLOCK_SOURCE_MUST_VERIFY once it is better calibrated.

3.	Add a field to struct clocksource that, if non-zero, gives
	the maximum drift in nanoseconds per half second (AKA
	WATCHDOG_INTERVAL).  If zero, the WATCHDOG_MAX_SKEW value
	is used.  Set this to (say) 150,000ns for tsc-early.

4.	As noted earlier, increase WATCHDOG_MAX_SKEW to 150 microseconds,
	which again is not a good approach given the real-world needs
	of real-world applications.

5.	Your ideas here.

All in all, I am glad that I made the patch that decreases
WATCHDOG_MAX_SKEW be separate and at the end of the series.  ;-)

							Thanx, Paul