linux-kernel - Re: [RFC PATCH] clocksource: Suspend the watchdog temporarily when high read lantency detected

Message-ID: <Y6Ja+kYQAi4pppV6@feng-clx>
Date:   Wed, 21 Dec 2022 09:01:46 +0800
From:   Feng Tang <feng.tang@...el.com>
To:     "Paul E. McKenney" <paulmck@...nel.org>
CC:     Waiman Long <longman@...hat.com>, John Stultz <jstultz@...gle.com>,
        "Thomas Gleixner" <tglx@...utronix.de>,
        Stephen Boyd <sboyd@...nel.org>, <x86@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        <linux-kernel@...r.kernel.org>, Tim Chen <tim.c.chen@...el.com>
Subject: Re: [RFC PATCH] clocksource: Suspend the watchdog temporarily when
 high read lantency detected

Using the correct email address for John Stultz.

On Tue, Dec 20, 2022 at 10:34:00AM -0800, Paul E. McKenney wrote:
> On Tue, Dec 20, 2022 at 11:11:08AM -0500, Waiman Long wrote:
> > On 12/20/22 03:25, Feng Tang wrote:
> > > There was a bug reported on 8-socket x86 machines where the TSC was
> > > wrongly disabled when the system was under heavy workload.
> > > 
> > >   [ 818.380354] clocksource: timekeeping watchdog on CPU336: hpet wd-wd read-back delay of 1203520ns
> > >   [ 818.436160] clocksource: wd-tsc-wd read-back delay of 181880ns, clock-skew test skipped!
> > >   [ 819.402962] clocksource: timekeeping watchdog on CPU338: hpet wd-wd read-back delay of 324000ns
> > >   [ 819.448036] clocksource: wd-tsc-wd read-back delay of 337240ns, clock-skew test skipped!
> > >   [ 819.880863] clocksource: timekeeping watchdog on CPU339: hpet read-back delay of 150280ns, attempt 3, marking unstable
> > >   [ 819.936243] tsc: Marking TSC unstable due to clocksource watchdog
> > >   [ 820.068173] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> > >   [ 820.092382] sched_clock: Marking unstable (818769414384, 1195404998)
> > >   [ 820.643627] clocksource: Checking clocksource tsc synchronization from CPU 267 to CPUs 0,4,25,70,126,430,557,564.
> > >   [ 821.067990] clocksource: Switched to clocksource hpet
> > > 
> > > This can be reproduced when the system is running the memory-intensive
> > > 'stream' test, or some stress-ng subcases like 'ioport'.
> > > 
> > > The reason is that when the system is under heavy load, the read
> > > latency of a clocksource can be very high. This can be seen even with
> > > a lightweight TSC read, and is much worse on external clocksources
> > > based on MMIO or IO port reads, which makes the watchdog check inaccurate.
> > > 
> > > As the clocksource watchdog is a lifetime check with a frequency of
> > > twice a second, there is no need to rush doing it when the system
> > > is under heavy load and the clocksource read latency is very high.
> > > In that case, suspend the watchdog timer for 5 minutes.
> > > 
> > > Signed-off-by: Feng Tang <feng.tang@...el.com>
> > > ---
> > >   kernel/time/clocksource.c | 45 ++++++++++++++++++++++++++++-----------
> > >   1 file changed, 32 insertions(+), 13 deletions(-)
> > > 
> > > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > > index 9cf32ccda715..8cd74b89d577 100644
> > > --- a/kernel/time/clocksource.c
> > > +++ b/kernel/time/clocksource.c
> > > @@ -384,6 +384,15 @@ void clocksource_verify_percpu(struct clocksource *cs)
> > >   }
> > >   EXPORT_SYMBOL_GPL(clocksource_verify_percpu);
> > > +static inline void clocksource_reset_watchdog(void)
> > > +{
> > > +	struct clocksource *cs;
> > > +
> > > +	list_for_each_entry(cs, &watchdog_list, wd_list)
> > > +		cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
> > > +}
> > > +
> > > +
> > >   static void clocksource_watchdog(struct timer_list *unused)
> > >   {
> > >   	u64 csnow, wdnow, cslast, wdlast, delta;
> > > @@ -391,6 +400,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > >   	int64_t wd_nsec, cs_nsec;
> > >   	struct clocksource *cs;
> > >   	enum wd_read_status read_ret;
> > > +	unsigned long extra_wait = 0;
> > >   	u32 md;
> > >   	spin_lock(&watchdog_lock);
> > > @@ -410,13 +420,30 @@ static void clocksource_watchdog(struct timer_list *unused)
> > >   		read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
> > > -		if (read_ret != WD_READ_SUCCESS) {
> > > -			if (read_ret == WD_READ_UNSTABLE)
> > > -				/* Clock readout unreliable, so give it up. */
> > > -				__clocksource_unstable(cs);
> > > +		if (read_ret == WD_READ_UNSTABLE) {
> > > +			/* Clock readout unreliable, so give it up. */
> > > +			__clocksource_unstable(cs);
> > >   			continue;
> > >   		}
> > > +		/*
> > > +		 * When WD_READ_SKIP is returned, it means the system is likely
> > > +		 * under very heavy load, where the latency of reading
> > > +		 * watchdog/clocksource is very big, and affect the accuracy of
> > > +		 * watchdog check. So give system some space and suspend the
> > > +		 * watchdog check for 5 minutes.
> > > +		 */
> > > +		if (read_ret == WD_READ_SKIP) {
> > > +			/*
> > > +			 * As the watchdog timer will be suspended, and
> > > +			 * cs->last could keep unchanged for 5 minutes, reset
> > > +			 * the counters.
> > > +			 */
> > > +			clocksource_reset_watchdog();
> > > +			extra_wait = HZ * 300;
> > > +			break;
> > > +		}
> > > +
> > >   		/* Clocksource initialized ? */
> > >   		if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
> > >   		    atomic_read(&watchdog_reset_pending)) {
> > > @@ -512,7 +539,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > >   	 * pair clocksource_stop_watchdog() clocksource_start_watchdog().
> > >   	 */
> > >   	if (!timer_pending(&watchdog_timer)) {
> > > -		watchdog_timer.expires += WATCHDOG_INTERVAL;
> > > +		watchdog_timer.expires += WATCHDOG_INTERVAL + extra_wait;
> > >   		add_timer_on(&watchdog_timer, next_cpu);
> > >   	}
> > >   out:
> > > @@ -537,14 +564,6 @@ static inline void clocksource_stop_watchdog(void)
> > >   	watchdog_running = 0;
> > >   }
> > > -static inline void clocksource_reset_watchdog(void)
> > > -{
> > > -	struct clocksource *cs;
> > > -
> > > -	list_for_each_entry(cs, &watchdog_list, wd_list)
> > > -		cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
> > > -}
> > > -
> > >   static void clocksource_resume_watchdog(void)
> > >   {
> > >   	atomic_inc(&watchdog_reset_pending);
> > 
> > It looks reasonable to me. Thanks for the patch.
> > 
> > Acked-by: Waiman Long <longman@...hat.com>
> 
> Queued, thank you both!

Thanks for reviewing and queueing!
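
For reference, the rescheduling arithmetic in the patch above can be
sketched in user space roughly as follows. This is only an illustration,
not the kernel code: SKETCH_HZ is an assumed tick rate (the real HZ is a
build-time option), and every sketch_* / SKETCH_* name is hypothetical.
The interval mirrors WATCHDOG_INTERVAL (twice a second) and the suspend
period mirrors the patch's extra_wait = HZ * 300.

```c
#include <assert.h>

/* Assumed tick rate for illustration only; the kernel's HZ is configurable. */
#define SKETCH_HZ 250UL
/* Two watchdog checks per second, like WATCHDOG_INTERVAL. */
#define SKETCH_WATCHDOG_INTERVAL (SKETCH_HZ >> 1)
/* Five minutes in ticks, like the patch's extra_wait = HZ * 300. */
#define SKETCH_SUSPEND (SKETCH_HZ * 300UL)

/* Hypothetical stand-in for enum wd_read_status. */
enum sketch_read_status {
	SKETCH_READ_SUCCESS,
	SKETCH_READ_UNSTABLE,
	SKETCH_READ_SKIP,
};

/*
 * Return the next timer expiry in ticks. A skipped read (high read
 * latency) pushes the next check out by the suspend period in addition
 * to the normal interval; any other status keeps the normal interval.
 */
unsigned long sketch_next_expiry(unsigned long expires,
				 enum sketch_read_status status)
{
	unsigned long extra_wait = 0;

	assert(SKETCH_WATCHDOG_INTERVAL > 0);
	if (status == SKETCH_READ_SKIP)
		extra_wait = SKETCH_SUSPEND;

	return expires + SKETCH_WATCHDOG_INTERVAL + extra_wait;
}
```

With the assumed tick rate of 250, a normal pass moves the expiry forward
by 125 ticks, while a skipped read moves it forward by 125 + 75000 ticks.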

> If you would like this to go in some other way:
> 
> Acked-by: Paul E. McKenney <paulmck@...nel.org>
> 
> And while I am remembering it...  Any objections to reversing the role of
> TSC and the other timers on systems where TSC is believed to be accurate?
> So that if there is clocksource skew, HPET is marked unstable rather than
> TSC?

For the bug in the commit log, I think it is the 8-socket system with
hundreds of CPUs that causes the big latency, while the HPET itself may
not be broken. If we switched to ACPI PM_TIMER as the watchdog, we
could see similarly big latency.

I used to see this issue only with stress tools like stress-ng, but
it seems that on larger and larger systems, even a memory-intensive
load can easily trigger it.
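
To make the "skipped" versus "unstable" distinction in the log above more
concrete, here is a simplified user-space sketch of the read-back
classification. It is not the kernel's cs_watchdog_read(): the
sketch_* names, the struct layout, and the threshold passed in by the
caller are all assumptions for illustration. The idea is that a slow
watchdog-clocksource-watchdog read is retried, but if two back-to-back
watchdog reads are themselves slow, the system is probably just busy and
the skew test is skipped instead of blaming the clocksource.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for enum wd_read_status. */
enum sketch_wd_status {
	SKETCH_WD_SUCCESS,
	SKETCH_WD_SKIP,
	SKETCH_WD_UNSTABLE,
};

/* One read attempt, with timestamps already converted to nanoseconds. */
struct sketch_sample {
	uint64_t wd_start_ns;	/* watchdog read before the clocksource read */
	uint64_t wd_end_ns;	/* watchdog read after the clocksource read */
	uint64_t wd_end2_ns;	/* second, back-to-back watchdog read */
};

/*
 * Classify a series of read attempts against an assumed threshold:
 *  - wd-cs-wd delay within max_skew_ns: the read is usable.
 *  - otherwise, wd-wd delay above max_skew_ns / 2: the watchdog itself
 *    is slow to read, so skip the skew test.
 *  - all attempts exhausted: report the readout as unreliable.
 */
enum sketch_wd_status sketch_classify(const struct sketch_sample *s,
				      size_t attempts, uint64_t max_skew_ns)
{
	size_t i;

	assert(s != NULL);
	for (i = 0; i < attempts; i++) {
		uint64_t wd_delay = s[i].wd_end_ns - s[i].wd_start_ns;
		uint64_t wd_seq_delay = s[i].wd_end2_ns - s[i].wd_end_ns;

		if (wd_delay <= max_skew_ns)
			return SKETCH_WD_SUCCESS;
		if (wd_seq_delay > max_skew_ns / 2)
			return SKETCH_WD_SKIP;
	}
	return SKETCH_WD_UNSTABLE;
}
```

Feeding in the delays from the first two log lines (181880ns for
wd-tsc-wd, 1203520ns for wd-wd) with an assumed 100000ns threshold gives
the "skip" result, which is the case the patch turns into a 5 minute
suspension of the watchdog.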

> This would preserve the diagnostics without hammering performance
> when skew is detected.  (Switching from TSC to HPET hammers performance
> enough that our automation usually notices and reboots the system.)

Yes, switching to HPET is a disaster for performance. We've seen
drops of 30% to 90% in different benchmarks.

Thanks,
Feng

> 							Thanx, Paul