linux-kernel - Re: [PATCH clocksource] Reject bogus watchdog clocksource measurements

Message-ID: <Y2Hc47MqcGiT1lUE@feng-clx>
Date:   Wed, 2 Nov 2022 10:58:43 +0800
From:   Feng Tang <feng.tang@...el.com>
To:     "Paul E. McKenney" <paulmck@...nel.org>
CC:     <linux-kernel@...r.kernel.org>, <clm@...a.com>,
        <jstultz@...gle.com>, <tglx@...utronix.de>, <sboyd@...nel.org>,
        <longman@...hat.com>
Subject: Re: [PATCH clocksource] Reject bogus watchdog clocksource
 measurements

On Tue, Nov 01, 2022 at 12:06:27PM -0700, Paul E. McKenney wrote:
> On Tue, Nov 01, 2022 at 01:43:32PM +0800, Feng Tang wrote:
> > On Mon, Oct 31, 2022 at 10:42:12AM -0700, Paul E. McKenney wrote:
> > 
> > [...]
> > > > > @@ -448,8 +448,26 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > >  			continue;
> > > > >  		}
> > > > >  		if (wd_nsec > (wdi << 2)) {
> > > > 
> > > > Just recalled one thing, that it may be better to check 'cs_nsec' 
> > > > instead of 'wd_nsec', as some watchdog may have small wrap-around
> > > > value. IIRC, HPET's counter is 32 bits long and wraps at about
> > > > 300 seconds, and PMTIMER's counter is 24 bits which wraps at about
> > > > 3 ~ 4 seconds. So when a long stall of the watchdog timer happens,
> > > > the watchdog's value could 'overflow' many times.
> > > > 
> > > > > And usually the 'current' clocksource has a longer wrap time than
> > > > > the watchdog.
> > > 
> > > Why not both?
> > 
> > > You mean checking both the clocksource and the watchdog? That's fine
> > > with me, though I still trust the clocksource more.
> 
> OK, good, I will check both.  You never know what future hardware
> might bring.

Makes sense to me.
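
For reference, the wrap times quoted above follow directly from counter width and tick rate. A minimal sketch, assuming the ACPI PM timer's fixed 3.579545 MHz rate and, for HPET, the common 14.31818 MHz rate (HPET frequency is platform-dependent, so that figure is an assumption); `wrap_period_ns` is a hypothetical helper, not a kernel function:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Nanoseconds until a free-running counter of the given width wraps.
 * Hypothetical helper for illustration only.
 */
static uint64_t wrap_period_ns(unsigned int bits, uint64_t freq_hz)
{
	/* Keep (2^bits * 1e9) inside 64 bits; enough for 24- and 32-bit counters. */
	assert(bits <= 34 && freq_hz != 0);

	return (((uint64_t)1 << bits) * 1000000000ULL) / freq_hz;
}
```

With these inputs, `wrap_period_ns(32, 14318180)` comes out near 300 seconds and `wrap_period_ns(24, 3579545)` near 4.7 seconds, so a stall of only a few seconds is already enough for acpi_pm to wrap.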

> I also reversed the order of the checks, so that it now checks for heavy
> load before too-short interval.  The purpose is to automatically avoid
> being fooled by clock wrap.
> 
> > I checked some old emails and found some long stall logs for reference.
> > 
> > * one stall of 471 seconds
> > 
> >  [ 2410.694068] clocksource: timekeeping watchdog on CPU262: Marking clocksource 'tsc' as unstable because the skew is too large:
> >  [ 2410.706920] clocksource:                       'hpet' wd_nsec: 0 wd_now: ffd70be2 wd_last: 40da633b mask: ffffffff
> >  [ 2410.718583] clocksource:                       'tsc' cs_nsec: 471766594285 cs_now: 44f62c184e9 cs_last: 394a7a43771 mask: ffffffffffffffff
> >  [ 2410.732568] clocksource:                       'tsc' is current clocksource.
> >  [ 2410.740553] tsc: Marking TSC unstable due to clocksource watchdog
> >  [ 2410.747611] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> >  [ 2410.757321] sched_clock: Marking unstable (2398804490960, 11943006672)<-(2419023952548, -8276474713)
> >  [ 2410.767741] clocksource: Checking clocksource tsc synchronization from CPU 233 to CPUs 0,73,93-94,226,454,602,821.
> >  [ 2410.784045] clocksource: Switched to clocksource hpet
> > 
> > 
> > * another one of 5 seconds
> > 
> >  [ 3302.211708] clocksource: timekeeping watchdog on CPU9: Marking clocksource 'tsc' as unstable because the skew is too large:
> >  [ 3302.211710] clocksource:                       'acpi_pm' wd_nsec: 312227950 wd_now: 92367f wd_last: 8128bd mask: ffffff
> >  [ 3302.211712] clocksource:                       'tsc' cs_nsec: 4999196389 cs_now: 9e811223a9754 cs_last: 9e80e767df194 mask: ffffffffffffffff
> >  [ 3302.211714] clocksource:                       'tsc' is current clocksource.
> >  [ 3302.211716] tsc: Marking TSC unstable due to clocksource watchdog
> 
> Very good, thank you!  I believe that both of these would be handled
> by the updated commit (see below for the update).

Yes, I think so too.
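
The acpi_pm log above can be checked by hand. A minimal sketch, assuming the PM timer's 3.579545 MHz rate; `masked_delta`, `cycles_to_ns`, and `hidden_wraps` are hypothetical helpers that mirror the idea of a masked counter delta but not the kernel's mult/shift arithmetic:

```c
#include <assert.h>
#include <stdint.h>

/* Masked cycle delta, so a counter narrower than 64 bits wraps cleanly. */
static uint64_t masked_delta(uint64_t now, uint64_t last, uint64_t mask)
{
	return (now - last) & mask;
}

/* Convert cycles to nanoseconds for a counter ticking at freq_hz. */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t freq_hz)
{
	assert(freq_hz != 0);
	return cycles * 1000000000ULL / freq_hz;
}

/* Nearest whole number of watchdog wraps that would explain the gap. */
static uint64_t hidden_wraps(uint64_t cs_nsec, uint64_t wd_nsec, uint64_t wrap_ns)
{
	assert(wrap_ns != 0 && cs_nsec >= wd_nsec);
	return (cs_nsec - wd_nsec + wrap_ns / 2) / wrap_ns;
}
```

Feeding in `wd_now: 92367f`, `wd_last: 8128bd`, `mask: ffffff` gives 0x110dc2 cycles, or about 312 ms, in line with the logged wd_nsec. The gap to the 5-second cs_nsec is almost exactly one 24-bit wrap, so the watchdog appears to have wrapped once during that stall. In the HPET log the masked delta is 0xbefca8a7, which has its top bit set; the logged wd_nsec: 0 is probably the kernel's clocksource_delta() sanity check treating such a delta as time going backwards, though that is an inference rather than something stated in this thread.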

> 
> > >  		if (wd_nsec > (wdi << 2) || cs_nsec > (wdi << 2)) {
> > > 
> > > > > -			/* This can happen on busy systems, which can delay the watchdog. */
> > > > > -			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > > > +			bool needwarn = false;
> > > > > +			u64 wd_lb;
> > > > > +
> > > > > +			cs->wd_bogus_count++;
> > > > > +			if (!cs->wd_bogus_shift) {
> > > > > +				needwarn = true;
> > > > > +			} else {
> > > > > +				delta = clocksource_delta(wdnow, cs->wd_last_bogus, watchdog->mask);
> > > > > +				wd_lb = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
> > > > > +				if ((1 << cs->wd_bogus_shift) * wdi <= wd_lb)
> > > > > +					needwarn = true;
> > > > 
> > > > > I'm not sure whether we need to check the last_bogus counter, or
> > > > > whether the current interval 'cs_nsec' is all we care about, with
> > > > > some code like this?
> > > 
> > > I thought we wanted exponential backoff?  Do you really get that from
> > > the changes below?
> > 
> > > Aha, I misunderstood your words. I thought the idea was to report only
> > > once for each 2, 4, 8, ... 256 second stall, and after that to report
> > > only stalls of 512+ seconds. So your approach looks good to me, as our
> > > intention is to avoid a flood of warning messages.
> 
> Sounds good, thank you!
> 
> Please see below for a patch to be squashed into the original.
> 
> Thoughts?

It looks good to me, thanks!

- Feng
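
The exponential backoff discussed above can be sketched in user space as follows. The thread only shows the needwarn test; how wd_bogus_shift and wd_last_bogus are updated after a warning is elided here, so the update step and the `max_shift` cap below are assumptions, and `struct bogus_backoff` and `bogus_should_warn` are hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct bogus_backoff {
	unsigned int shift;	/* 0 means no warning has been issued yet */
	uint64_t last_warn_ns;	/* watchdog time of the most recent warning */
};

/* Returns true if this bogus reading should produce a warning. */
static bool bogus_should_warn(struct bogus_backoff *b, uint64_t now_ns,
			      uint64_t wdi, unsigned int max_shift)
{
	bool needwarn = false;

	assert(max_shift < 32);
	if (!b->shift)
		needwarn = true;	/* first bogus reading always warns */
	else if (((uint64_t)1 << b->shift) * wdi <= now_ns - b->last_warn_ns)
		needwarn = true;	/* enough time since the last warning */

	if (needwarn) {
		/* Assumed update step: remember when we warned, double the gap. */
		b->last_warn_ns = now_ns;
		if (b->shift < max_shift)
			b->shift++;
	}
	return needwarn;
}
```

With wdi = 500 ms and a bogus reading on every watchdog pass, this warns at 0 s, 1 s, 3 s, 7 s and so on, with the gap doubling each time until the cap is reached.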

>
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> commit eaee921daa7091f0eb731c9217ccc638ed5f8baf
> Author: Paul E. McKenney <paulmck@...nel.org>
> Date:   Tue Nov 1 12:02:18 2022 -0700
> 
>     squash! clocksource: Exponential backoff for load-induced bogus watchdog reads
>     
>     [ paulmck: Apply Feng Tang feedback. ]
>     
>     Signed-off-by: Paul E. McKenney <paulmck@...nel.org>
> 
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index 6537ffa02e445..de8047b6720f5 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -442,12 +442,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>  
>  		/* Check for bogus measurements. */
>  		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
> -		if (wd_nsec < (wdi >> 2)) {
> -			/* This usually indicates broken timer code or hardware. */
> -			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> -			continue;
> -		}
> -		if (wd_nsec > (wdi << 2)) {
> +		if (wd_nsec > (wdi << 2) || cs_nsec > (wdi << 2)) {
>  			bool needwarn = false;
>  			u64 wd_lb;
>  
> @@ -470,6 +465,12 @@ static void clocksource_watchdog(struct timer_list *unused)
>  			}
>  			continue;
>  		}
> +		/* Check too-short measurements second to handle wrap. */
> +		if (wd_nsec < (wdi >> 2) || cs_nsec < (wdi >> 2)) {
> +			/* This usually indicates broken timer code or hardware. */
> +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> +			continue;
> +		}
>  
>  		/* Check the deviation from the watchdog clocksource. */
>  		md = cs->uncertainty_margin + watchdog->uncertainty_margin;
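
To see why the reordered checks handle both logs quoted earlier in the thread, here is a small user-space sketch of just the two range tests from the patch. It assumes WATCHDOG_INTERVAL is half a second (so wdi is 500000000 ns), which is not stated in this thread; `classify_interval` and the enum are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

enum wd_verdict { WD_OK, WD_TOO_LONG, WD_TOO_SHORT };

/* Apply the patch's two range tests in the patch's order. */
static enum wd_verdict classify_interval(uint64_t wd_nsec, uint64_t cs_nsec,
					 uint64_t wdi)
{
	assert(wdi != 0);

	/* Heavy load first: either clock saw more than 4x the interval. */
	if (wd_nsec > (wdi << 2) || cs_nsec > (wdi << 2))
		return WD_TOO_LONG;

	/* Too-short second, so a wrapped watchdog is not misreported. */
	if (wd_nsec < (wdi >> 2) || cs_nsec < (wdi >> 2))
		return WD_TOO_SHORT;

	return WD_OK;
}
```

With wdi = 500000000, the HPET case (wd_nsec 0, cs_nsec 471766594285) is classified as too long rather than too short, because the heavy-load test runs first. The acpi_pm case (wd_nsec 312227950, cs_nsec 4999196389) would pass both wd_nsec tests on its own and is caught only by the cs_nsec comparison.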