Date:   Thu, 22 Dec 2022 19:37:29 -0800
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Feng Tang <feng.tang@...el.com>
Cc:     Waiman Long <longman@...hat.com>, John Stultz <jstultz@...gle.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Stephen Boyd <sboyd@...nel.org>, x86@...nel.org,
        Peter Zijlstra <peterz@...radead.org>,
        linux-kernel@...r.kernel.org, Tim Chen <tim.c.chen@...el.com>
Subject: Re: [RFC PATCH] clocksource: Suspend the watchdog temporarily when
 high read latency detected

On Fri, Dec 23, 2022 at 10:09:04AM +0800, Feng Tang wrote:
> On Thu, Dec 22, 2022 at 01:42:19PM -0800, Paul E. McKenney wrote:
> [...]
> > > > 
> > > > > > Also I've run the patch on an Alderlake system, with a working acpi pm_timer
> > > > > > and with a fake broken pm_timer, and both work without errors.
> > > > > 
> > > > > Thank you!  Did it correctly identify the fake broken pm_timer as being
> > > > > broken?  If so, may I have your Tested-by?
> > > > 
> > > > On that Alderlake system, HPET is disabled by the kernel, and I
> > > > manually increased the TSC frequency by about 1/256 so that pm_timer
> > > > appears to have a 1/256 deviation.  The resulting dmesg is:
> > > > 
> > > > [    2.738554] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'acpi_pm' as unstable because the skew is too large:
> > > > [    2.738558] clocksource:                       'tsc' wd_nsec: 275956624 wd_now: 13aab38d0d wd_last: 1382c1144d mask: ffffffffffffffff
> > > > [    2.738564] clocksource:                       'acpi_pm' cs_nsec: 277034651 cs_now: 731575 cs_last: 63f3cb mask: ffffff
> > > > [    2.738568] clocksource:                       'tsc' (not 'acpi_pm') is current clocksource.
> > > > 
> > > > The deviation is indeed about 1/256, and pm_timer is no longer shown in /sys:
> > > > 
> > > > /sys/devices/system/clocksource/clocksource0/available_clocksource:tsc 
> > > > /sys/devices/system/clocksource/clocksource0/current_clocksource:tsc
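
For reference, the ~1/256 figure follows directly from the wd_nsec/cs_nsec
values in the splat above.  A quick user-space check, with the constants
simply copied from that log:

/*
 * Sanity-check the ~1/256 skew from the intervals reported above.
 * Plain user-space arithmetic, not kernel code.
 */
#include <stdio.h>

int main(void)
{
	const double wd_nsec = 275956624.0;	/* 'tsc' watchdog interval */
	const double cs_nsec = 277034651.0;	/* 'acpi_pm' interval */

	printf("relative skew: %f (1/256 = %f)\n",
	       (cs_nsec - wd_nsec) / wd_nsec, 1.0 / 256.0);
	return 0;
}

This prints a relative skew of about 0.00391, against 1/256 = 0.00390625,
so the injected deviation is reproduced almost exactly.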
> > > > 
> > > > So feel free to add:
> > > > 
> > > > 	Tested-by: Feng Tang <feng.tang@...el.com>
> > > 
> > > Thank you very much!  I will apply this on my next rebase.
> > 
> > But first, here is a prototype of the limited-time clocksource watchdog.
> > Thoughts?
> > 
> > 						Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > commit dfbf67806c4c7f2bdd79cdefe86a2bea6e7afcab
> > Author: Paul E. McKenney <paulmck@...nel.org>
> > Date:   Thu Dec 22 13:21:47 2022 -0800
> > 
> >     clocksource: Permit limited-duration clocksource watchdogging
> >     
> >     Some systems want regular clocksource checking, but their workloads
> >     cannot tolerate the overhead of full-time clocksource checking.
> >     Therefore, add a CLOCKSOURCE_WATCHDOG_DURATION Kconfig option and a
> >     clocksource.watchdog_duration kernel-boot parameter that limit the
> >     clocksource watchdog to the specified number of minutes past boot.
> >     A value of zero disables checking, and a value of -1 restores the
> >     traditional behavior of always checking.
> >     
> >     This does change behavior compared to older kernels, but recent kernels
> >     disable the clocksource watchdog completely in the common case where the
> >     TSC is judged to be trustworthy.  This change in behavior is therefore
> >     not a real issue.
>     
> Yes, this changes the general semantics. Last year, I posted a patch
> to limit the watchdog to running for 10 minutes, and at that time
> Thomas mentioned that one of his machines may show a TSC issue only
> after running for a day, depending on the workload [1].
> 
> As the intention is to validate HPET/PMTIMER, which are not as
> delicate as TSC, maybe we can add a per-clocksource verify-period
> field, and only set it for HPET/PMTIMER?
> 
> [1]. https://lore.kernel.org/lkml/875z286xtk.fsf@nanos.tec.linutronix.de/

Got it.

The workloads I am closest to are OK with the clocksource watchdog
running indefinitely, but thus far the skew has been visible very early.
Then again, broken hardware can do whatever it wants whenever it wants.
I could meet Thomas's request by making the default indefinite and
allowing whoever cares to make it finite.  Or maybe the fact that the
TSC is not marked unstable makes a difference.
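
As for the per-clocksource verify-period idea, here is a toy user-space
model of how I read it.  The field and function names below are made up
purely for illustration and come from no posted patch:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-clocksource verification limit, modeled in user space. */
struct cs_model {
	const char *name;
	int verify_period_min;	/* 0 = watch forever, >0 = stop after N minutes */
};

static bool keep_watching(const struct cs_model *cs, int minutes_since_boot)
{
	return cs->verify_period_min == 0 ||
	       minutes_since_boot < cs->verify_period_min;
}

int main(void)
{
	struct cs_model hpet = { "hpet", 10 };	/* stop checking HPET after 10 minutes */
	struct cs_model tsc  = { "tsc",  0 };	/* keep checking the TSC indefinitely */

	printf("%s at t=30min: %s\n", hpet.name,
	       keep_watching(&hpet, 30) ? "watch" : "skip");
	printf("%s at t=30min: %s\n", tsc.name,
	       keep_watching(&tsc, 30) ? "watch" : "skip");
	return 0;
}

If that matches your intent, HPET and PMTIMER would get a finite period
while the TSC would keep the indefinite checking that Thomas asked for.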

Thoughts?

							Thanx, Paul

> Thanks,
> Feng
> 
> >     Link: https://lore.kernel.org/all/a82092f5-abc8-584f-b2ba-f06c82ffbe7d@redhat.com/
> >     Reported-by: Waiman Long <longman@...hat.com>
> >     Reported-by: Feng Tang <feng.tang@...el.com>
> >     Signed-off-by: Paul E. McKenney <paulmck@...nel.org>
> > 
> > diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
> > index bae8f11070bef..2676e011673d5 100644
> > --- a/kernel/time/Kconfig
> > +++ b/kernel/time/Kconfig
> > @@ -209,5 +209,17 @@ config CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
> >  	  per million.	If the clocksource is good enough for NTP,
> >  	  it is good enough for the clocksource watchdog!
> >  
> > +config CLOCKSOURCE_WATCHDOG_DURATION
> > +	int "Default time to run clocksource watchdog (in minutes)"
> > +	depends on CLOCKSOURCE_WATCHDOG
> > +	range -1 1440
> > +	default 10
> > +	help
> > +	  Specify the default number of minutes that the clocksource
> > +	  watchdog should run after being started.  Specify -1 to run
> > +	  indefinitely or zero to not run at all.  This value may be
> > +	  overridden using the clocksource.watchdog_duration kernel
> > +	  boot parameter.
> > +
> >  endmenu
> >  endif
> > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > index b153fce8c0e4b..c920c6c22e0fb 100644
> > --- a/kernel/time/clocksource.c
> > +++ b/kernel/time/clocksource.c
> > @@ -213,6 +213,9 @@ module_param(max_cswd_read_retries, ulong, 0644);
> >  EXPORT_SYMBOL_GPL(max_cswd_read_retries);
> >  static int verify_n_cpus = 8;
> >  module_param(verify_n_cpus, int, 0644);
> > +static int watchdog_duration = CONFIG_CLOCKSOURCE_WATCHDOG_DURATION;
> > +module_param(watchdog_duration, int, 0444);
> > +static unsigned long watchdog_end_jiffies;
> >  
> >  enum wd_read_status {
> >  	WD_READ_SUCCESS,
> > @@ -549,7 +552,9 @@ static void clocksource_watchdog(struct timer_list *unused)
> >  	 * Arm timer if not already pending: could race with concurrent
> >  	 * pair clocksource_stop_watchdog() clocksource_start_watchdog().
> >  	 */
> > -	if (!timer_pending(&watchdog_timer)) {
> > +	if (!timer_pending(&watchdog_timer) &&
> > +	    (watchdog_duration < 0 ||
> > +	     (watchdog_duration >= 0 && time_before(jiffies, watchdog_end_jiffies)))) {
> >  		watchdog_timer.expires += WATCHDOG_INTERVAL + extra_wait;
> >  		add_timer_on(&watchdog_timer, next_cpu);
> >  	}
> > @@ -559,8 +564,10 @@ static void clocksource_watchdog(struct timer_list *unused)
> >  
> >  static inline void clocksource_start_watchdog(void)
> >  {
> > -	if (watchdog_running || !watchdog || list_empty(&watchdog_list))
> > +	if (watchdog_running || !watchdog || list_empty(&watchdog_list) || !watchdog_duration)
> >  		return;
> > +	if (watchdog_duration > 0)
> > +		watchdog_end_jiffies = jiffies + watchdog_duration * 60 * HZ;
> >  	timer_setup(&watchdog_timer, clocksource_watchdog, 0);
> >  	watchdog_timer.expires = jiffies + WATCHDOG_INTERVAL;
> >  	add_timer_on(&watchdog_timer, cpumask_first(cpu_online_mask));
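
For completeness, with the prototype above the duration can be selected at
build time via CONFIG_CLOCKSOURCE_WATCHDOG_DURATION or overridden on the
kernel command line.  Illustrative values, not output from any real system:

	clocksource.watchdog_duration=30	(run the watchdog for 30 minutes after boot)
	clocksource.watchdog_duration=0		(never start the clocksource watchdog)
	clocksource.watchdog_duration=-1	(traditional behavior: keep checking indefinitely)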
