linux-kernel - Re: [PATCH v4] clocksource: Scale the max retry number of watchdog read according to CPU numbers

Message-ID: <2996a3b1-33db-4a58-aa69-fd13cdbb1eee@paulmck-laptop>
Date: Tue, 20 Feb 2024 09:47:11 -0800
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Feng Tang <feng.tang@...el.com>
Cc: John Stultz <jstultz@...gle.com>, Thomas Gleixner <tglx@...utronix.de>,
	Stephen Boyd <sboyd@...nel.org>, Jonathan Corbet <corbet@....net>,
	Peter Zijlstra <peterz@...radead.org>,
	Waiman Long <longman@...hat.com>, linux-kernel@...r.kernel.org,
	Jin Wang <jin1.wang@...el.com>
Subject: Re: [PATCH v4] clocksource: Scale the max retry number of watchdog
 read according to CPU numbers

On Tue, Feb 20, 2024 at 11:43:02PM +0800, Feng Tang wrote:
> On one 8-socket server, the TSC was wrongly marked as 'unstable' and
> disabled during boot (reproduced about once in every 120 rounds of
> reboot tests), with this log:
> 
>     clocksource: timekeeping watchdog on CPU227: wd-tsc-wd excessive read-back delay of 153560ns vs. limit of 125000ns,
>     wd-wd read-back delay only 11440ns, attempt 3, marking tsc unstable
>     tsc: Marking TSC unstable due to clocksource watchdog
>     TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
>     sched_clock: Marking unstable (119294969739, 159204297)<-(125446229205, -5992055152)
>     clocksource: Checking clocksource tsc synchronization from CPU 319 to CPUs 0,99,136,180,210,542,601,896.
>     clocksource: Switched to clocksource hpet
> 
> The reason is that on platforms with many CPUs, reads of the
> watchdog/clocksource sporadically see large or huge latencies during
> boot or when the system is under a stressful workload, and both the
> frequency and the maximum value of the latency go up as the number of
> CPUs increases. The current code already has logic to detect and filter
> such high-latency cases by reading the watchdog 3 times and checking
> the 2 deltas. Because the latency is random, there is a low-probability
> case where the first delta (latency) is big but the second delta is
> small and looks valid, which escapes that check. The
> 'max_cswd_read_retries' parameter retries the check to cover this case,
> but its default value is only 2, which may not be enough for machines
> with a huge number of CPUs.
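
For readers following along, the filtering described above can be
modelled in user space roughly as follows. This is a simplified sketch,
not the kernel code: model_watchdog_read() and struct wd_sample are
hypothetical names, the 125000ns limit is taken from the log above, and
the rule that a slow back-to-back watchdog read (more than half the
limit) skips the test is an assumption about the parts of
cs_watchdog_read() that this diff does not show.

```c
#include <stdint.h>

#define WATCHDOG_MAX_SKEW_NS 125000	/* limit printed in the quoted log */

enum wd_read_status { WD_READ_SUCCESS, WD_READ_UNSTABLE, WD_READ_SKIP };

/* One simulated attempt: the two measured read-back delays, in ns. */
struct wd_sample {
	int64_t wd_delay;	/* watchdog -> clocksource -> watchdog */
	int64_t wd_seq_delay;	/* two back-to-back watchdog reads */
};

/*
 * Hypothetical model of the retry loop. The caller must supply at
 * least max_retries + 1 samples, one per possible attempt.
 */
static enum wd_read_status
model_watchdog_read(const struct wd_sample *s, int max_retries)
{
	int nretries;

	for (nretries = 0; nretries <= max_retries; nretries++) {
		/* Read-back was fast enough: the readings can be trusted. */
		if (s[nretries].wd_delay <= WATCHDOG_MAX_SKEW_NS)
			return WD_READ_SUCCESS;
		/* Assumed rule: the watchdog itself was slow, so skip. */
		if (s[nretries].wd_seq_delay > WATCHDOG_MAX_SKEW_NS / 2)
			return WD_READ_SKIP;
		/* Otherwise the first delta was big: try again. */
	}
	/* Every attempt saw a big first delta and a small second one. */
	return WD_READ_UNSTABLE;
}
```

Fed the delays from the log (153560ns and 11440ns) on every attempt,
the model reports unstable with 2 retries. If a later attempt comes in
under the limit, a larger retry budget lets the read succeed instead.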
> 
> So scale and enlarge the max retry number according to the number of
> CPUs to better filter this latency noise on large systems. This has
> been verified in a 4-day reboot test on the 8-socket machine.
> 
> Also, as suggested by Thomas, remove the 'max_cswd_read_retries'
> parameter, which was originally introduced to cover this.
> 
> Signed-off-by: Feng Tang <feng.tang@...el.com>
> Tested-by: Jin Wang <jin1.wang@...el.com>
> Tested-by: Paul E. McKenney <paulmck@...nel.org>
> Reviewed-by: Waiman Long <longman@...hat.com>
> ---
>  
> Hi Paul, Waiman,
> 
> I kept your 'Tested-by' and 'Reviewed-by' tags from v3, as I think the
> core logic of the patch hasn't changed. Please let me know if you
> think otherwise. Thanks!

I retested, and all went well, so please keep my Tested-by.

One nit below...

							Thanx, Paul

> Changelog:
> 
>     since v3:
>       * Remove clocksource's module parameter 'max_cswd_read_retries' (Thomas)
>       * Use "ilog4" instead of ilog2 for max retry calculation; this
>         may be adjusted later (Paul)
> 
>     since v2:
>       * Fix the issue of a kernel module using the helper
>         function's unexported symbol (Waiman)
> 
>     since v1:
>       * Add sanity check for user input value of 'max_cswd_read_retries'
>         and a helper function for getting max retry number (Paul)
>       * Apply the same logic to watchdog test code (Waiman)
> 
> 
>  Documentation/admin-guide/kernel-parameters.txt  |  6 ------
>  include/linux/clocksource.h                      |  1 -
>  kernel/time/clocksource-wdtest.c                 | 13 +++++++------
>  kernel/time/clocksource.c                        | 16 +++++++++++-----
>  .../testing/selftests/rcutorture/bin/torture.sh  |  2 +-
>  5 files changed, 19 insertions(+), 19 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 31b3a25680d0..763e96dcf8b1 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -679,12 +679,6 @@
>  			loops can be debugged more effectively on production
>  			systems.
>  
> -	clocksource.max_cswd_read_retries= [KNL]
> -			Number of clocksource_watchdog() retries due to
> -			external delays before the clock will be marked
> -			unstable.  Defaults to two retries, that is,
> -			three attempts to read the clock under test.
> -
>  	clocksource.verify_n_cpus= [KNL]
>  			Limit the number of CPUs checked for clocksources
>  			marked with CLOCK_SOURCE_VERIFY_PERCPU that
> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> index 1d42d4b17327..b93f18270b5c 100644
> --- a/include/linux/clocksource.h
> +++ b/include/linux/clocksource.h
> @@ -291,7 +291,6 @@ static inline void timer_probe(void) {}
>  #define TIMER_ACPI_DECLARE(name, table_id, fn)		\
>  	ACPI_DECLARE_PROBE_ENTRY(timer, name, table_id, 0, NULL, 0, fn)
>  
> -extern ulong max_cswd_read_retries;
>  void clocksource_verify_percpu(struct clocksource *cs);
>  
>  #endif /* _LINUX_CLOCKSOURCE_H */
> diff --git a/kernel/time/clocksource-wdtest.c b/kernel/time/clocksource-wdtest.c
> index df922f49d171..d1025f956fab 100644
> --- a/kernel/time/clocksource-wdtest.c
> +++ b/kernel/time/clocksource-wdtest.c
> @@ -105,7 +105,7 @@ static int wdtest_func(void *arg)
>  {
>  	unsigned long j1, j2;
>  	char *s;
> -	int i;
> +	int i, max_retries;
>  
>  	schedule_timeout_uninterruptible(holdoff * HZ);
>  
> @@ -139,18 +139,19 @@ static int wdtest_func(void *arg)
>  	WARN_ON_ONCE(time_before(j2, j1 + NSEC_PER_USEC));
>  
>  	/* Verify tsc-like stability with various numbers of errors injected. */
> -	for (i = 0; i <= max_cswd_read_retries + 1; i++) {
> -		if (i <= 1 && i < max_cswd_read_retries)
> +	max_retries = ilog2(num_online_cpus()) / 2 + 1;

Please pull this into a function so that the two calculations of
max_retries are automatically in synchronization with each other.
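
A minimal sketch of such a helper, written for user space so that it
can be compiled and checked on its own. The names cswd_max_retries()
and ilog2_u32() are hypothetical, and the CPU count is passed in as a
parameter here; in the kernel the helper would presumably call
ilog2(num_online_cpus()) directly and live where both callers can
reach it.

```c
/* User-space stand-in for the kernel's ilog2(): floor(log2(n)), n >= 1. */
static unsigned int ilog2_u32(unsigned int n)
{
	unsigned int log = 0;

	while (n >>= 1)
		log++;
	return log;
}

/*
 * Hypothetical helper: the single place that computes the retry limit,
 * using the same formula as the patch (ilog2(ncpus) / 2 + 1).
 */
static int cswd_max_retries(unsigned int ncpus)
{
	return ilog2_u32(ncpus) / 2 + 1;
}
```

With this formula, 1 to 3 CPUs get 1 retry, 4 to 15 CPUs get 2, 16 to
63 CPUs get 3, and 1024 CPUs get 6.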

> +	for (i = 0; i <= max_retries + 1; i++) {
> +		if (i <= 1 && i < max_retries)
>  			s = "";
> -		else if (i <= max_cswd_read_retries)
> +		else if (i <= max_retries)
>  			s = ", expect message";
>  		else
>  			s = ", expect clock skew";
> -		pr_info("--- Watchdog with %dx error injection, %lu retries%s.\n", i, max_cswd_read_retries, s);
> +		pr_info("--- Watchdog with %dx error injection, %d retries%s.\n", i, max_retries, s);
>  		WRITE_ONCE(wdtest_ktime_read_ndelays, i);
>  		schedule_timeout_uninterruptible(2 * HZ);
>  		WARN_ON_ONCE(READ_ONCE(wdtest_ktime_read_ndelays));
> -		WARN_ON_ONCE((i <= max_cswd_read_retries) !=
> +		WARN_ON_ONCE((i <= max_retries) !=
>  			     !(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
>  		wdtest_ktime_clocksource_reset();
>  	}
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index 3052b1f1168e..9def0e39f43a 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -210,9 +210,6 @@ void clocksource_mark_unstable(struct clocksource *cs)
>  	spin_unlock_irqrestore(&watchdog_lock, flags);
>  }
>  
> -ulong max_cswd_read_retries = 2;
> -module_param(max_cswd_read_retries, ulong, 0644);
> -EXPORT_SYMBOL_GPL(max_cswd_read_retries);
>  static int verify_n_cpus = 8;
>  module_param(verify_n_cpus, int, 0644);
>  
> @@ -227,8 +224,17 @@ static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow,
>  	unsigned int nretries;
>  	u64 wd_end, wd_end2, wd_delta;
>  	int64_t wd_delay, wd_seq_delay;
> +	int max_retries;
>  
> -	for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
> +	/*
> +	 * When the system is booting or under heavy workload, there can
> +	 * be random big latencies during clocksource/watchdog reads, so
> +	 * add some retries to filter out the noise. As the latency's
> +	 * frequency and maximum value go up with the number of CPUs,
> +	 * choose the max retry number according to the number of CPUs.
> +	 */
> +	max_retries = ilog2(num_online_cpus()) / 2 + 1;

And here is the other instance to be kept in synchronization.  ;-)

> +	for (nretries = 0; nretries <= max_retries; nretries++) {
>  		local_irq_disable();
>  		*wdnow = watchdog->read(watchdog);
>  		*csnow = cs->read(cs);
> @@ -240,7 +246,7 @@ static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow,
>  		wd_delay = clocksource_cyc2ns(wd_delta, watchdog->mult,
>  					      watchdog->shift);
>  		if (wd_delay <= WATCHDOG_MAX_SKEW) {
> -			if (nretries > 1 || nretries >= max_cswd_read_retries) {
> +			if (nretries > 1 || nretries >= max_retries) {
>  				pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
>  					smp_processor_id(), watchdog->name, nretries);
>  			}
> diff --git a/tools/testing/selftests/rcutorture/bin/torture.sh b/tools/testing/selftests/rcutorture/bin/torture.sh
> index d5a0d8a33c27..bbac5f4b03d0 100755
> --- a/tools/testing/selftests/rcutorture/bin/torture.sh
> +++ b/tools/testing/selftests/rcutorture/bin/torture.sh
> @@ -567,7 +567,7 @@ then
>  	torture_bootargs="rcupdate.rcu_cpu_stall_suppress_at_boot=1 torture.disable_onoff_at_boot rcupdate.rcu_task_stall_timeout=30000 tsc=watchdog"
>  	torture_set "clocksourcewd-1" tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 45s --configs TREE03 --kconfig "CONFIG_TEST_CLOCKSOURCE_WATCHDOG=y" --trust-make
>  
> -	torture_bootargs="rcupdate.rcu_cpu_stall_suppress_at_boot=1 torture.disable_onoff_at_boot rcupdate.rcu_task_stall_timeout=30000 clocksource.max_cswd_read_retries=1 tsc=watchdog"
> +	torture_bootargs="rcupdate.rcu_cpu_stall_suppress_at_boot=1 torture.disable_onoff_at_boot rcupdate.rcu_task_stall_timeout=30000 tsc=watchdog"
>  	torture_set "clocksourcewd-2" tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 45s --configs TREE03 --kconfig "CONFIG_TEST_CLOCKSOURCE_WATCHDOG=y" --trust-make
>  
>  	# In case our work is already done...
> -- 
> 2.34.1
>