linux-kernel - Re: [PATCH] clocksource: Scale the max retry number of watchdog read according to CPU numbers

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <3b3d9499-dc7a-4ea0-9615-526940f5fcf2@paulmck-laptop>
Date: Fri, 26 Jan 2024 04:27:19 -0800
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Feng Tang <feng.tang@...el.com>
Cc: John Stultz <jstultz@...gle.com>, Thomas Gleixner <tglx@...utronix.de>,
	Stephen Boyd <sboyd@...nel.org>, linux-kernel@...r.kernel.org,
	Peter Zijlstra <peterz@...radead.org>,
	Waiman Long <longman@...hat.com>, Jin Wang <jin1.wang@...el.com>
Subject: Re: [PATCH] clocksource: Scale the max retry number of watchdog read
 according to CPU numbers

On Fri, Jan 26, 2024 at 05:12:50PM +0800, Feng Tang wrote:
> There was a bug on one 8-socket server that the TSC is wrongly marked as
> 'unstable' and disabled during boot time. (reproduce rate is every 120
> rounds of reboot tests), with log:
> 
>     clocksource: timekeeping watchdog on CPU227: wd-tsc-wd excessive read-back delay of 153560ns vs. limit of 125000ns,
>     wd-wd read-back delay only 11440ns, attempt 3, marking tsc unstable
>     tsc: Marking TSC unstable due to clocksource watchdog
>     TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
>     sched_clock: Marking unstable (119294969739, 159204297)<-(125446229205, -5992055152)
>     clocksource: Checking clocksource tsc synchronization from CPU 319 to CPUs 0,99,136,180,210,542,601,896.
>     clocksource: Switched to clocksource hpet
> 
> The reason is for platform with lots of CPU, there are sporadic big or huge
> read latency of read watchog/clocksource during boot or when system is under
> stress work load, and the frequency and maximum value of the latency goes up
> with the increasing of CPU numbers. Current code already has logic to detect
> and filter such high latency case by reading 3 times of watchdog, and check
> the 2 deltas. Due to the randomness of the latency, there is a low possibility
> situation that the first delta (latency) is big, but the second delta is small
> and looks valid, which can escape from the check, and there is a
> 'max_cswd_read_retries' for retrying that check covering this case, whose
> default value is only 2 and may be not enough for machines with huge number
> of CPUs.
> 
> So scale and enlarge the max retry number according to CPU number to better
> filter those latency noise on large system, which has been verified fine in
> 4 days and 670 rounds of reboot test on the 8-socket machine.
> 
> Tested-by: Jin Wang <jin1.wang@...el.com>
> Signed-off-by: Feng Tang <feng.tang@...el.com>
> ---
>  kernel/time/clocksource.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index c108ed8a9804..f15283101906 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -226,6 +226,15 @@ static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow,
>  	u64 wd_end, wd_end2, wd_delta;
>  	int64_t wd_delay, wd_seq_delay;
>  
> +	/*
> +	 * If no user changes the default value, scale the retry threshold
> +	 * according to CPU numbers. As per test, the more CPU a platform has,
> +	 * the bigger read latency is found during boot time or under stress
> +	 * work load. Increase the try nubmer to reduce false alarms.
> + 	 */
> +	if (max_cswd_read_retries == 2)

I like the self-adjusting behavior!

But why not make max_cswd_read_retries be a long (instead of a ulong)
defaulting to -1.  Then when someone sets it to a specific positive
value, they get exactly that value, while leaving it -1 gets the default
CPU-scaling behavior.  Zero and other negative values should get a
warning in order to reserve the for possible future use.

I also suggest doing the adjustment at boot time, for example, using
an early_initcall().  That way the test code also sees the scaled value.

							Thanx, Paul

> +		max_cswd_read_retries = max(2, ilog2(num_online_cpus()));
> +
>  	for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
>  		local_irq_disable();
>  		*wdnow = watchdog->read(watchdog);
> -- 
> 2.34.1
>