Message-ID: <6fb04ee9-ce77-4835-2ad1-b7f8419cfb77@redhat.com>
Date: Tue, 20 Dec 2022 11:11:08 -0500
From: Waiman Long <longman@...hat.com>
To: Feng Tang <feng.tang@...el.com>,
John Stultz <john.stultz@...aro.org>,
Thomas Gleixner <tglx@...utronix.de>,
Stephen Boyd <sboyd@...nel.org>, x86@...nel.org,
Peter Zijlstra <peterz@...radead.org>,
"Paul E . McKenney" <paulmck@...nel.org>
Cc: linux-kernel@...r.kernel.org, Tim Chen <tim.c.chen@...el.com>
Subject: Re: [RFC PATCH] clocksource: Suspend the watchdog temporarily when
 high read latency detected

On 12/20/22 03:25, Feng Tang wrote:
> There were bug reports on 8-socket x86 machines that TSC was wrongly
> disabled when the system is under heavy workload.
>
> [ 818.380354] clocksource: timekeeping watchdog on CPU336: hpet wd-wd read-back delay of 1203520ns
> [ 818.436160] clocksource: wd-tsc-wd read-back delay of 181880ns, clock-skew test skipped!
> [ 819.402962] clocksource: timekeeping watchdog on CPU338: hpet wd-wd read-back delay of 324000ns
> [ 819.448036] clocksource: wd-tsc-wd read-back delay of 337240ns, clock-skew test skipped!
> [ 819.880863] clocksource: timekeeping watchdog on CPU339: hpet read-back delay of 150280ns, attempt 3, marking unstable
> [ 819.936243] tsc: Marking TSC unstable due to clocksource watchdog
> [ 820.068173] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> [ 820.092382] sched_clock: Marking unstable (818769414384, 1195404998)
> [ 820.643627] clocksource: Checking clocksource tsc synchronization from CPU 267 to CPUs 0,4,25,70,126,430,557,564.
> [ 821.067990] clocksource: Switched to clocksource hpet
>
> This can be reproduced when the system is running the memory-intensive
> 'stream' test, or some stress-ng subcases like 'ioport'.
>
> The reason is that when the system is under heavy load, the read latency
> of a clocksource can be very high. This can be seen even with a
> lightweight TSC read, and is much worse for external clocksources read
> via MMIO or I/O ports, which makes the watchdog check inaccurate.
>
> As the clocksource watchdog is a lifetime check that runs twice a
> second, there is no need to rush it when the system is under heavy
> load and the clocksource read latency is very high. In that case,
> suspend the watchdog timer for 5 minutes.
>
> Signed-off-by: Feng Tang <feng.tang@...el.com>
> ---
> kernel/time/clocksource.c | 45 ++++++++++++++++++++++++++++-----------
> 1 file changed, 32 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index 9cf32ccda715..8cd74b89d577 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -384,6 +384,15 @@ void clocksource_verify_percpu(struct clocksource *cs)
>  }
>  EXPORT_SYMBOL_GPL(clocksource_verify_percpu);
>
> +static inline void clocksource_reset_watchdog(void)
> +{
> +	struct clocksource *cs;
> +
> +	list_for_each_entry(cs, &watchdog_list, wd_list)
> +		cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
> +}
> +
> +
>  static void clocksource_watchdog(struct timer_list *unused)
>  {
>  	u64 csnow, wdnow, cslast, wdlast, delta;
> @@ -391,6 +400,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>  	int64_t wd_nsec, cs_nsec;
>  	struct clocksource *cs;
>  	enum wd_read_status read_ret;
> +	unsigned long extra_wait = 0;
>  	u32 md;
>
>  	spin_lock(&watchdog_lock);
> @@ -410,13 +420,30 @@ static void clocksource_watchdog(struct timer_list *unused)
>
>  		read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
>
> -		if (read_ret != WD_READ_SUCCESS) {
> -			if (read_ret == WD_READ_UNSTABLE)
> -				/* Clock readout unreliable, so give it up. */
> -				__clocksource_unstable(cs);
> +		if (read_ret == WD_READ_UNSTABLE) {
> +			/* Clock readout unreliable, so give it up. */
> +			__clocksource_unstable(cs);
>  			continue;
>  		}
>
> +		/*
> +		 * When WD_READ_SKIP is returned, it means the system is likely
> +		 * under very heavy load, where the latency of reading
> +		 * watchdog/clocksource is very big, and affect the accuracy of
> +		 * watchdog check. So give system some space and suspend the
> +		 * watchdog check for 5 minutes.
> +		 */
> +		if (read_ret == WD_READ_SKIP) {
> +			/*
> +			 * As the watchdog timer will be suspended, and
> +			 * cs->last could keep unchanged for 5 minutes, reset
> +			 * the counters.
> +			 */
> +			clocksource_reset_watchdog();
> +			extra_wait = HZ * 300;
> +			break;
> +		}
> +
>  		/* Clocksource initialized ? */
>  		if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
>  		    atomic_read(&watchdog_reset_pending)) {
> @@ -512,7 +539,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>  	 * pair clocksource_stop_watchdog() clocksource_start_watchdog().
>  	 */
>  	if (!timer_pending(&watchdog_timer)) {
> -		watchdog_timer.expires += WATCHDOG_INTERVAL;
> +		watchdog_timer.expires += WATCHDOG_INTERVAL + extra_wait;
>  		add_timer_on(&watchdog_timer, next_cpu);
>  	}
>  out:
> @@ -537,14 +564,6 @@ static inline void clocksource_stop_watchdog(void)
>  	watchdog_running = 0;
>  }
>
> -static inline void clocksource_reset_watchdog(void)
> -{
> -	struct clocksource *cs;
> -
> -	list_for_each_entry(cs, &watchdog_list, wd_list)
> -		cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
> -}
> -
>  static void clocksource_resume_watchdog(void)
>  {
>  	atomic_inc(&watchdog_reset_pending);

It looks reasonable to me. Thanks for the patch.

Acked-by: Waiman Long <longman@...hat.com>
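
For readers following the thread, the decision flow the patch adds to clocksource_watchdog() can be modeled in plain userspace C. This is only a sketch under stated assumptions, not kernel code: HZ is assumed to be 250, and struct wd_decision, wd_decide(), wd_next_expiry() and WD_SUSPEND_JIFFIES are hypothetical names introduced for illustration. Only the enum values, the twice-a-second WATCHDOG_INTERVAL and the HZ * 300 delay are taken from the patch and its description.

```c
#include <stdbool.h>

/* Illustrative userspace model of the patched watchdog logic. */
#define HZ                 250UL        /* assumption: tick rate of the build */
#define WATCHDOG_INTERVAL  (HZ >> 1)    /* watchdog runs twice a second */
#define WD_SUSPEND_JIFFIES (HZ * 300)   /* 5 minutes; hypothetical name */

enum wd_read_status {
	WD_READ_SUCCESS,
	WD_READ_UNSTABLE,
	WD_READ_SKIP,
};

/* Hypothetical summary of what the watchdog does for one read result. */
struct wd_decision {
	bool mark_unstable;       /* would call __clocksource_unstable() */
	bool reset_baseline;      /* would call clocksource_reset_watchdog() */
	bool stop_scan;           /* would break out of the clocksource loop */
	unsigned long extra_wait; /* extra jiffies before the next check */
};

static struct wd_decision wd_decide(enum wd_read_status read_ret)
{
	struct wd_decision d = { false, false, false, 0 };

	switch (read_ret) {
	case WD_READ_UNSTABLE:
		/* Readout unreliable: give up on this clocksource. */
		d.mark_unstable = true;
		break;
	case WD_READ_SKIP:
		/*
		 * Read latency too high to judge skew: drop the stale
		 * baseline and back off for 5 minutes.
		 */
		d.reset_baseline = true;
		d.stop_scan = true;
		d.extra_wait = WD_SUSPEND_JIFFIES;
		break;
	case WD_READ_SUCCESS:
		/* Proceed with the normal skew comparison. */
		break;
	}
	return d;
}

/* Mirrors "expires += WATCHDOG_INTERVAL + extra_wait" from the patch. */
static unsigned long wd_next_expiry(unsigned long expires,
				    struct wd_decision d)
{
	return expires + WATCHDOG_INTERVAL + d.extra_wait;
}
```

With these assumed values, a skipped read pushes the next check out by 125 + 75000 jiffies, while a successful read keeps the usual 125-jiffy interval.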