Message-ID: <20221220183400.GY4001@paulmck-ThinkPad-P17-Gen-1>
Date: Tue, 20 Dec 2022 10:34:00 -0800
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Waiman Long <longman@...hat.com>
Cc: Feng Tang <feng.tang@...el.com>,
John Stultz <john.stultz@...aro.org>,
Thomas Gleixner <tglx@...utronix.de>,
Stephen Boyd <sboyd@...nel.org>, x86@...nel.org,
Peter Zijlstra <peterz@...radead.org>,
linux-kernel@...r.kernel.org, Tim Chen <tim.c.chen@...el.com>
Subject: Re: [RFC PATCH] clocksource: Suspend the watchdog temporarily when
 high read latency detected
On Tue, Dec 20, 2022 at 11:11:08AM -0500, Waiman Long wrote:
> On 12/20/22 03:25, Feng Tang wrote:
> > There have been bug reports on 8-socket x86 machines of TSC being wrongly
> > disabled when the system is under heavy workload.
> >
> > [ 818.380354] clocksource: timekeeping watchdog on CPU336: hpet wd-wd read-back delay of 1203520ns
> > [ 818.436160] clocksource: wd-tsc-wd read-back delay of 181880ns, clock-skew test skipped!
> > [ 819.402962] clocksource: timekeeping watchdog on CPU338: hpet wd-wd read-back delay of 324000ns
> > [ 819.448036] clocksource: wd-tsc-wd read-back delay of 337240ns, clock-skew test skipped!
> > [ 819.880863] clocksource: timekeeping watchdog on CPU339: hpet read-back delay of 150280ns, attempt 3, marking unstable
> > [ 819.936243] tsc: Marking TSC unstable due to clocksource watchdog
> > [ 820.068173] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> > [ 820.092382] sched_clock: Marking unstable (818769414384, 1195404998)
> > [ 820.643627] clocksource: Checking clocksource tsc synchronization from CPU 267 to CPUs 0,4,25,70,126,430,557,564.
> > [ 821.067990] clocksource: Switched to clocksource hpet
> >
> > This can be reproduced when the system is running the memory-intensive
> > 'stream' test, or some stress-ng subcases like 'ioport'.
> >
> > The reason is that when the system is under heavy load, the clocksource
> > read latency can be very high. This is visible even with the lightweight
> > TSC read, and is much worse for MMIO- or IO-port-based external
> > clocksources, which makes the watchdog check inaccurate.
> >
> > As the clocksource watchdog is a lifetime check that runs twice a
> > second, there is no need to rush it when the system is under heavy
> > load and the clocksource read latency is very high, so suspend the
> > watchdog timer for 5 minutes.
> >
> > Signed-off-by: Feng Tang <feng.tang@...el.com>
> > ---
> > kernel/time/clocksource.c | 45 ++++++++++++++++++++++++++++-----------
> > 1 file changed, 32 insertions(+), 13 deletions(-)
> >
> > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > index 9cf32ccda715..8cd74b89d577 100644
> > --- a/kernel/time/clocksource.c
> > +++ b/kernel/time/clocksource.c
> > @@ -384,6 +384,15 @@ void clocksource_verify_percpu(struct clocksource *cs)
> > }
> > EXPORT_SYMBOL_GPL(clocksource_verify_percpu);
> > +static inline void clocksource_reset_watchdog(void)
> > +{
> > + struct clocksource *cs;
> > +
> > + list_for_each_entry(cs, &watchdog_list, wd_list)
> > + cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
> > +}
> > +
> > +
> > static void clocksource_watchdog(struct timer_list *unused)
> > {
> > u64 csnow, wdnow, cslast, wdlast, delta;
> > @@ -391,6 +400,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > int64_t wd_nsec, cs_nsec;
> > struct clocksource *cs;
> > enum wd_read_status read_ret;
> > + unsigned long extra_wait = 0;
> > u32 md;
> > spin_lock(&watchdog_lock);
> > @@ -410,13 +420,30 @@ static void clocksource_watchdog(struct timer_list *unused)
> > read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
> > - if (read_ret != WD_READ_SUCCESS) {
> > - if (read_ret == WD_READ_UNSTABLE)
> > - /* Clock readout unreliable, so give it up. */
> > - __clocksource_unstable(cs);
> > + if (read_ret == WD_READ_UNSTABLE) {
> > + /* Clock readout unreliable, so give it up. */
> > + __clocksource_unstable(cs);
> > continue;
> > }
> > + /*
> > + * When WD_READ_SKIP is returned, the system is likely under very
> > + * heavy load, where the latency of reading the watchdog/clocksource
> > + * is very high and affects the accuracy of the watchdog check. So
> > + * give the system some breathing room and suspend the watchdog
> > + * check for 5 minutes.
> > + */
> > + if (read_ret == WD_READ_SKIP) {
> > + /*
> > + * As the watchdog timer will be suspended and
> > + * cs->last could stay unchanged for 5 minutes, reset
> > + * the counters.
> > + */
> > + clocksource_reset_watchdog();
> > + extra_wait = HZ * 300;
> > + break;
> > + }
> > +
> > /* Clocksource initialized ? */
> > if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
> > atomic_read(&watchdog_reset_pending)) {
> > @@ -512,7 +539,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > * pair clocksource_stop_watchdog() clocksource_start_watchdog().
> > */
> > if (!timer_pending(&watchdog_timer)) {
> > - watchdog_timer.expires += WATCHDOG_INTERVAL;
> > + watchdog_timer.expires += WATCHDOG_INTERVAL + extra_wait;
> > add_timer_on(&watchdog_timer, next_cpu);
> > }
> > out:
> > @@ -537,14 +564,6 @@ static inline void clocksource_stop_watchdog(void)
> > watchdog_running = 0;
> > }
> > -static inline void clocksource_reset_watchdog(void)
> > -{
> > - struct clocksource *cs;
> > -
> > - list_for_each_entry(cs, &watchdog_list, wd_list)
> > - cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
> > -}
> > -
> > static void clocksource_resume_watchdog(void)
> > {
> > atomic_inc(&watchdog_reset_pending);
>
> It looks reasonable to me. Thanks for the patch.
>
> Acked-by: Waiman Long <longman@...hat.com>
Queued, thank you both!
If you would like this to go in some other way:
Acked-by: Paul E. McKenney <paulmck@...nel.org>
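
As an aside for anyone following along without kernel/time/clocksource.c open:
the patch keys off the WD_READ_SKIP case, which cs_watchdog_read() returns when
the watchdog read-back latency stays too high across its retries (the
"clock-skew test skipped" lines in the log above). Below is a rough userspace
analogue of that decision, not the kernel code itself; the retry count and
latency threshold are made-up stand-ins for illustration.

#include <stdio.h>
#include <stdint.h>
#include <time.h>

/* Made-up stand-ins, not the kernel's actual tunables. */
#define MAX_READ_RETRIES  3
#define MAX_LATENCY_NS    100000ULL	/* 100us */

static uint64_t ref_clock_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Return 1 when the read-back latency stays too high, i.e. "skip". */
static int read_back_too_slow(uint64_t *now_ns)
{
	uint64_t t1, t2;
	int i;

	for (i = 0; i < MAX_READ_RETRIES; i++) {
		t1 = ref_clock_ns();		/* first read */
		t2 = ref_clock_ns();		/* immediate read-back */
		if (t2 - t1 <= MAX_LATENCY_NS) {
			*now_ns = t2;
			return 0;		/* latency OK, comparison is meaningful */
		}
	}
	return 1;				/* consistently slow: skip this round */
}

int main(void)
{
	uint64_t now;

	if (read_back_too_slow(&now))
		printf("read-back too slow, skipping clock-skew check\n");
	else
		printf("reference clock: %llu ns\n", (unsigned long long)now);
	return 0;
}

On a skip, the patch then stretches the next timer expiry by HZ * 300, i.e.
five minutes, instead of the normal WATCHDOG_INTERVAL.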
And while I am remembering it... Any objections to reversing the role of
TSC and the other timers on systems where TSC is believed to be accurate?
So that if there is clocksource skew, HPET is marked unstable rather than
TSC? This would preserve the diagnostics without hammering performance
when skew is detected. (Switching from TSC to HPET hammers performance
enough that our automation usually notices and reboots the system.)
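Purely as a sketch of that idea (the "trusted" flag and all names below are
invented for illustration, and this is ordinary userspace C rather than a
clocksource.c patch): on detected skew, blame the reference/watchdog clock
rather than the clock under test whenever the latter is marked trusted, so the
diagnostics are kept but the system never switches away from TSC.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy model; the "trusted" flag is hypothetical, not an existing kernel flag. */
struct clk {
	const char *name;
	bool trusted;		/* e.g. TSC on platforms believed to be accurate */
	bool unstable;
};

/*
 * Today the watchdog always blames the clock under test (cs). The idea
 * above is to blame the watchdog clock (wd) instead when cs is trusted,
 * preserving the skew report without forcing a switch to a slow clocksource.
 */
static void handle_skew(struct clk *cs, struct clk *wd, long skew_ns, long margin_ns)
{
	if (labs(skew_ns) <= margin_ns)
		return;			/* within the uncertainty margin */

	if (cs->trusted) {
		printf("skew %ldns: marking watchdog '%s' unstable, keeping '%s'\n",
		       skew_ns, wd->name, cs->name);
		wd->unstable = true;
	} else {
		printf("skew %ldns: marking '%s' unstable\n", skew_ns, cs->name);
		cs->unstable = true;
	}
}

int main(void)
{
	struct clk tsc  = { .name = "tsc",  .trusted = true };
	struct clk hpet = { .name = "hpet" };

	handle_skew(&tsc, &hpet, 500000, 100000);	/* 500us skew vs 100us margin */
	return 0;
}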
Thanx, Paul