Message-ID: <8a9bed0d-c166-37e9-24c3-8cea7a336c76@redhat.com>
Date: Tue, 20 Dec 2022 22:26:15 -0500
From: Waiman Long <longman@...hat.com>
To: Feng Tang <feng.tang@...el.com>,
"Paul E. McKenney" <paulmck@...nel.org>
Cc: John Stultz <jstultz@...gle.com>,
Thomas Gleixner <tglx@...utronix.de>,
Stephen Boyd <sboyd@...nel.org>, x86@...nel.org,
Peter Zijlstra <peterz@...radead.org>,
linux-kernel@...r.kernel.org, Tim Chen <tim.c.chen@...el.com>
Subject: Re: [RFC PATCH] clocksource: Suspend the watchdog temporarily when
high read lantency detected
On 12/20/22 20:01, Feng Tang wrote:
> Using correct email address of John Stultz.
>
> On Tue, Dec 20, 2022 at 10:34:00AM -0800, Paul E. McKenney wrote:
>> On Tue, Dec 20, 2022 at 11:11:08AM -0500, Waiman Long wrote:
>>> On 12/20/22 03:25, Feng Tang wrote:
>>>> A bug was reported on 8-socket x86 machines where the TSC was wrongly
>>>> disabled when the system was under heavy workload.
>>>>
>>>> [ 818.380354] clocksource: timekeeping watchdog on CPU336: hpet wd-wd read-back delay of 1203520ns
>>>> [ 818.436160] clocksource: wd-tsc-wd read-back delay of 181880ns, clock-skew test skipped!
>>>> [ 819.402962] clocksource: timekeeping watchdog on CPU338: hpet wd-wd read-back delay of 324000ns
>>>> [ 819.448036] clocksource: wd-tsc-wd read-back delay of 337240ns, clock-skew test skipped!
>>>> [ 819.880863] clocksource: timekeeping watchdog on CPU339: hpet read-back delay of 150280ns, attempt 3, marking unstable
>>>> [ 819.936243] tsc: Marking TSC unstable due to clocksource watchdog
>>>> [ 820.068173] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
>>>> [ 820.092382] sched_clock: Marking unstable (818769414384, 1195404998)
>>>> [ 820.643627] clocksource: Checking clocksource tsc synchronization from CPU 267 to CPUs 0,4,25,70,126,430,557,564.
>>>> [ 821.067990] clocksource: Switched to clocksource hpet
>>>>
>>>> This can be reproduced when the system is running the memory-intensive
>>>> 'stream' test, or some stress-ng subcases like 'ioport'.
>>>>
>>>> The reason is that when the system is under heavy load, the read latency
>>>> of a clocksource can be very high. This can be seen even with the
>>>> lightweight TSC read, and it is much worse for external clocksources
>>>> read via MMIO or IO ports, which makes the watchdog check inaccurate.
>>>>
>>>> As the clocksource watchdog is a lifetime check that runs twice a
>>>> second, there is no need to rush it when the system is under heavy
>>>> load and the clocksource read latency is very high. In that case,
>>>> suspend the watchdog timer for 5 minutes.
>>>>
>>>> Signed-off-by: Feng Tang <feng.tang@...el.com>
>>>> ---
>>>> kernel/time/clocksource.c | 45 ++++++++++++++++++++++++++++-----------
>>>> 1 file changed, 32 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
>>>> index 9cf32ccda715..8cd74b89d577 100644
>>>> --- a/kernel/time/clocksource.c
>>>> +++ b/kernel/time/clocksource.c
>>>> @@ -384,6 +384,15 @@ void clocksource_verify_percpu(struct clocksource *cs)
>>>> }
>>>> EXPORT_SYMBOL_GPL(clocksource_verify_percpu);
>>>> +static inline void clocksource_reset_watchdog(void)
>>>> +{
>>>> + struct clocksource *cs;
>>>> +
>>>> + list_for_each_entry(cs, &watchdog_list, wd_list)
>>>> + cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
>>>> +}
>>>> +
>>>> +
>>>> static void clocksource_watchdog(struct timer_list *unused)
>>>> {
>>>> u64 csnow, wdnow, cslast, wdlast, delta;
>>>> @@ -391,6 +400,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>>>> int64_t wd_nsec, cs_nsec;
>>>> struct clocksource *cs;
>>>> enum wd_read_status read_ret;
>>>> + unsigned long extra_wait = 0;
>>>> u32 md;
>>>> spin_lock(&watchdog_lock);
>>>> @@ -410,13 +420,30 @@ static void clocksource_watchdog(struct timer_list *unused)
>>>> read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
>>>> - if (read_ret != WD_READ_SUCCESS) {
>>>> - if (read_ret == WD_READ_UNSTABLE)
>>>> - /* Clock readout unreliable, so give it up. */
>>>> - __clocksource_unstable(cs);
>>>> + if (read_ret == WD_READ_UNSTABLE) {
>>>> + /* Clock readout unreliable, so give it up. */
>>>> + __clocksource_unstable(cs);
>>>> continue;
>>>> }
>>>> + /*
>>>> + * When WD_READ_SKIP is returned, it means the system is likely
>>>> + * under very heavy load, where the latency of reading
>>>> + * watchdog/clocksource is very big, and affect the accuracy of
>>>> + * watchdog check. So give system some space and suspend the
>>>> + * watchdog check for 5 minutes.
>>>> + */
>>>> + if (read_ret == WD_READ_SKIP) {
>>>> + /*
>>>> + * As the watchdog timer will be suspended, and
>>>> + * cs->last could keep unchanged for 5 minutes, reset
>>>> + * the counters.
>>>> + */
>>>> + clocksource_reset_watchdog();
>>>> + extra_wait = HZ * 300;
>>>> + break;
>>>> + }
>>>> +
>>>> /* Clocksource initialized ? */
>>>> if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
>>>> atomic_read(&watchdog_reset_pending)) {
>>>> @@ -512,7 +539,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>>>> * pair clocksource_stop_watchdog() clocksource_start_watchdog().
>>>> */
>>>> if (!timer_pending(&watchdog_timer)) {
>>>> - watchdog_timer.expires += WATCHDOG_INTERVAL;
>>>> + watchdog_timer.expires += WATCHDOG_INTERVAL + extra_wait;
>>>> add_timer_on(&watchdog_timer, next_cpu);
>>>> }
>>>> out:
>>>> @@ -537,14 +564,6 @@ static inline void clocksource_stop_watchdog(void)
>>>> watchdog_running = 0;
>>>> }
>>>> -static inline void clocksource_reset_watchdog(void)
>>>> -{
>>>> - struct clocksource *cs;
>>>> -
>>>> - list_for_each_entry(cs, &watchdog_list, wd_list)
>>>> - cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
>>>> -}
>>>> -
>>>> static void clocksource_resume_watchdog(void)
>>>> {
>>>> atomic_inc(&watchdog_reset_pending);
>>> It looks reasonable to me. Thanks for the patch.
>>>
>>> Acked-by: Waiman Long <longman@...hat.com>
>> Queued, thank you both!
> Thanks for reviewing and queueing!
>
>> If you would like this to go in some other way:
>>
>> Acked-by: Paul E. McKenney <paulmck@...nel.org>
>>
>> And while I am remembering it... Any objections to reversing the role of
>> TSC and the other timers on systems where TSC is believed to be accurate?
>> So that if there is clocksource skew, HPET is marked unstable rather than
>> TSC?
> For the bug in commit log, I think it's the 8 sockets system with
> hundreds of CPUs causing the big latency, while the HPET itself may
> not be broken, and if we switched to ACPI PM_TIMER as watchdog, we
> could see similar big latency.
>
> I used to see this issue only with stress tools like stress-ng, but
> it seems that with larger and larger systems, even a memory-intensive
> load can easily trigger this.
>
>> This would preserve the diagnostics without hammering performance
>> when skew is detected. (Switching from TSC to HPET hammers performance
>> enough that our automation usually notices and reboots the system.)
> Yes, switching to HPET is a disaster for performance, we've seen
> from 30% to 90% drop in different benchmarks.
Switching to HPET is very bad for performance. That is the main reason
I posted clocksource patches in the past: to avoid this switch as much
as possible. I think your patch is also a good countermeasure.
Thanks,
Longman