Date:   Mon, 18 Oct 2021 18:41:42 +0800
From:   yanghui <>
To:     John Stultz <>
Cc:     Thomas Gleixner <>,
        Stephen Boyd <>,
        lkml <>,
Subject: [PATCH] Clocksource: Avoid misjudgment of clocksource

On 2021/10/12 at 1:02 PM, John Stultz wrote:
> On Sat, Oct 9, 2021 at 2:02 AM yanghui <> wrote:
>> On 2021/10/9 at 11:38 AM, John Stultz wrote:
>>> On Fri, Oct 8, 2021 at 8:22 PM yanghui <> wrote:
>>>> On 2021/10/9 at 7:45 AM, John Stultz wrote:
>>>>> On Fri, Oct 8, 2021 at 1:03 AM yanghui <> wrote:
>>>>>> clocksource_watchdog is executed every WATCHDOG_INTERVAL (0.5 sec) by
>>>>>> a timer. But sometimes the system is very busy and the timer cannot
>>>>>> run within 0.5 sec. For example, if clocksource_watchdog is executed
>>>>>> after 10 sec, the calculated value of abs(cs_nsec - wd_nsec) is
>>>>>> magnified. The current clocksource is then misjudged as unstable.
>>>>>> So we add conditions to prevent the clocksource from being
>>>>>> misjudged.
>>>>>> Signed-off-by: yanghui <>
>>>>>> ---
>>>>>>     kernel/time/clocksource.c | 6 +++++-
>>>>>>     1 file changed, 5 insertions(+), 1 deletion(-)
>>>>>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
>>>>>> index b8a14d2fb5ba..d535beadcbc8 100644
>>>>>> --- a/kernel/time/clocksource.c
>>>>>> +++ b/kernel/time/clocksource.c
>>>>>> @@ -136,8 +136,10 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
>>>>>>     /*
>>>>>>      * Interval: 0.5sec.
>>>>>> + * MaxInterval: 1s.
>>>>>>      */
>>>>>>     #define WATCHDOG_INTERVAL (HZ >> 1)
>>>>>>     static void clocksource_watchdog_work(struct work_struct *work)
>>>>>>     {
>>>>>> @@ -404,7 +406,9 @@ static void clocksource_watchdog(struct timer_list *unused)
>>>>>>                    /* Check the deviation from the watchdog clocksource. */
>>>>>>                    md = cs->uncertainty_margin + watchdog->uncertainty_margin;
>>>>>> -               if (abs(cs_nsec - wd_nsec) > md) {
>>>>>> +               if ((abs(cs_nsec - wd_nsec) > md) &&
>>>>>> +                       cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
>>>>> Sorry, it's been awhile since I looked at this code, but why are you
>>>>> bounding the clocksource delta here?
>>>>> It seems like if the clocksource being watched was very wrong (with a
>>>>> delta larger than the MAX_INTERVAL_NS), we'd want to throw it out.
>>>>>> +                       wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
>>>>> Bounding the watchdog interval on the check does seem reasonable.
>>>>> Though one may want to keep track that if we are seeing too many of
>>>>> these delayed watchdog checks we provide some feedback via dmesg.
>>>>      Yes, checking only the watchdog delta is more reasonable.
>>>>      I think a dmesg message alone is not enough, because if tsc is
>>>>      misjudged as unstable, the system switches to hpet. hpet is very
>>>>      expensive for performance, and the only way to switch back to tsc
>>>>      is to reboot the server. We need to prevent the clocksource from
>>>>      being switched in case of misjudgment.
>>>>      Circumstances of misjudgment:
>>>>      if clocksource_watchdog is executed after 10 sec, wd_delta and
>>>>      cs_delta will both be about 10 sec, and the value of
>>>>      (cs_nsec - wd_nsec) will be magnified 20 times (10 sec / 0.5 sec).
>>> Yea, it might be worth calculating an error rate instead of assuming
>>> the interval is fixed, but also just skipping the check may be
>>> reasonable assuming timers aren't constantly being delayed (and it's
>>> more of a transient state).
>>> At some point if the watchdog timer is delayed too much, the watchdog
>> I mean the execution cycle of this function(static void
>> clocksource_watchdog(struct timer_list *unused)) has been delayed.
>>> hardware will fully wrap and one can no longer properly compare
>>> intervals. That's why the timer length is chosen as such, so having
>>> that timer delayed is really pushing the system into a potentially bad
>>> state where other subtle problems are likely to crop up.
>>> So I do worry these watchdog robustness fixes are papering over a
>>> problem, pushing expectations closer to the edge of how far the system
>>> should tolerate bad behavior. Because at some point we'll fall off. :)
>> Sorry, I don't quite understand what you mean. Should I send you a
>> patch v2?
> Sending a v2 is usually a good step (persistence is key! :)
> I'm sorry for being unclear in the above. I'm mostly just fretting
> that the watchdog logic has inherent assumptions that the timers won't
> be greatly delayed. Unfortunately the reality is that the timers may
> be delayed. So we can try to add some robustness (as your patch does),
> but at a certain point, the delays may exceed what the logic can
> tolerate and produce correct behavior. I worry that by pushing the
> robustness up to that limit, folks may not recognize the problematic
> behavior (greatly delayed timers - possibly caused by drivers
> disabling irqs for too long, or bad SMI logic, or long virtualization
> pauses), and think the system is still working as designed, even

I think we can increase the value of WATCHDOG_MAX_INTERVAL_NS to
20 sec (the soft lockup time) or longer. That way we can filter the timer
delays that are not caused by a soft lockup, as you said (drivers
disabling irqs, bad SMI logic ...).
I think this approach solves the problem of an overly long soft lockup
causing the clocksource to be switched incorrectly, which results
in performance degradation.
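
To make the arithmetic concrete, here is a minimal userspace sketch of a
check that bounds only the watchdog interval. It is not the kernel code:
check_skew() and drifted_ns() are hypothetical helpers, the bound is passed
as a parameter so that different limits can be tried, and the 100 ppm drift
and 200 us margin mentioned below are assumed values for illustration only.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

enum skew_result { SKEW_OK, SKEW_UNSTABLE, SKEW_SKIPPED };

/*
 * Hypothetical helper, not part of kernel/time/clocksource.c.
 * cs_nsec / wd_nsec: interval measured by the clocksource / the watchdog.
 * md: allowed deviation in ns.
 * max_interval_ns: assumed upper bound on a trustworthy watchdog interval.
 */
enum skew_result check_skew(int64_t cs_nsec, int64_t wd_nsec,
			    int64_t md, int64_t max_interval_ns)
{
	int64_t delta = cs_nsec - wd_nsec;

	/* The timer ran far too late: do not judge the clocksource. */
	if (wd_nsec >= max_interval_ns)
		return SKEW_SKIPPED;

	if (delta < 0)
		delta = -delta;
	return delta > md ? SKEW_UNSTABLE : SKEW_OK;
}

/* Interval seen by a clocksource that runs `ppm` parts per million fast. */
int64_t drifted_ns(int64_t wd_nsec, int64_t ppm)
{
	return wd_nsec + wd_nsec * ppm / 1000000;
}
```

With an assumed 100 ppm drift and a 200 us margin, a normal 0.5 sec run
deviates by 50 us and passes, while a run delayed to 10 sec deviates by 1 ms
(20 times as much). With a 1 sec bound that delayed run is skipped instead of
marking the clocksource unstable.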
> though it's regularly exceeding the bounds of the assumptions in the
> code. So without any feedback that something is wrong, those bounds
> will continue to be pushed until things really break in a way we
> cannot be robust about.
> That's why I was suggesting adding some sort of printk warning when we
> do see a number of delayed timers so that folks have some signal that
> things are not as they are expected to be.
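
A rough userspace sketch of that kind of feedback might look like the
following. Everything here is an assumption for illustration: struct
wd_stats, wd_note_interval() and DELAYED_WARN_THRESHOLD are hypothetical
names, and fprintf() stands in for the kernel's pr_warn().

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical threshold: warn once this many delayed runs were seen. */
#define DELAYED_WARN_THRESHOLD 3

struct wd_stats {
	unsigned int delayed;	/* runs whose interval exceeded the bound */
	bool warned;		/* warning already printed */
};

/*
 * Record one watchdog run. Returns true if the skew check should be
 * skipped because the watchdog interval reached max_interval_ns.
 */
bool wd_note_interval(struct wd_stats *st, int64_t wd_nsec,
		      int64_t max_interval_ns)
{
	if (wd_nsec < max_interval_ns)
		return false;

	st->delayed++;
	if (st->delayed >= DELAYED_WARN_THRESHOLD && !st->warned) {
		/* The kernel would use pr_warn() here. */
		fprintf(stderr,
			"clocksource watchdog: %u delayed runs, timers are being held off\n",
			st->delayed);
		st->warned = true;
	}
	return true;
}
```

The warning is printed only once so that a system with constantly delayed
timers does not flood the log; a real implementation might reset the counter
periodically instead.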
> thanks
> -john
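
For the earlier suggestion of calculating an error rate instead of assuming
a fixed interval, a minimal sketch could look like this. skew_ppm() and
skew_rate_exceeds() are hypothetical helpers and the ppm limit is an assumed
parameter; the existing check compares an absolute difference in
nanoseconds, not a rate.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: skew of the clocksource relative to the watchdog,
 * in parts per million. Returns 0 when wd_nsec is not positive.
 * Assumes (cs_nsec - wd_nsec) is small enough that multiplying by
 * 1000000 does not overflow int64_t.
 */
int64_t skew_ppm(int64_t cs_nsec, int64_t wd_nsec)
{
	if (wd_nsec <= 0)
		return 0;
	return (cs_nsec - wd_nsec) * 1000000 / wd_nsec;
}

/* True if the relative skew is larger than max_ppm in either direction. */
bool skew_rate_exceeds(int64_t cs_nsec, int64_t wd_nsec, int64_t max_ppm)
{
	int64_t ppm = skew_ppm(cs_nsec, wd_nsec);

	if (ppm < 0)
		ppm = -ppm;
	return ppm > max_ppm;
}
```

Because the result is normalized by the watchdog interval, the same drift
gives the same value whether the watchdog ran after 0.5 sec or after 10 sec,
as long as the watchdog counter has not wrapped.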

