linux-kernel - [PATCH] Clocksource: Avoid misjudgment of clocksource

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d5ed22b4-cb3a-1c69-b173-90598c5b8204@bytedance.com>
Date:   Mon, 18 Oct 2021 18:41:42 +0800
From:   yanghui <yanghui.def@...edance.com>
To:     John Stultz <john.stultz@...aro.org>
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        Stephen Boyd <sboyd@...nel.org>,
        lkml <linux-kernel@...r.kernel.org>, shli@...com
Subject: [PATCH] Clocksource: Avoid misjudgment of clocksource



在 2021/10/12 下午1:02, John Stultz 写道:
> On Sat, Oct 9, 2021 at 2:02 AM yanghui <yanghui.def@...edance.com> wrote:
>>
>>
>>
>> 在 2021/10/9 上午11:38, John Stultz 写道:
>>> On Fri, Oct 8, 2021 at 8:22 PM yanghui <yanghui.def@...edance.com> wrote:
>>>> 在 2021/10/9 上午7:45, John Stultz 写道:
>>>>> On Fri, Oct 8, 2021 at 1:03 AM yanghui <yanghui.def@...edance.com> wrote:
>>>>>>
>>>>>> clocksource_watchdog is executed every WATCHDOG_INTERVAL(0.5s) by
>>>>>> Timer. But sometimes system is very busy and the Timer cannot be
>>>>>> executed in 0.5sec. For example,if clocksource_watchdog be executed
>>>>>> after 10sec, the calculated value of abs(cs_nsec - wd_nsec) will
>>>>>> be enlarged. Then the current clocksource will be misjudged as
>>>>>> unstable. So we add conditions to prevent the clocksource from
>>>>>> being misjudged.
>>>>>>
>>>>>> Signed-off-by: yanghui <yanghui.def@...edance.com>
>>>>>> ---
>>>>>>     kernel/time/clocksource.c | 6 +++++-
>>>>>>     1 file changed, 5 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
>>>>>> index b8a14d2fb5ba..d535beadcbc8 100644
>>>>>> --- a/kernel/time/clocksource.c
>>>>>> +++ b/kernel/time/clocksource.c
>>>>>> @@ -136,8 +136,10 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
>>>>>>
>>>>>>     /*
>>>>>>      * Interval: 0.5sec.
>>>>>> + * MaxInterval: 1s.
>>>>>>      */
>>>>>>     #define WATCHDOG_INTERVAL (HZ >> 1)
>>>>>> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
>>>>>>
>>>>>>     static void clocksource_watchdog_work(struct work_struct *work)
>>>>>>     {
>>>>>> @@ -404,7 +406,9 @@ static void clocksource_watchdog(struct timer_list *unused)
>>>>>>
>>>>>>                    /* Check the deviation from the watchdog clocksource. */
>>>>>>                    md = cs->uncertainty_margin + watchdog->uncertainty_margin;
>>>>>> -               if (abs(cs_nsec - wd_nsec) > md) {
>>>>>> +               if ((abs(cs_nsec - wd_nsec) > md) &&
>>>>>> +                       cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
>>>>>
>>>>> Sorry, it's been awhile since I looked at this code, but why are you
>>>>> bounding the clocksource delta here?
>>>>> It seems like if the clocksource being watched was very wrong (with a
>>>>> delta larger than the MAX_INTERVAL_NS), we'd want to throw it out.
>>>>>
>>>>>> +                       wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
>>>>>
>>>>> Bounding the watchdog interval on the check does seem reasonable.
>>>>> Though one may want to keep track that if we are seeing too many of
>>>>> these delayed watchdog checks we provide some feedback via dmesg.
>>>>
>>>>      Yes, only to check watchdog delta is more reasonable.
>>>>      I think Only have dmesg is not enough, because if tsc was be misjudged
>>>>      as unstable then switch to hpet. And hpet is very expensive for
>>>>      performance, so if we want to switch to tsc the only way is to reboot
>>>>      the server. We need to prevent the switching of the clock source in
>>>>      case of misjudgment.
>>>>      Circumstances of misjudgment:
>>>>      if clocksource_watchdog is executed after 10sec, the value of wd_delta
>>>>      and cs_delta also be about 10sec, also the value of (cs_nsec- wd_nsec)
>>>>      will be magnified 20 times(10sec/0.5sec).The delta value is magnified.
>>>
>>> Yea, it might be worth calculating an error rate instead of assuming
>>> the interval is fixed, but also just skipping the check may be
>>> reasonable assuming timers aren't constantly being delayed (and it's
>>> more of a transient state).
>>>
>>> At some point if the watchdog timer is delayed too much, the watchdog
>> I mean the execution cycle of this function(static void
>> clocksource_watchdog(struct timer_list *unused)) has been delayed.
>>
>>> hardware will fully wrap and one can no longer properly compare
>>> intervals. That's why the timer length is chosen as such, so having
>>> that timer delayed is really pushing the system into a potentially bad
>>> state where other subtle problems are likely to crop up.
>>>
>>> So I do worry these watchdog robustness fixes are papering over a
>>> problem, pushing expectations closer to the edge of how far the system
>>> should tolerate bad behavior. Because at some point we'll fall off. :)
>>
>> Sorry,I don't seem to understand what you mean. Should I send your Patch
>> v2 ?
> 
> Sending a v2 is usually a good step (persistence is key! :)
> 
> I'm sorry for being unclear in the above. I'm mostly just fretting
> that the watchdog logic has inherent assumptions that the timers won't
> be greatly delayed. Unfortunately the reality is that the timers may
> be delayed. So we can try to add some robustness (as your patch does),
> but at a certain point, the delays may exceed what the logic can
> tolerate and produce correct behavior. I worry that by pushing the
> robustness up to that limit, folks may not recognize the problematic
> behavior (greatly delayed timers - possibly caused by drivers
> disabling irqs for too long, or bad SMI logic, or long virtualization
> pauses), and think the system is still working as designed, even

I think we can increase the value of WATCHDOG_MAX_INTERVAL_NS up to
20sec(soft lockup time) or more longer. So we can filter those timer 
delays caused by non-softlockup as your said(drivers disabling irq, bad
SMI logic ...).
I think this method can solve the problem that the softlock is
too long and the clocksource is incorrectly switched, resulting
in performance degradation.
> though its regularly exceeding the bounds of the assumptions in the
> code. So without any feedback that something is wrong, those bounds
> will continue to be pushed until things really break in a way we
> cannot be robust about.
> 
> That's why I was suggesting adding some sort of printk warning when we
> do see a number of delayed timers so that folks have some signal that
> things are not as they are expected to be.
> 
> thanks
> -john
>