Message-ID: <d5ed22b4-cb3a-1c69-b173-90598c5b8204@bytedance.com>
Date: Mon, 18 Oct 2021 18:41:42 +0800
From: yanghui <yanghui.def@...edance.com>
To: John Stultz <john.stultz@...aro.org>
Cc: Thomas Gleixner <tglx@...utronix.de>,
Stephen Boyd <sboyd@...nel.org>,
lkml <linux-kernel@...r.kernel.org>, shli@...com
Subject: [PATCH] Clocksource: Avoid misjudgment of clocksource
On 2021/10/12 1:02 PM, John Stultz wrote:
> On Sat, Oct 9, 2021 at 2:02 AM yanghui <yanghui.def@...edance.com> wrote:
>>
>>
>>
>> On 2021/10/9 11:38 AM, John Stultz wrote:
>>> On Fri, Oct 8, 2021 at 8:22 PM yanghui <yanghui.def@...edance.com> wrote:
>>>> On 2021/10/9 7:45 AM, John Stultz wrote:
>>>>> On Fri, Oct 8, 2021 at 1:03 AM yanghui <yanghui.def@...edance.com> wrote:
>>>>>>
>>>>>> clocksource_watchdog is executed every WATCHDOG_INTERVAL (0.5 sec)
>>>>>> by a timer. But sometimes the system is very busy and the timer
>>>>>> cannot be executed within 0.5 sec. For example, if
>>>>>> clocksource_watchdog is executed after 10 sec, the calculated value
>>>>>> of abs(cs_nsec - wd_nsec) will be enlarged. Then the current
>>>>>> clocksource will be misjudged as unstable. So add conditions to
>>>>>> prevent the clocksource from being misjudged.
>>>>>>
>>>>>> Signed-off-by: yanghui <yanghui.def@...edance.com>
>>>>>> ---
>>>>>> kernel/time/clocksource.c | 6 +++++-
>>>>>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
>>>>>> index b8a14d2fb5ba..d535beadcbc8 100644
>>>>>> --- a/kernel/time/clocksource.c
>>>>>> +++ b/kernel/time/clocksource.c
>>>>>> @@ -136,8 +136,10 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
>>>>>>
>>>>>> /*
>>>>>> * Interval: 0.5sec.
>>>>>> + * MaxInterval: 1s.
>>>>>> */
>>>>>> #define WATCHDOG_INTERVAL (HZ >> 1)
>>>>>> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
>>>>>>
>>>>>> static void clocksource_watchdog_work(struct work_struct *work)
>>>>>> {
>>>>>> @@ -404,7 +406,9 @@ static void clocksource_watchdog(struct timer_list *unused)
>>>>>>
>>>>>> /* Check the deviation from the watchdog clocksource. */
>>>>>> md = cs->uncertainty_margin + watchdog->uncertainty_margin;
>>>>>> - if (abs(cs_nsec - wd_nsec) > md) {
>>>>>> + if ((abs(cs_nsec - wd_nsec) > md) &&
>>>>>> + cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
>>>>>
>>>>> Sorry, it's been a while since I looked at this code, but why are you
>>>>> bounding the clocksource delta here?
>>>>> It seems like if the clocksource being watched was very wrong (with a
>>>>> delta larger than the MAX_INTERVAL_NS), we'd want to throw it out.
>>>>>
>>>>>> + wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
>>>>>
>>>>> Bounding the watchdog interval on the check does seem reasonable.
>>>>> Though one may want to keep track, so that if we are seeing too many
>>>>> of these delayed watchdog checks we can provide some feedback via dmesg.
>>>>
>>>> Yes, checking only the watchdog delta is more reasonable.
>>>> I think a dmesg message alone is not enough, because if the TSC is
>>>> misjudged as unstable the kernel switches to the HPET, and the HPET
>>>> is very expensive for performance, so the only way to get back to
>>>> the TSC is to reboot the server. We need to prevent the clocksource
>>>> from being switched in case of misjudgment.
>>>> Circumstances of misjudgment:
>>>> if clocksource_watchdog is executed after 10 sec, the values of
>>>> wd_delta and cs_delta will also be about 10 sec, so the value of
>>>> (cs_nsec - wd_nsec) is magnified 20 times (10 sec / 0.5 sec).
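
To make the magnification concrete, here is a small userspace sketch;
the 100 ppm rate error and the 100 us margin are made-up numbers for
illustration only, not values taken from the kernel:

#include <stdio.h>

int main(void)
{
	/* Assume the clocksource disagrees with the watchdog by a
	 * constant 100 ppm rate error (illustrative value only). */
	const double rate_error = 100e-6;
	const double margin_ns = 100000.0;	/* pretend md is ~100 us */
	const double intervals_s[] = { 0.5, 10.0 };

	for (int i = 0; i < 2; i++) {
		double delta_ns = intervals_s[i] * rate_error * 1e9;
		printf("interval %4.1f s -> |cs_nsec - wd_nsec| ~ %9.0f ns (%s)\n",
		       intervals_s[i], delta_ns,
		       delta_ns > margin_ns ? "misjudged as unstable" : "ok");
	}
	/* 0.5 s ->   50000 ns (ok)
	 * 10 s  -> 1000000 ns (20x larger -> misjudged as unstable) */
	return 0;
}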
>>>
>>> Yea, it might be worth calculating an error rate instead of assuming
>>> the interval is fixed, but also just skipping the check may be
>>> reasonable assuming timers aren't constantly being delayed (and it's
>>> more of a transient state).
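
Just to check that I understand the error-rate idea, something like
the following? Only a rough, untested sketch; scaled_md is my own name
and the surrounding code is from memory:

		/* Scale the allowed margin by the interval we actually
		 * measured instead of assuming the fixed 0.5 s
		 * WATCHDOG_INTERVAL (rough, untested sketch): */
		u64 interval_ns = jiffies_to_nsecs(WATCHDOG_INTERVAL);
		u64 scaled_md = div64_u64((u64)md * (u64)wd_nsec, interval_ns);

		if (abs(cs_nsec - wd_nsec) > scaled_md) {
			/* deviation is large even relative to the
			 * stretched interval -> mark cs unstable */
		}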
>>>
>>> At some point if the watchdog timer is delayed too much, the watchdog
>> I mean the execution cycle of this function (static void
>> clocksource_watchdog(struct timer_list *unused)) has been delayed.
>>
>>> hardware will fully wrap and one can no longer properly compare
>>> intervals. That's why the timer length is chosen as such, so having
>>> that timer delayed is really pushing the system into a potentially bad
>>> state where other subtle problems are likely to crop up.
>>>
>>> So I do worry these watchdog robustness fixes are papering over a
>>> problem, pushing expectations closer to the edge of how far the system
>>> should tolerate bad behavior. Because at some point we'll fall off. :)
>>
>> Sorry, I don't quite understand what you mean. Should I send a v2 of
>> the patch?
>
> Sending a v2 is usually a good step (persistence is key! :)
>
> I'm sorry for being unclear in the above. I'm mostly just fretting
> that the watchdog logic has inherent assumptions that the timers won't
> be greatly delayed. Unfortunately the reality is that the timers may
> be delayed. So we can try to add some robustness (as your patch does),
> but at a certain point, the delays may exceed what the logic can
> tolerate while still producing correct behavior. I worry that by
> pushing the robustness up to that limit, folks may not recognize the
> problematic behavior (greatly delayed timers - possibly caused by
> drivers disabling irqs for too long, or bad SMI logic, or long
> virtualization pauses), and think the system is still working as
> designed, even
I think we can increase the value of WATCHDOG_MAX_INTERVAL_NS up to
20 sec (the soft-lockup time) or even longer. That way we can filter
out the timer delays not caused by a soft lockup that you mentioned
(drivers disabling IRQs, bad SMI logic, ...).
I think this method can solve the problem where a soft lockup lasts
too long and the clocksource is incorrectly switched, resulting in
performance degradation.
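
Concretely, I mean something like this on top of my patch (untested
sketch; 20 sec is just the usual soft-lockup threshold, not a tuned
value):

/* Skip the stability judgment when the watchdog timer itself was
 * delayed by more than ~20 s (about the soft-lockup threshold). */
#define WATCHDOG_MAX_INTERVAL_NS	(20LL * NSEC_PER_SEC)

		/* Only judge stability when the interval is sane. */
		if ((abs(cs_nsec - wd_nsec) > md) &&
		    wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
			/* ... mark cs unstable as before ... */
		}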
> though it's regularly exceeding the bounds of the assumptions in the
> code. So without any feedback that something is wrong, those bounds
> will continue to be pushed until things really break in a way we
> cannot be robust about.
>
> That's why I was suggesting adding some sort of printk warning when we
> do see a number of delayed timers so that folks have some signal that
> things are not as they are expected to be.
>
> thanks
> -john
>
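
For the dmesg feedback, would something like this be enough? A rough
sketch of what I will try in v2; the counter name wd_delay_count and
the threshold of 3 are made up:

static unsigned int wd_delay_count;

		if (wd_nsec > WATCHDOG_MAX_INTERVAL_NS) {
			/* The check itself is unreliable after such a
			 * delay: skip it, but complain if it keeps
			 * happening so the delays stay visible. */
			if (++wd_delay_count >= 3)
				pr_warn("timekeeping watchdog: %u consecutive delayed runs\n",
					wd_delay_count);
			continue;
		}
		wd_delay_count = 0;	/* back on schedule */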