Message-ID: <602b155f-4108-2865-3f1c-4e63d73405ed@yandex-team.ru>
Date: Sat, 18 May 2019 20:53:27 +0300
From: Konstantin Khlebnikov <khlebnikov@...dex-team.ru>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: Stephen Boyd <sboyd@...nel.org>,
John Stultz <john.stultz@...aro.org>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH RFC] time: validate watchdog clocksource using second best candidate
On 18.05.2019 18:17, Thomas Gleixner wrote:
> On Wed, 15 May 2019, Konstantin Khlebnikov wrote:
>
>> Timekeeping watchdog verifies doubtful clocksources using more reliable
>> candidates. For x86 it likely verifies 'tsc' using 'hpet'. But 'hpet'
>> is far from perfect too. It's better to have second opinion if possible.
>>
>> We're seeing sudden jumps of hpet counter to 0xffffffff:
>
> On which kind of hardware? A particular type of CPU or random ones?
In general this is a very rare event.
This exact pattern has been seen ten times or so on several servers with
Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
(a custom-built platform with the Intel C610 chipset)
and has not been seen on the previous generation
Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
(another custom-built platform).
So this might not be related to the CPU model.
>
>> timekeeping watchdog on CPU56: Marking clocksource 'tsc' as unstable because the skew is too large:
>> 'hpet' wd_now: ffffffff wd_last: 19ec5720 mask: ffffffff
>> 'tsc' cs_now: 69b8a15f0aed cs_last: 69b862c9947d mask: ffffffffffffffff
>>
>> Shaohua Li reported the same case three years ago.
>> His patch blacklisted this exact value and re-read the hpet counter.
>
> Can you provide a reference please? Preferably a lore.kernel.org/... URL
The link was in the patch: https://lore.kernel.org/patchwork/patch/667413/
>
>> This patch uses second reliable clocksource as backup for validation.
>> For x86 this is usually 'acpi_pm'. If the watchdog and the backup do not agree
>> then other clocksources will not be marked as unstable at this iteration.
>
> The mess you add to the watchdog code is unholy and that's broken as there
> is no guarantee for acpi_pm (or any other secondary watchdog) being
> available.
The ACPI power management timer is pretty standard x86 hardware.
But my patch should work on any platform with any second reliable clocksource.
If there is no second clocksource my patch does nothing:
watchdog_backup stays NULL and backup_consent is always true.
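To make the fallback behaviour concrete, here is a minimal userspace sketch of the decision described above. It is not the code from the patch: the function names, the nanosecond-interval interface and the threshold value are assumptions chosen for illustration, and only watchdog_backup and backup_consent are names taken from this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed threshold for illustration only (62.5 ms). */
#define SKEW_THRESHOLD_NS 62500000LL

/* True if two measured intervals (in ns) differ by more than the threshold. */
static bool skew_too_large(int64_t a_ns, int64_t b_ns)
{
	return llabs((long long)(a_ns - b_ns)) > SKEW_THRESHOLD_NS;
}

/*
 * Hypothetical helper: decide whether the clocksource under test may be
 * marked unstable in this iteration.
 *
 * have_backup == false models watchdog_backup == NULL: backup_consent is
 * then always true and the primary watchdog alone decides.
 */
static bool should_mark_unstable(int64_t cs_ns, int64_t wd_ns,
				 bool have_backup, int64_t backup_ns)
{
	bool backup_consent = true;

	/* Intervals are elapsed times, so they must not be negative. */
	assert(cs_ns >= 0 && wd_ns >= 0 && backup_ns >= 0);

	if (have_backup)
		backup_consent = !skew_too_large(wd_ns, backup_ns);

	return backup_consent && skew_too_large(cs_ns, wd_ns);
}
```

With no backup, a watchdog jump still marks the clocksource unstable, exactly as before. With a backup that disagrees with the watchdog, the iteration is skipped and nothing is marked.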
>
> If the only wreckaged value is always ffffffff then I rather reread the
> hpet in that case. But not in the watchdog code, we need to do that in the
> HPET code as this affects any other HPET user as well.
>
> Thanks,
>
> tglx
>
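For comparison, the re-read approach suggested above could look roughly like the userspace model below. This is only a sketch: hpet_read_filtered(), the callback type, the retry bound and the fake reader are all hypothetical names invented for illustration, not existing kernel interfaces and not Shaohua Li's patch. The retry count is bounded because 0xffffffff is also a legitimate counter value just before wrap-around.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HPET_SUSPECT_VALUE 0xffffffffu
#define HPET_MAX_REREADS   3 /* assumed bound, for illustration only */

/* Hypothetical stand-in for the raw HPET counter register read. */
typedef uint32_t (*hpet_read_fn)(void *ctx);

/* Re-read the counter a bounded number of times if it returns the suspect value. */
static uint32_t hpet_read_filtered(hpet_read_fn read_raw, void *ctx)
{
	uint32_t val;
	int tries = HPET_MAX_REREADS;

	assert(read_raw != NULL);

	val = read_raw(ctx);
	while (val == HPET_SUSPECT_VALUE && tries-- > 0)
		val = read_raw(ctx);

	/* If every re-read still gave 0xffffffff, accept it as genuine. */
	return val;
}

/* Test double: replays a fixed list of samples, repeating the last one. */
struct fake_hpet {
	const uint32_t *samples;
	size_t n;
	size_t pos;
};

static uint32_t fake_hpet_read(void *ctx)
{
	struct fake_hpet *f = ctx;
	size_t i = f->pos < f->n ? f->pos++ : f->n - 1;

	return f->samples[i];
}
```

Doing this in the HPET read path rather than in the watchdog means every HPET user sees the filtered value, which is the point made in the quoted reply.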