linux-kernel - Re: [PATCH RFC] clocksource: Detect a watchdog overflow

Message-ID: <CALAqxLUnV4rmhcwxyEddEpha2v5=Z+g6L8_zGme2LPMxeE02Ng@mail.gmail.com>
Date:	Wed, 6 Apr 2016 15:21:50 -0700
From:	John Stultz <john.stultz@...aro.org>
To:	Gratian Crisan <gratian.crisan@...com>
Cc:	Thomas Gleixner <tglx@...utronix.de>,
	lkml <linux-kernel@...r.kernel.org>,
	Gratian Crisan <gratian@...il.com>
Subject: Re: [PATCH RFC] clocksource: Detect a watchdog overflow

On Tue, Mar 15, 2016 at 11:50 AM, Gratian Crisan <gratian.crisan@...com> wrote:
> The clocksource watchdog can falsely trigger and disable the main
> clocksource when the watchdog wraps around.
>
> The reason is that an interrupt storm and/or high priority (FIFO/RR) tasks
> can preempt the timer softirq long enough for the watchdog to wrap around
> if it has a limited number of bits available by comparison to the main
> clocksource. One observed example is on an Intel Baytrail platform where TSC
> is the main clocksource, HPET is disabled due to a hardware bug and acpi_pm
> gets selected as the watchdog clocksource.
>
> Calculate the maximum number of nanoseconds the watchdog clocksource can
> represent without overflow and do not disqualify the main clocksource if
> the delta since the last time we have checked exceeds the measurement
> capabilities of the watchdog clocksource.

Sorry for not getting back to you sooner on this. You managed to send
these both out while I was at a conference and on vacation, and so
they were deep in the mail backlog. :)

So I'm sympathetic to this issue, because I remember seeing similar
problems w/ runaway SCHED_FIFO tasks w/ PREEMPT_RT.

However, it's really difficult to create a solution without opening new
cases where bad clocksources will be misidentified as good. Your
solution seems to suffer from this as well: measuring the time passed
with a known-bad clocksource can easily result in large deltas, which
will be ignored if the watchdog has a short wrap interval.
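
To make the concern concrete, here is a minimal userspace sketch, not
the kernel code and not the patch itself. The helper names
(wd_max_ns, wd_measured_ns, cs_unstable) are hypothetical, and the
numbers assume a 24-bit acpi_pm counter at 3.579545 MHz with a 62.5 ms
threshold chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Longest interval (in ns) a free-running counter of the given width
 * can measure before it wraps. Kept to narrow counters so the
 * multiplication below cannot overflow 64 bits. */
static uint64_t wd_max_ns(unsigned int bits, uint64_t freq_hz)
{
	assert(bits < 34 && freq_hz > 0);
	return ((1ULL << bits) * NSEC_PER_SEC) / freq_hz;
}

/* What a wrapping watchdog reports: true elapsed time modulo its
 * wrap period. */
static uint64_t wd_measured_ns(uint64_t true_ns, uint64_t max_ns)
{
	return true_ns % max_ns;
}

/* Hypothetical check in the spirit of the proposal.
 * Returns 1 if the clocksource would be marked unstable. */
static int cs_unstable(uint64_t cs_ns, uint64_t wd_ns,
		       uint64_t max_ns, uint64_t threshold_ns)
{
	uint64_t diff;

	/* The clocksource claims more time passed than the watchdog
	 * can represent: assume the watchdog wrapped, skip the check. */
	if (cs_ns > max_ns)
		return 0;

	diff = cs_ns > wd_ns ? cs_ns - wd_ns : wd_ns - cs_ns;
	return diff > threshold_ns;
}
```

With these assumptions wd_max_ns(24, 3579545) comes out at roughly
4.69 seconds. If 0.5 s really elapsed and a broken clocksource reports
1 s, cs_unstable() flags it. If the same broken clocksource reports
6 s, the delta exceeds the wrap interval, the check is skipped, and
the bad clocksource passes.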

A previous effort on this was made here, and there's a resulting
thread that didn't come to resolution:
    https://lkml.org/lkml/2015/8/17/542

Way back I had tried to come up with an approach where if the time
delta was large, it was divided by the watchdog interval, and then we
just compared the remainder with the current watchdog delta to see if
they matched (closely enough). Unfortunately this didn't work out for
me then, but perhaps it deserves a second try?
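
Roughly, the remainder idea could look like the sketch below. This is
a userspace illustration under my own assumptions, not the code I
tried back then: cs_matches_wd is a hypothetical name, wd_period_ns is
the watchdog's wrap period in nanoseconds, and wd_ns is assumed to be
smaller than wd_period_ns.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the clocksource delta agrees with the watchdog delta
 * once whole watchdog wrap periods are discarded. */
static int cs_matches_wd(uint64_t cs_ns, uint64_t wd_ns,
			 uint64_t wd_period_ns, uint64_t threshold_ns)
{
	uint64_t rem, diff;

	assert(wd_period_ns > 0 && wd_ns < wd_period_ns);

	/* Drop the whole wrap periods the watchdog cannot see. */
	rem = cs_ns % wd_period_ns;

	diff = rem > wd_ns ? rem - wd_ns : wd_ns - rem;

	/* Both values sit on a circle of length wd_period_ns, so a
	 * pair straddling the wrap point is close via the short arc. */
	if (diff > wd_period_ns - diff)
		diff = wd_period_ns - diff;

	return diff <= threshold_ns;
}
```

Assuming a wrap period of 4686970000 ns and a 62.5 ms threshold, a
clocksource reporting 6 s when only 0.5 s elapsed is rejected, while
a long but honest 6 s delay is accepted. The weakness is aliasing: a
clocksource that is off by close to a whole multiple of the wrap
period still matches, and any allowed drift grows with the length of
the delay.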

thanks
-john