linux-kernel - Re: [PATCH v2] clocksource: Skip watchdog check for large watchdog intervals

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240113114400.GH3303@incl>
Date: Sat, 13 Jan 2024 12:44:00 +0100
From: Jiri Wiesner <jwiesner@...e.de>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: linux-kernel@...r.kernel.org, John Stultz <jstultz@...gle.com>,
	Stephen Boyd <sboyd@...nel.org>,
	"Paul E. McKenney" <paulmck@...nel.org>,
	Feng Tang <feng.tang@...el.com>
Subject: Re: [PATCH v2] clocksource: Skip watchdog check for large watchdog
 intervals

On Fri, Jan 12, 2024 at 05:48:22PM +0100, Thomas Gleixner wrote:
> On Wed, Jan 10 2024 at 20:26, Jiri Wiesner wrote:
> > The measured clocksource skew - the absolute difference between cs_nsec
> > and wd_nsec - was 668 microseconds:
> >> cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612
> >
> > The kernel (based on 5.14.21) used 200 microseconds for the
> > uncertainty_margin of both the clocksource and watchdog, resulting in a
> > threshold of 400 microseconds.  The discrepancy is that the measured
> > clocksource skew was evaluated against a threshold suited for watchdog
> > intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5
> > second.
> 
> This really took some time to decode. What you are trying to explain is:
> 
>    The comparison between the clocksource and the watchdog is not
>    working for large readout intervals because the conversion to
>    nanoseconds is imprecise. The reason is that the initialization
>    values of the shift/mult pairs which are used for conversion are not
>    sufficiently accurate and the accumulated inaccuracy causes the
>    comparison to exceed the threshold.

The root cause of the bug does not concern the precision of the conversion 
to nanoseconds. The shift/mult pair of the TSC can convert diffs as large 
as 600 seconds. The HPET is limited to 179.0 seconds on account of being a 
32-bit counter. The acpi_pm can convert only 4.7 seconds. With the 
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE option enabled, the ranges are 
reduced to a half. The example above showed the TSC as the clocksource and 
the HPET as a watchdog both of which should be able to convert a diff of 
14.5 seconds to nanoseconds with sufficient precision.

I could change the description to:

The kernel (based on 5.14.21) used 200 microseconds for the 
uncertainty_margin of both the clocksource and watchdog, resulting in a 
threshold of 400 microseconds (the md variable). The root cause of the 
issue is that the measured clocksource skew, 668 microseconds, was 
evaluated against a threshold (the md variable) which is suited for 
watchdog intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 
0.5 second. Both the cs_nsec and the wd_nsec value indicate that the 
watchdog interval was circa 14.5 seconds.

The intention of 2e27e793e280 ("clocksource: Reduce clocksource-skew 
threshold") was to tighten the threshold for evaluating skew and set the 
lower bound for the uncertainty_margin of clocksources to twice 
WATCHDOG_MAX_SKEW. Later in c37e85c135ce ("clocksource: Loosen clocksource 
watchdog constraints"), the WATCHDOG_MAX_SKEW constant was increased to 
125 microseconds to fit the limit of NTP, which is able to use 
a clocksource that suffers from up to 500 microseconds of skew per second. 
Both the TSC and the HPET use default uncertainty_margin. When the 
watchdog interval gets stretched the default uncertainty_margin is no 
longer a suitable lower bound for evaluating skew - it imposes a limit 
that is stricter than the skew with which NTP can deal. The longer the 
watchdog interval is the larger the threshold should be. For evaluating 
skew in a watchdog interval of 14.5 seconds, a proportional threshold 
should be used, which should be 14500 microseconds (7250 coming from the 
TSC, 7250 coming from the HPET).

> So yes, limiting the maximum readout interval and skipping the check is
> sensible.

It is a bug to mark a clocksource unstable if the skew is 668 microseconds 
in 14.5 seconds. One possible solution is to skip the check. I originally 
posted a patch scaling the uncertainty_margin of clocksources but it got 
no support and the feedback I got was to avoid the calculation and skip 
the current check in order to keep the code simple:
https://lore.kernel.org/lkml/20231221160517.GA22919@incl/#t
Since skipping the check solves issue as well I sent a patch.

> >  /*
> >   * Interval: 0.5sec.
> >   */
> > -#define WATCHDOG_INTERVAL (HZ >> 1)
> > +#define WATCHDOG_INTERVAL	(HZ >> 1)
> > +#define WATCHDOG_INTR_MAX_NS	((WATCHDOG_INTERVAL + (WATCHDOG_INTERVAL >> 1))\
> > +				 * (NSEC_PER_SEC / HZ))
> 
> That 1.5 * WATCHDOG_INTERVAL seems to be rather arbitrary. One second
> should be safe enough, no?

Yes, it is arbitrary. The concern is how strict can we allow the skew 
check to get. 2 * WATCHDOG_INTERVAL would mean imposing a skew threshold 
of 250 microseconds per second for intervals that are close in their value 
to 2 * WATCHDOG_INTERVAL. Even using 2 * WATCHDOG_INTERVAL would still be 
many times better than using 500 microseconds to check skew in a 14.5-long 
watchdog interval.

> > +		/*
> > +		 * The processing of timer softirqs can get delayed (usually
> > +		 * on account of ksoftirqd not getting to run in a timely
> > +		 * manner), which causes the watchdog interval to stretch.
> > +		 * Some clocksources, e.g. acpi_pm, cannot tolerate
> > +		 * watchdog intervals longer than a few seconds.
> 
> What ensures that the watchdog did not wrap around then?

Nothing. It has always been this way. The check usually fails when the 
watchdog wraps around, in which case the clocksource is marked unstable 
for no fault of its own.

> > +				watchdog_max_intr = interval;
> > +				pr_warn("Skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n",
> > +					cs_nsec, wd_nsec);
> 
> This really wants to have a proper indication why the check was skipped,
> i,e. due to a long readout interval, no?

It could be changed to:
pr_warn("Large watchdog interval, skipping check: cs_nsec: %lld wd_nsec: %lld\n",

I will send a v3 incorporation all the suggestions after we have made the 
description intelligible. Thank you for the feedback.
-- 
Jiri Wiesner
SUSE Labs