linux-kernel - Re: [PATCH] Clocksource: Avoid misjudgment of clocksource

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJRGBZxL4ofhdhMTAaravWyCLsg8Pghtd+J9LGwBwzF2f6MKng@mail.gmail.com>
Date:   Mon, 1 Nov 2021 05:59:52 -0400
From:   Luming Yu <luming.yu@...il.com>
To:     paulmck@...nel.org
Cc:     John Stultz <john.stultz@...aro.org>,
        yanghui <yanghui.def@...edance.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Stephen Boyd <sboyd@...nel.org>,
        lkml <linux-kernel@...r.kernel.org>, Shaohua Li <shli@...com>
Subject: Re: [PATCH] Clocksource: Avoid misjudgment of clocksource

On Fri, Oct 22, 2021 at 7:36 PM Paul E. McKenney <paulmck@...nel.org> wrote:
>
> On Thu, Oct 21, 2021 at 05:37:24AM -0400, Luming Yu wrote:
> > On Wed, Oct 20, 2021 at 1:49 PM Paul E. McKenney <paulmck@...nel.org> wrote:
> > >
> > > On Wed, Oct 20, 2021 at 06:09:58AM -0400, Luming Yu wrote:
> > > > On Tue, Oct 19, 2021 at 1:04 AM John Stultz <john.stultz@...aro.org> wrote:
> > > > >
> > > > > On Mon, Oct 18, 2021 at 9:14 PM yanghui <yanghui.def@...edance.com> wrote:
> > > > > > 在 2021/10/19 上午12:14, John Stultz 写道:
> > > > > > > On Tue, Oct 12, 2021 at 1:06 AM brookxu <brookxu.cn@...il.com> wrote:
> > > > > > >> John Stultz wrote on 2021/10/12 13:29:
> > > > > > >>> On Mon, Oct 11, 2021 at 10:23 PM brookxu <brookxu.cn@...il.com> wrote:
> > > > > > >>>> John Stultz wrote on 2021/10/12 12:52 下午:
> > > > > > >>>>> On Sat, Oct 9, 2021 at 7:04 AM brookxu <brookxu.cn@...il.com> wrote:
> > > > > > >>>> If we record the watchdog's start_time in clocksource_start_watchdog(), and then
> > > > > > >>>> when we verify cycles in clocksource_watchdog(), check whether the clocksource
> > > > > > >>>> watchdog is blocked. Due to MSB verification, if the blocked time is greater than
> > > > > > >>>> half of the watchdog timer max_cycles, then we can safely ignore the current
> > > > > > >>>> verification? Do you think this idea is okay?
> > > > > > >>>
> > > > > > >>> I can't say I totally understand the idea. Maybe could you clarify with a patch?
> > > > > > >>>
> > > > > > >>
> > > > > > >> Sorry, it looks almost as follows:
> > > > > > >>
> > > > > > >> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > > > > > >> index b8a14d2..87f3b67 100644
> > > > > > >> --- a/kernel/time/clocksource.c
> > > > > > >> +++ b/kernel/time/clocksource.c
> > > > > > >> @@ -119,6 +119,7 @@
> > > > > > >>   static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
> > > > > > >>   static DEFINE_SPINLOCK(watchdog_lock);
> > > > > > >>   static int watchdog_running;
> > > > > > >> +static unsigned long watchdog_start_time;
> > > > > > >>   static atomic_t watchdog_reset_pending;
> > > > > > >>
> > > > > > >>   static inline void clocksource_watchdog_lock(unsigned long *flags)
> > > > > > >> @@ -356,6 +357,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > > > >>          int next_cpu, reset_pending;
> > > > > > >>          int64_t wd_nsec, cs_nsec;
> > > > > > >>          struct clocksource *cs;
> > > > > > >> +       unsigned long max_jiffies;
> > > > > > >>          u32 md;
> > > > > > >>
> > > > > > >>          spin_lock(&watchdog_lock);
> > > > > > >> @@ -402,6 +404,10 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > > > >>                  if (atomic_read(&watchdog_reset_pending))
> > > > > > >>                          continue;
> > > > > > >>
> > > > > > >> +               max_jiffies = nsecs_to_jiffies(cs->max_idle_ns);
> > > > > > >> +               if (time_is_before_jiffies(watchdog_start_time + max_jiffies))
> > > > > > >> +                       continue;
> > > > > > >> +
> > > > > > >
> > > > > > > Sorry, what is the benefit of using jiffies here?   Jiffies are
> > > > > > > updated by counting the number of tick intervals on the current
> > > > > > > clocksource.
> > > > > > >
> > > > > > > This seems like circular logic, where we're trying to judge the
> > > > > > > current clocksource by using something we derived from the current
> > > > > > > clocksource.
> > > > > > > That's why the watchdog clocksource is important, as it's supposed to
> > > > > > > be a separate counter that is more reliable (but likely slower) then
> > > > > > > the preferred clocksource.
> > > > > > >
> > > > > > > So I'm not really sure how this helps.
> > > > > > >
> > > > > > > The earlier patch by yanghui at least used the watchdog interval to
> > > > > > > decide if the watchdog timer had expired late. Which seemed
> > > > > > > reasonable, but I thought it might be helpful to add some sort of a
> > > > > > > counter so if the case is happening repeatedly (timers constantly
> > > > > > > being delayed) we have a better signal that the watchdog and current
> > > > > > > clocksource are out of sync.  Because again, timers are fired based on
> > > > > >
> > > > > > I think only have a signal ls not enough. we need to prevent
> > > > > > clocksource from being incorrectly switched.
> > > > >
> > > > > Right, but we also have to ensure that we also properly disqualify
> > > > > clocksources that are misbehaving.
> > > > >
> > > > > In the case that the current clocksource is running very slow (imagine
> > > > > old TSCs that lowered freq with cpufreq), then system time slows down,
> > > > > so timers fire late.
> > > > > So it would constantly seem like the irqs are being delayed, so with
> > > > > your logic we would not disqualify a clearly malfunctioning
> > > > > clocksource..
> > > > >
> > > > > > The Timer callback function clocksource_watchdog() is executed in the
> > > > > > context of softirq(run_timer_softirq()). So if softirq is disabled for
> > > > > > long time(One situation is long time softlockup), clocksource_watchdog()
> > > > > > will be delay executed.
> > > > >
> > > > > Yes. The reality is that timers are often spuriously delayed. We don't
> > > > > want a short burst of timer misbehavior to disqualify a good
> > > > > clocksource.
> > > > >
> > > > > But the problem is that this situation and the one above (with the
> > > > > freq changing TSC), will look exactly the same.
> > > > >
> > > > > So having a situation where if the watchdog clocksource thinks too
> > > > > much time has passed between watchdog timers, we can skip judgement,
> > > > > assuming its a spurious delay. But I think we need to keep a counter
> > > > > so that if this happens 3-5 times in a row, we stop ignoring the
> > > > > misbehavior and judge the current clocksource, as it may be running
> > > > > slowly.
> > > > >
> > > > > > >
> > > > > > I think it will be better to add this to my patch:
> > > > > >   /*
> > > > > >    * Interval: 0.5sec.
> > > > > > - * MaxInterval: 1s.
> > > > > > + * MaxInterval: 20s.
> > > > > >    */
> > > > > >   #define WATCHDOG_INTERVAL (HZ >> 1)
> > > > > > -#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
> > > > > > +#define WATCHDOG_MAX_INTERVAL_NS (20 * NSEC_PER_SEC)
> > > > > >
> > > > >
> > > > > Some watchdog counters wrap within 20 seconds, so I don't think this
> > > > > is a good idea.
> > > > >
> > > > > The other proposal to calculate the error rate, rather than a fixed
> > > > > error boundary might be useful too, as if the current clocksource and
> > > > > watchdog are close, a long timer delay won't disqualify them if we
> > > > > scale the error bounds to be within an given error rate.
> > > >
> > > > In most of tsc unstable trouble shooting on modern servers we experienced,
> > > > it usually ends up in a false alarm triggered by the  clock source
> > > > watchdog for tsc.
> > > >
> > > > I think Paul has a proposal to make a clock source watchdog to be more
> > > > intelligent.
> > > > Its job is to find a real problem instead of causing a problem.
> > >
> > > And that proposal is now in mainline:
> >
> > Great! : -)
> > >
> > > 22a223833716 clocksource: Print deviation in nanoseconds when a clocksource becomes unstable
> > > 1253b9b87e42 clocksource: Provide kernel module to test clocksource watchdog
> > > 2e27e793e280 clocksource: Reduce clocksource-skew threshold
> > > fa218f1cce6b clocksource: Limit number of CPUs checked for clock synchronization
> > > 7560c02bdffb clocksource: Check per-CPU clock synchronization when marked unstable
> > > db3a34e17433 clocksource: Retry clock read if long delays detected
> > >
> > > The strategy is to disqualify a clock comparison if the reads took too
> > > long, and to retry the comparison.  If four consecutive comparison take
> > > too long, clock skew is reported.  The number of consecutive comparisons
> > > may be adjusted by the usual kernel boot parameter.
> > >
> > > > so disabling it for known good-tsc might be a reasonable good idea
> > > > that can save manpower for other
> > > > more valuable problems to solve, or at least make it statistically a
> > > > problem less chance to happen.
> > >
> > > One additional piece that is still in prototype state in -rcu is to give
> > > clocksources some opportunity to resynchronize if there are too many
> > > slow comparisons.  This is intended to handle cases where clocks often
> >
> > if there is such tsc-sync algorithm existing in software, it really
> > can help system software engineers
> > to solve some rare power good signals synchronization problem caused
> > by bios that caused
> > boot time tsc sync check failure that usually would consume huge
> > debugging engine for bringing up qualified linux system.
> >
> > Less depending on platform quirks should be good thing to linux for
> > tsc && rcu support.
>
> Good point, I have procrastinated long enough.
>
> How about like this?

sorry, I meant a better algorithm to use tsc adjust register
like the tried one in  arch/x86/kernel/tsc_sync.c


>
>                                                         Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit 9ec2a03bbf4bee3d9fbc02a402dee36efafc5a2d
> Author: Paul E. McKenney <paulmck@...nel.org>
> Date:   Thu May 27 11:03:28 2021 -0700
>
>     clocksource: Forgive repeated long-latency watchdog clocksource reads
>
>     Currently, the clocksource watchdog reacts to repeated long-latency
>     clocksource reads by marking that clocksource unstable on the theory that
>     these long-latency reads are a sign of a serious problem.  And this theory

Maybe we need to use other core's tsc as a reference clock instead of
using HPET, which , to my knowledge ,  is the place where the problem happens.

Ruling out HPET and other slow clock devices as the obvious wrong choice
of a reference clock for  tsc,  I guess there will be less chance we
(in kernel code)
will get bothered by other latency problems perceived in the clock
source watchdog.

>     does in fact have real-world support in the form of firmware issues [1].
>
>     However, it is also possible to trigger this using stress-ng on what
>     the stress-ng man page terms "poorly designed hardware" [2].  And it
>     is not necessarily a bad thing for the kernel to diagnose cases where
>     heavy memory-contention workloads are being run on hardware that is not
>     designed for this sort of use.
>
>     Nevertheless, it is quite possible that real-world use will result in
>     some situation requiring that high-stress workloads run on hardware
>     not designed to accommodate them, and also requiring that the kernel
>     refrain from marking clocksources unstable.
>
>     Therefore, react to persistent long-latency reads by leaving the
>     clocksource alone, but using the old 62.5-millisecond skew-detection
>     threshold.  In addition, the offending clocksource is marked for
>     re-initialization, which both restarts that clocksource with a clean bill
>     of health and avoids false-positive skew reports on later watchdog checks.
>     Once marked for re-initialization, the clocksource is not subjected to
>     further watchdog checking until a subsequent successful read from that
>     clocksource that is free of excessive delays.
>
>     However, if clocksource.max_cswd_coarse_reads consecutive clocksource read
>     attempts result in long latencies, a warning (splat) will be emitted.
>     This kernel boot parameter defaults to 100, and this warning can be
>     disabled by setting it to zero or to a negative value.
>
>     [ paulmck: Apply feedback from Chao Gao ]
>
>     Link: https://lore.kernel.org/lkml/20210513155515.GB23902@xsang-OptiPlex-9020/ # [1]
>     Link: https://lore.kernel.org/lkml/20210521083322.GG25531@xsang-OptiPlex-9020/ # [2]
>     Link: https://lore.kernel.org/lkml/20210521084405.GH25531@xsang-OptiPlex-9020/
>     Link: https://lore.kernel.org/lkml/20210511233403.GA2896757@paulmck-ThinkPad-P17-Gen-1/
>     Tested-by: Chao Gao <chao.gao@...el.com>
>     Signed-off-by: Paul E. McKenney <paulmck@...nel.org>
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 316027c3aadc..61d2436ae9df 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -600,6 +600,14 @@
>                         loops can be debugged more effectively on production
>                         systems.
>
> +       clocksource.max_cswd_coarse_reads= [KNL]
> +                       Number of consecutive clocksource_watchdog()
> +                       coarse reads (that is, clocksource reads that
> +                       were unduly delayed) that are permitted before
> +                       the kernel complains (gently).  Set to a value
> +                       less than or equal to zero to suppress these
> +                       complaints.
> +
>         clocksource.max_cswd_read_retries= [KNL]
>                         Number of clocksource_watchdog() retries due to
>                         external delays before the clock will be marked
> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> index 1d42d4b17327..3e925d9ffc31 100644
> --- a/include/linux/clocksource.h
> +++ b/include/linux/clocksource.h
> @@ -110,6 +110,7 @@ struct clocksource {
>         int                     rating;
>         enum clocksource_ids    id;
>         enum vdso_clock_mode    vdso_clock_mode;
> +       unsigned int            n_coarse_reads;
>         unsigned long           flags;
>
>         int                     (*enable)(struct clocksource *cs);
> @@ -291,6 +292,7 @@ static inline void timer_probe(void) {}
>  #define TIMER_ACPI_DECLARE(name, table_id, fn)         \
>         ACPI_DECLARE_PROBE_ENTRY(timer, name, table_id, 0, NULL, 0, fn)
>
> +extern int max_cswd_coarse_reads;
>  extern ulong max_cswd_read_retries;
>  void clocksource_verify_percpu(struct clocksource *cs);
>
> diff --git a/kernel/time/clocksource-wdtest.c b/kernel/time/clocksource-wdtest.c
> index df922f49d171..7e82500c400b 100644
> --- a/kernel/time/clocksource-wdtest.c
> +++ b/kernel/time/clocksource-wdtest.c
> @@ -145,13 +145,12 @@ static int wdtest_func(void *arg)
>                 else if (i <= max_cswd_read_retries)
>                         s = ", expect message";
>                 else
> -                       s = ", expect clock skew";
> +                       s = ", expect coarse-grained clock skew check and re-initialization";
>                 pr_info("--- Watchdog with %dx error injection, %lu retries%s.\n", i, max_cswd_read_retries, s);
>                 WRITE_ONCE(wdtest_ktime_read_ndelays, i);
>                 schedule_timeout_uninterruptible(2 * HZ);
>                 WARN_ON_ONCE(READ_ONCE(wdtest_ktime_read_ndelays));
> -               WARN_ON_ONCE((i <= max_cswd_read_retries) !=
> -                            !(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
> +               WARN_ON_ONCE(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE);
>                 wdtest_ktime_clocksource_reset();
>         }
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index b8a14d2fb5ba..796a127aabb9 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -199,6 +199,9 @@ void clocksource_mark_unstable(struct clocksource *cs)
>         spin_unlock_irqrestore(&watchdog_lock, flags);
>  }
>
> +int max_cswd_coarse_reads = 100;
> +module_param(max_cswd_coarse_reads, int, 0644);
> +EXPORT_SYMBOL_GPL(max_cswd_coarse_reads);
>  ulong max_cswd_read_retries = 3;
>  module_param(max_cswd_read_retries, ulong, 0644);
>  EXPORT_SYMBOL_GPL(max_cswd_read_retries);
> @@ -226,13 +229,22 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
>                                 pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
>                                         smp_processor_id(), watchdog->name, nretries);
>                         }
> -                       return true;
> +                       cs->n_coarse_reads = 0;
> +                       return false;
>                 }
> +               WARN_ONCE(max_cswd_coarse_reads > 0 &&
> +                         !(++cs->n_coarse_reads % max_cswd_coarse_reads),
> +                         "timekeeping watchdog on CPU%d: %s %u consecutive coarse-grained reads\n", smp_processor_id(), watchdog->name, cs->n_coarse_reads);
>         }
>
> -       pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
> -               smp_processor_id(), watchdog->name, wd_delay, nretries);
> -       return false;
> +       if ((cs->flags & CLOCK_SOURCE_WATCHDOG) && !atomic_read(&watchdog_reset_pending)) {
> +               pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, coarse-grained skew check followed by re-initialization\n",
> +                       smp_processor_id(), watchdog->name, wd_delay, nretries);
> +       } else {
> +               pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, awaiting re-initialization\n",
> +                       smp_processor_id(), watchdog->name, wd_delay, nretries);
> +       }
> +       return true;
>  }
>
>  static u64 csnow_mid;
> @@ -356,6 +368,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>         int next_cpu, reset_pending;
>         int64_t wd_nsec, cs_nsec;
>         struct clocksource *cs;
> +       bool coarse;
>         u32 md;
>
>         spin_lock(&watchdog_lock);
> @@ -373,16 +386,13 @@ static void clocksource_watchdog(struct timer_list *unused)
>                         continue;
>                 }
>
> -               if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
> -                       /* Clock readout unreliable, so give it up. */
> -                       __clocksource_unstable(cs);
> -                       continue;
> -               }
> +               coarse = cs_watchdog_read(cs, &csnow, &wdnow);
>
>                 /* Clocksource initialized ? */
>                 if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
>                     atomic_read(&watchdog_reset_pending)) {
> -                       cs->flags |= CLOCK_SOURCE_WATCHDOG;
> +                       if (!coarse)
> +                               cs->flags |= CLOCK_SOURCE_WATCHDOG;
>                         cs->wd_last = wdnow;
>                         cs->cs_last = csnow;
>                         continue;
> @@ -403,7 +413,13 @@ static void clocksource_watchdog(struct timer_list *unused)
>                         continue;
>
>                 /* Check the deviation from the watchdog clocksource. */
> -               md = cs->uncertainty_margin + watchdog->uncertainty_margin;
> +               if (coarse) {
> +                       md = 62500 * NSEC_PER_USEC;
> +                       cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
> +                       pr_warn("timekeeping watchdog on CPU%d: %s coarse-grained %lu.%03lu ms clock-skew check followed by re-initialization\n", smp_processor_id(), watchdog->name, md / NSEC_PER_MSEC, md % NSEC_PER_MSEC / NSEC_PER_USEC);
> +               } else {
> +                       md = cs->uncertainty_margin + watchdog->uncertainty_margin;
> +               }
>                 if (abs(cs_nsec - wd_nsec) > md) {
>                         pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
>                                 smp_processor_id(), cs->name);