Message-ID: <20230111175056.GW4028633@paulmck-ThinkPad-P17-Gen-1>
Date: Wed, 11 Jan 2023 09:50:56 -0800
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: linux-kernel@...r.kernel.org, john.stultz@...aro.org,
sboyd@...nel.org, corbet@....net, Mark.Rutland@....com,
maz@...nel.org, kernel-team@...a.com, neeraju@...eaurora.org,
ak@...ux.intel.com, feng.tang@...el.com, zhengjun.xing@...el.com,
Waiman Long <longman@...hat.com>,
John Stultz <jstultz@...gle.com>
Subject: Re: [PATCH clocksource 5/6] clocksource: Suspend the watchdog temporarily when high read latency detected

On Wed, Jan 11, 2023 at 12:26:58PM +0100, Thomas Gleixner wrote:
> On Wed, Jan 04 2023 at 17:07, Paul E. McKenney wrote:
> > This can be reproduced by running memory intensive 'stream' tests,
> > or some of the stress-ng subcases such as 'ioport'.
> >
> > The reason for these issues is that when the system is under heavy load, the
> > read latency of the clocksources can be very high. Even lightweight TSC
> > reads can show high latencies, and latencies are much worse for external
> > clocksources such as HPET or the ACPI PM timer. These latencies can
> > result in false-positive clocksource-unstable determinations.
> >
> > Given that the clocksource watchdog is a continual diagnostic check with
> > frequency of twice a second, there is no need to rush it when the system
> > is under heavy load. Therefore, when high clocksource read latencies
> > are detected, suspend the watchdog timer for 5 minutes.
>
> We should have enough heuristics in place by now to qualify the output of
> the clocksource watchdog as a random number generator, right?

Glad to see that you are still keeping up your style, Thomas! ;-)

We really do see the occasional clocksource skew in our fleet, and
sometimes it really is the TSC that is in disagreement with atomic-clock
time. And the watchdog does detect these, for example, the 40,000
parts-per-million case discussed recently. We therefore need a way to
check this, but without producing false positives on busy systems and
without the current kneejerk reflex of disabling TSC, thus rendering the
system useless from a performance standpoint for some important workloads.

Yes, if a system was 100% busy forever, this patch would suppress these
checks. But 100% busy forever is not the common case, due to thermal
throttling and to security updates if nothing else.
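
For readers following along, the policy under discussion can be sketched
roughly as below. This is a standalone illustration, not the code from the
patch: the names wd_state, wd_tick, and the WD_* constants are hypothetical,
and the 100-microsecond read-latency threshold is an assumption made only
for the example. Only the twice-a-second period and the 5-minute suspension
come from the commit log quoted above.

```c
#include <stdint.h>

/* Hypothetical constants. The period and the suspension length follow the
 * quoted commit log; the latency threshold is an assumed value. */
#define WD_PERIOD_NS   (500ULL * 1000 * 1000)            /* twice a second */
#define WD_SUSPEND_NS  (5ULL * 60 * 1000 * 1000 * 1000)  /* 5 minutes */
#define WD_MAX_READ_NS (100ULL * 1000)                   /* assumed threshold */

enum wd_action {
	WD_CHECK_SKEW,        /* read was fast enough: compare the clocksources */
	WD_SKIP_AND_SUSPEND,  /* read was slow: skip the check and back off */
};

/* Hypothetical watchdog state: when the next check should run. */
struct wd_state {
	uint64_t next_run_ns;
};

/*
 * Decide what one watchdog tick should do, given the current time and the
 * measured clocksource read latency, and schedule the next tick.
 */
static enum wd_action wd_tick(struct wd_state *wd, uint64_t now_ns,
			      uint64_t read_latency_ns)
{
	if (read_latency_ns > WD_MAX_READ_NS) {
		/* System looks busy: do not judge the clocksource now. */
		wd->next_run_ns = now_ns + WD_SUSPEND_NS;
		return WD_SKIP_AND_SUSPEND;
	}

	/* Normal case: run the skew check and rearm at the usual period. */
	wd->next_run_ns = now_ns + WD_PERIOD_NS;
	return WD_CHECK_SKEW;
}
```

In this sketch a single slow read pushes the next check out by 5 minutes,
so a system that shows high read latency on every tick would never reach
the skew comparison, which is the 100%-busy-forever case mentioned above.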

With all that said, is there a better way to get the desired effects of
this patch?

Thanx, Paul