Message-Id: <200712091551.59369.rjw@sisk.pl>
Date: Sun, 9 Dec 2007 15:51:58 +0100
From: "Rafael J. Wysocki" <rjw@...k.pl>
To: Michael Tokarev <mjt@....msk.ru>
Cc: Linux-kernel <linux-kernel@...r.kernel.org>,
Dave Jones <davej@...hat.com>, Ingo Molnar <mingo@...e.hu>,
Thomas Gleixner <tglx@...utronix.de>
Subject: Re: clock jumps on dualcore PentiumD with cpufreq

Added some CCs.

On Sunday, 9 of December 2007, Michael Tokarev wrote:
> This isn't a new issue, but so far no solution
> has been found, it seems. And as kernel development
> goes on, the issue keeps getting worse.
>
> Up to 2.6.20 or so (I don't remember exactly, but if
> I recall correctly the issue first appeared when the new
> timer code was merged), there were no issues at
> all.
>
> Starting with 2.6.21 (at least), I see frequent messages
> from chronyd (http://chrony.sunsite.dk/), which has been
> running on this machine since day one, telling me
> that the system clock has been reset - by up to +/- several
> seconds.
>
> On 2.6.22 there were only a few such messages per day,
> and the time delta was less than a second - usually
> between 0.2 and 0.8 seconds. On 2.6.23, it became much,
> much worse. Here's a typical log from today:
>
> ...
> Dec 9 08:10:50 paltus chronyd[1761]: System clock wrong by 0.664195 seconds, adjustment started
> Dec 9 08:31:06 paltus chronyd[1761]: System clock wrong by 0.666263 seconds, adjustment started
> Dec 9 09:10:35 paltus chronyd[1761]: System clock wrong by 0.830534 seconds, adjustment started
> Dec 9 09:47:56 paltus chronyd[1761]: System clock wrong by -0.632430 seconds, adjustment started
> Dec 9 10:11:24 paltus chronyd[1761]: System clock wrong by 0.557093 seconds, adjustment started
> Dec 9 10:30:36 paltus chronyd[1761]: System clock wrong by 0.931442 seconds, adjustment started
> Dec 9 10:31:40 paltus chronyd[1761]: System clock wrong by 0.918518 seconds, adjustment started
> Dec 9 10:32:44 paltus chronyd[1761]: System clock wrong by 0.919524 seconds, adjustment started
> Dec 9 10:33:48 paltus chronyd[1761]: System clock wrong by -1.783922 seconds, adjustment started
> Dec 9 10:50:53 paltus chronyd[1761]: System clock wrong by 0.815781 seconds, adjustment started
> Dec 9 11:18:37 paltus chronyd[1761]: System clock wrong by -1.120279 seconds, adjustment started
> Dec 9 11:48:30 paltus chronyd[1761]: System clock wrong by -0.576473 seconds, adjustment started
> Dec 9 11:50:38 paltus chronyd[1761]: System clock wrong by 0.728826 seconds, adjustment started
> Dec 9 11:52:46 paltus chronyd[1761]: System clock wrong by 0.561940 seconds, adjustment started
> Dec 9 11:57:02 paltus chronyd[1761]: System clock wrong by -0.786041 seconds, adjustment started
> Dec 9 12:18:22 paltus chronyd[1761]: System clock wrong by -0.559688 seconds, adjustment started
> Dec 9 12:20:30 paltus chronyd[1761]: System clock wrong by 0.712789 seconds, adjustment started
> Dec 9 12:26:55 paltus chronyd[1761]: System clock wrong by -0.511155 seconds, adjustment started
> Dec 9 12:40:47 paltus chronyd[1761]: System clock wrong by 0.721966 seconds, adjustment started
> Dec 9 12:47:11 paltus chronyd[1761]: System clock wrong by -1.053725 seconds, adjustment started
> Dec 9 13:01:03 paltus chronyd[1761]: System clock wrong by 0.921670 seconds, adjustment started
> Dec 9 13:21:20 paltus chronyd[1761]: System clock wrong by 0.948468 seconds, adjustment started
> ...
>
> As far as I can guess, the problem is that the clock runs
> differently on different CPUs (or, rather, cores), so when
> the process (chronyd) gets moved from one core to another,
> it loses its mind and behaves like the above, unable to
> "understand" what's going on. The result is that it
> speeds up the clock on a core where it should slow the
> clock down and vice versa, resulting in a mess.
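>
> Something like the following rough sketch could check this
> guess, by pinning a thread to each core in turn and comparing
> raw TSC readings (illustrative only, not anything from the
> kernel tree; the delta also includes the time spent migrating
> between cores, so only a huge or negative value would be
> meaningful):
>
>   /* tsc-check.c - rough per-core TSC comparison.
>    * Build: gcc -O2 -o tsc-check tsc-check.c */
>   #define _GNU_SOURCE
>   #include <stdio.h>
>   #include <sched.h>
>
>   static unsigned long long rdtsc(void)
>   {
>           unsigned int lo, hi;
>           __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
>           return ((unsigned long long)hi << 32) | lo;
>   }
>
>   int main(void)
>   {
>           unsigned long long t[2];
>           int cpu;
>
>           for (cpu = 0; cpu < 2; cpu++) {
>                   cpu_set_t set;
>
>                   CPU_ZERO(&set);
>                   CPU_SET(cpu, &set);
>                   /* move ourselves onto the target core */
>                   if (sched_setaffinity(0, sizeof(set), &set))
>                           return 1;
>                   t[cpu] = rdtsc();
>           }
>           printf("TSC delta core1 - core0: %lld\n",
>                  (long long)(t[1] - t[0]));
>           return 0;
>   }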
>
> The machine is an IBM xSeries 3200 with a single dual-core
> Pentium D processor:
>
> vendor_id : GenuineIntel
> cpu family : 15
> model : 6
> model name : Intel(R) Pentium(R) D CPU 2.80GHz
> stepping : 4
> cpu MHz : 2800.000
> cache size : 2048 KB
>
> The kernel reports that the TSC clocksource is unstable and
> switches to jiffies:
>
> Clocksource tsc unstable (delta = 70021228 ns)
> Time: jiffies clocksource has been installed.
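>
> For reference, the clocksource selection can be inspected and
> overridden through sysfs, assuming this kernel exposes the
> generic clocksource interface. A trivial helper along these
> lines shows what is available and, given an argument, forces a
> different clocksource (e.g. acpi_pm, to avoid both tsc and
> jiffies):
>
>   /* clocksource.c - print available/current clocksource;
>    * with an argument, override it (needs root). */
>   #include <stdio.h>
>
>   #define CSDIR "/sys/devices/system/clocksource/clocksource0/"
>
>   int main(int argc, char **argv)
>   {
>           char buf[128];
>           FILE *f;
>
>           f = fopen(CSDIR "available_clocksource", "r");
>           if (f && fgets(buf, sizeof(buf), f))
>                   printf("available: %s", buf);
>           if (f)
>                   fclose(f);
>
>           if (argc > 1) {
>                   f = fopen(CSDIR "current_clocksource", "w");
>                   if (!f) {
>                           perror("override");
>                           return 1;
>                   }
>                   fprintf(f, "%s\n", argv[1]);
>                   fclose(f);
>           }
>
>           f = fopen(CSDIR "current_clocksource", "r");
>           if (f && fgets(buf, sizeof(buf), f))
>                   printf("current: %s", buf);
>           if (f)
>                   fclose(f);
>           return 0;
>   }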
>
> There are two things I noticed so far.
>
> First, this problem is quite strongly related to cpufreq.
> Namely, without cpufreq scaling (with the default "performance"
> governor), everything works correctly. After switching to
> the "ondemand" governor, chronyd goes mad as shown above. And
> after switching back to "performance", the problem vanishes
> again after some time (needed for the madness to slowly stop).
> By the way, I'm not exactly sure, but it seems that the above
> message (about the TSC being unstable) is triggered right when
> switching to the "ondemand" governor.
>
> And second, I've only seen this problem with a 32-bit
> kernel. An x86-64 kernel (with the exact same config) does
> not expose it in any way, with or without cpufreq.
>
> This last point is the most interesting one, because it
> suggests that the bug is somewhere in the kernel, and more
> specifically in the i386 part of it.
>
> Sadly, I have only limited testing abilities on this machine,
> as it's our main production (file) server, so I can only
> test things on it during weekends and late evenings.
>
> Similar (but much smaller in extent/scale) behavior has been
> observed on several machines with AMD X2-64 processors as well,
> also related to cpufreq, but in that case both 64-bit and
> 32-bit kernels behave the same, and it seems to be a different
> issue.
>
> Any recommendations for what I could try to debug this further, please?
>
> Thanks.
>
> /mjt