[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-id: <4A8AAE44.5030008@lih.rwth-aachen.de>
Date: Tue, 18 Aug 2009 15:36:04 +0200
From: Victor Mataré <matare@....rwth-aachen.de>
To: john stultz <johnstul@...ibm.com>
Cc: linux-kernel@...r.kernel.org
Subject: Re: clock freezes??
john stultz wrote:
> On Tue, 2009-08-11 at 17:39 +0200, Victor Mataré wrote:
>> john stultz wrote:
>>> On Fri, Jul 24, 2009 at 7:07 AM, Victor Mataré<matare@....rwth-aachen.de> wrote:
>>>> I have a dual Xeon server (old Xeon HT) with an Intel E7505 chipset,
>>>> with hrtimer and dynticks enabled. On bootup, the kernel
>>>> (2.6.29-gentoo-r5) tells me it's using the PM-Timer bug workaround, but
>>>> then it uses tsc as clocksource. Now the clock was running slow for
>>>> about 15sec/12hrs, which is quite a lot. So in a careless moment, I just
>>>> tried "echo jiffies > clocksource0/current_clocksource". This froze the
>>>> system time. Now I couldn't switch back to tsc or acpi_pm, echoing those
>>>> was just ignored. Subsequently, the entire system locked up and I needed
>>>> to reboot.
>>>>
>>>> Now what does that mean? Is this supposed to happen? Should I disable
>>>> dynticks and/or hrtimer?
>>> The system lockup is a known issue and should be resolved with the
>>> following commit:
>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=3f68535adad8dd89499505a65fb25d0e02d118cc
>>>
>>> I might be curious if you could expand a bit more about the clock skew
>>> (15sec per 12 hours) you're seeing. Are you running NTP? Do you have
>>> the output of ntpdc -c kerninfo , ntpdc -c peers? Do you see lots of
>>> ntp messages in /var/log/messages or /var/log/syslog ?
>>>
>>> thanks
>>> -john
>> Until now, I was just using BSD netdate, which kept adding 12-25 seconds
>> every 24 hours.
>> The whole issue is related to strange lockups I had been seeing about
>> monthly, apparently everytime the clock was rewound instead of put
>> forward (clock freezes, programs hang, system ends up deadlocked within
>> 10-300 minutes depending on usage). The system is a production
>> fileserver acting mainly as a Samba PDC, so testing this scenario is
>> quite difficult. Now recently, I swapped the motherboard including RAM
>> and CPU with our webserver, which seems to have removed the monthly
>> time-freeze, but led to the above-mentioned freeze caused by me
>> experimenting with clocksource=jiffies because of the slow clock.
>> However, the monthly freezes upon rewinding the clock may be gone now
>> just because the clock is consistently running slow, so it doesn't need
>> to be rewound any more.
>> I've just switched both systems to ntpd:
>>
>> # ntpdc -c kerninfo
>> pll offset: 0 s
>> pll frequency: 0.000 ppm
>> maximum error: 16 s
>> estimated error: 16 s
>> status: 0040 unsync
>> pll time constant: 4
>> precision: 1e-06 s
>> frequency tolerance: 500 ppm
>
> So here its reporting you're still in unsync mode, so ntp hasn't run
> long enough to do any corrections. You might let the system run with
> ntpd for a few hours (or a day) and then try re-try the "ntpdc -c
> kerninfo" command.
>
>
>> # ntpdc -c peers -n
>> remote local st poll reach delay offset disp
>> =======================================================================
>> =134.130.4.17 137.226.164.2 1 64 377 0.00061 0.156253 0.03084
>> *134.130.5.17 137.226.164.2 1 64 377 0.00035 0.205312 0.03041
>>
>> Dunno how to interpret that. Syslog now gives:
>>
>> Aug 11 17:13:02 bussard ntpd[21845]: ntpd 4.2.4p7@...607-o Tue Jun 23
>> 10:58:51 UTC 2009 (1)
>> Aug 11 17:13:02 bussard ntpd[21874]: precision = 1.000 usec
>> Aug 11 17:13:02 bussard ntpd[21874]: Listening on interface #0 wildcard,
>> 0.0.0.0#123 Disabled
>> Aug 11 17:13:02 bussard ntpd[21874]: Listening on interface #1 lo,
>> 127.0.0.1#123 Enabled
>> Aug 11 17:13:02 bussard ntpd[21874]: Listening on interface #2 eth0,
>> 137.226.164.2#123 Enabled
>> Aug 11 17:13:02 bussard ntpd[21874]: Listening on interface #3 eth0:1,
>> 192.168.23.3#123 Enabled
>> Aug 11 17:13:02 bussard ntpd[21874]: kernel time sync status 0040
>> ...
>> Aug 11 17:16:18 bussard ntpd[21874]: synchronized to 134.130.4.17, stratum 1
>> Aug 11 17:16:32 bussard ntpd[21874]: time reset +13.979355 s
>> ...
>> Aug 11 17:20:43 bussard ntpd[21874]: synchronized to 134.130.5.17, stratum 1
>
>
>
> I'd also be interested in seeing the syslog data as well after a day or
> so of running ntpd.
>
>> However, the issue of the clock freezing upon time-rewind still remains
>> quite unclear to me. Can it be caused by the careless way in which
>> netdate does it? Can it be related to the jiffies-hrtimer issue?
>
> I'm not familiar with netdate, however the time should be able to be set
> in either forward or backward direction without the system hanging.
>
> Can you trigger the problem yourself using date -s ?
>
> thanks
> -john
>
OK, ntpd has been running for a week now, and I'm not sure what it has
been doing. Last week, on the day I set it up, there was this:
# grep ntpd\\[ /var/log/syslog
Aug 11 17:16:18 bussard ntpd[21874]: synchronized to 134.130.4.17, stratum 1
Aug 11 17:16:32 bussard ntpd[21874]: time reset +13.979355 s
Aug 11 17:20:43 bussard ntpd[21874]: synchronized to 134.130.5.17, stratum 1
Aug 11 17:32:31 bussard ntpd[21874]: time reset +0.300276 s
Aug 11 17:32:31 bussard ntpd[21874]: kernel time sync status change 0001
Aug 11 17:36:27 bussard ntpd[21874]: synchronized to 134.130.4.17, stratum 1
Aug 11 17:40:56 bussard ntpd[21874]: no servers reachable
Aug 11 18:32:16 bussard ntpd[21874]: synchronized to 134.130.5.17, stratum 1
Aug 11 18:32:16 bussard ntpd[21874]: kernel time sync status change 4001
Now after that, up until today, I see only this:
Aug 17 18:43:42 bussard ntpd[21874]: synchronized to 134.130.5.17, stratum 1
Aug 17 18:50:28 bussard ntpd[21874]: kernel time sync status change 0001
Aug 17 19:17:50 bussard ntpd[21874]: no servers reachable
Aug 17 20:09:06 bussard ntpd[21874]: synchronized to 134.130.5.17, stratum 1
Aug 17 20:09:06 bussard ntpd[21874]: kernel time sync status change 4001
Aug 17 20:32:56 bussard ntpd[21874]: kernel time sync status change 0001
Aug 17 20:43:17 bussard ntpd[21874]: no servers reachable
Aug 17 23:06:35 bussard ntpd[21874]: synchronized to 134.130.5.17, stratum 1
Aug 17 23:06:35 bussard ntpd[21874]: kernel time sync status change 4001
Aug 17 23:16:53 bussard ntpd[21874]: no servers reachable
Aug 18 11:48:13 bussard ntpd[21874]: synchronized to 134.130.5.17, stratum 1
Aug 18 12:05:18 bussard ntpd[21874]: no servers reachable
Aug 18 14:11:32 bussard ntpd[21874]: synchronized to 134.130.5.17, stratum 1
Aug 18 14:21:57 bussard ntpd[21874]: no servers reachable
# grep ntpd\\[ /var/log/syslog | wc
177 1737 12413
(Starting from August 11, when I set up ntpd)
# ntpdc -c kerninfo
pll offset: -0.002368 s
pll frequency: 272.989 ppm
maximum error: 1.51074 s
estimated error: 0.003472 s
status: 4001 pll
pll time constant: 10
precision: 1e-06 s
frequency tolerance: 500 ppm
Now I guess it does adjust the time quite frequently, although it
doesn't tell me the differences in syslog, right? At least all servers
are nicely synchronized in time, now.
However, I've never been able to manually trigger that regular
time-freeze that occurred about monthly. Note that above data are from
the system that did not experience that monthly freeze, but did freeze
on the "echo jiffies > current_clocksource" thing.
On the system that did freeze spontaneously syslog looks just about the
same, with:
# grep "ntpd\[" syslog | wc
199 1934 13332
and:
# ntpdc -c kerninfo
pll offset: -0.055492 s
pll frequency: -164.169 ppm
maximum error: 1.14918 s
estimated error: 0.018156 s
status: 4001 pll
pll time constant: 10
precision: 1e-06 s
frequency tolerance: 500 ppm
Now I might be inclined to account the apparently higher deviation to
the higher CPU clock this system uses (2x Xeon HT 3.06 GHz on Intel
E7501 vs. 2x Xeon HT 2.4 GHz on Intel E7505 in the first system. Both
have clocksource=tsc).
I'm sorry for the long delay, but my working time in this place is very
limited. I'd really appreciate some ideas on how I might:
- Narrow in on the problem once I get so lucky as to witness the machine
freezing...
- Trigger the timer freeze with something other than the jiffies thing
Both systems have now been up for 20, resp. 25 days, so the monthly
freeze (if it does occur now after swapping the mainboards) is to be
expected within the next 10 days.
Great thanks for your help and interest in this problem.
--
Victor Mataré
Server- & Netzwerk-Administration
Lehrstuhl für Ingenieurgeologie und Hydrogeologie der RWTH-Aachen
Lochnerstraße 4-20, 52064 Aachen
Tel: 0241 80 96778
Fax: 0241 80 92280
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists