linux-kernel - Re: 2.6.32.21 - uptime related crashes?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20110514190423.GA2264@nik-comp.lan>
Date:	Sat, 14 May 2011 21:04:23 +0200
From:	Nikola Ciprich <nikola.ciprich@...uxbox.cz>
To:	Willy Tarreau <w@....eu>
Cc:	Faidon Liambotis <paravoid@...ian.org>,
	linux-kernel@...r.kernel.org, stable@...nel.org,
	seto.hidetoshi@...fujitsu.com,
	Hervé Commowick <hcommowick@...sec.fr>,
	Randy Dunlap <rdunlap@...otime.net>, Greg KH <greg@...ah.com>,
	Ben Hutchings <ben@...adent.org.uk>,
	Apollon Oikonomopoulos <apoikos@...il.com>, chronidev@...il.com
Subject: Re: 2.6.32.21 - uptime related crashes?

Hello gentlemans,
Nicolas, thanks for further report, it contradicts my theory that problem occured somewhere during 2.6.32.16.
Now I think I know why several of my other machines running 2.6.32.x for long time didn't crashed:

I checked bugzilla entry for (I believe the same) problem here:
https://bugzilla.kernel.org/show_bug.cgi?id=16991
and Peter Zijlstra asked there, whether reporters systems were running some RT tasks. Then I realised that all of my four crashed boxes were pacemaker/corosync clusters and pacemaker uses lots of RT priority tasks. So I believe this is important, and might be reason why other machines seem to be running rock solid - they are not running any RT tasks.
It also might help with hunting this bug. Is somebody of You also running some RT priority tasks on inflicted systems, or problem also occured without it?
Cheers!
n.

PS: Hidetoshi-san - btw, (late) thanks for Your reply and confirmation that Your patch should not be the cause of this problem. I'm now sure it must have emerged sooner, and I'm sorry for this false accusation :)

On Sat, Apr 30, 2011 at 10:14:36PM +0200, Willy Tarreau wrote:
> Hi Faidon,
> 
> On Sat, Apr 30, 2011 at 08:39:05PM +0300, Faidon Liambotis wrote:
> > If I may add some, hopefully useful, information to the thread:
> > 
> > At work we have an HP c7000 blade enclosure. It's populated with 8 ProLiant
> > BL460c G1 (Xeon E5405, constant_tsc but not nonstop_tsc) and 4 ProLiant 
> > BL460c
> > G6 (Xeon E5504, constant_tsc + nonstop_tsc). All were booted at the same day
> > with Debian's 2.6.32-23~bpo50+1 kernel, i.e. upstream 2.6.32.21.
> > 
> > We too experienced problems with just the G6 blades at near 215 days 
> > uptime (on the 19th of April), all at the same time. From our 
> > investigation, it seems that their cpu_clocks jumped suddenly far in the 
> > future and then almost immediately rolled over due to wrapping around 
> > 64-bits.
> > 
> > Although all of their (G6s) clocks wrapped around *at the same time*, only 
> > one
> > of them actually crashed at the time, with a second one crashing just a few
> > days later, on the 28th.
> > 
> > Three of them had the following on their logs:
> > Apr 18 20:56:07 hn-05 kernel: [17966378.581971] tap0: no IPv6 routers 
> > present
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.365550] BUG: soft lockup - CPU#4 
> > stuck for 17163091968s! [kvm:25913]
> > [...]
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447275]  [<ffffffff81010b42>] ? 
> > system_call_fastpath+0x16/0x1b
> > Apr 19 10:18:32 hn-05 kernel: [   31.587025] bond0.13: received packet with 
> > own address as source address
> > (full oops at the end of this mail)
> > 
> > Note that the 17163091968s time stuck was *the same* (+/- 1s) in *all
> > three of them*.
> > 
> > What's also very strange, although we're not very sure if related, is
> > that when we took the first line of the above log entry and substracted
> > 17966378.581971 from Apr 18 20:56:07, this resulted in a boot-time that
> > differed several hours from the actual boot time (the latter being
> > verified both with syslog and /proc/bstat's btime, which were both in
> > agreement). This was verified post-mortem too, with the date checked to
> > be correct.
> 
> Well, your report is by far the most complete of our 3 since you managed
> to get this trace of wraping time.
> 
> I don't know if it means anything to anyone, but 17163091968 is exactly
> 0x3FF000000, or 1023*2^24. Too round to be a coincidence. This number
> of seconds corresponds to 1000 * 2^32 jiffies at 250 Hz.
> 
> > IOW, we have some serious clock drift (calcuated it at runtime to
> > ~0.1s/min) on these machines that hasn't been made apparent probably
> > since they all run NTP. Moreover, we also saw that drift on other
> > machines (1U, different vendor, different data center, E5520 CPUs) but
> > not with the G1 blades. Our investigation showed that the drift is there
> > constantly (it's not a one-time event) but we're not really sure if it's
> > related with the Apr 18th jump event.  Note that the drift is there even
> > if we boot with "clocksource=hpet" but disappears when booted with
> > "notsc". Also note that we've verified with 2.6.38 and the drift is
> > still there.
> > 
> > Regards,
> > Faidon
> > 
> > 1: The full backtrace is:
> > Apr 18 20:56:07 hn-05 kernel: [17966378.581971] tap0: no IPv6 routers 
> > present
> > Apr 19 06:25:02 hn-05 kernel: imklog 3.18.6, log source = /proc/kmsg 
> > started.
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.365550] BUG: soft lockup - CPU#4 
> > stuck for 17163091968s! [kvm:25913]
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447056] Modules linked in: 
> > xt_pkttype ext2 tun kvm_intel kvm nf_conntrack_ipv6 ip6table_filter 
> > ip6_tables xt_tcpudp nf_conntrack_ipv4 nf_
> > defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables 8021q 
> > garp bridge stp bonding dm_round_robin dm_multipath scsi_dh ipmi_poweroff 
> > ipmi_devintf loop snd_pcm snd_ti
> > mer snd soundcore bnx2x psmouse snd_page_alloc serio_raw sd_mod crc_t10dif 
> > hpilo ipmi_si ipmi_msghandler container evdev crc32c pcspkr shpchp 
> > pci_hotplug libcrc32c power_meter mdio
> > button processor ext3 jbd mbcache dm_mirror dm_region_hash dm_log 
> > dm_snapshot dm_mod usbhid hid cciss ehci_hcd uhci_hcd qla2xxx 
> > scsi_transport_fc usbcore nls_base tg3 libphy scsi_
> > tgt scsi_mod thermal fan thermal_sys [last unloaded: scsi_wait_scan]
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447111] CPU 4:
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447112] Modules linked in: 
> > xt_pkttype ext2 tun kvm_intel kvm nf_conntrack_ipv6 ip6table_filter 
> > ip6_tables xt_tcpudp nf_conntrack_ipv4 nf_
> > defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables 8021q 
> > garp bridge stp bonding dm_round_robin dm_multipath scsi_dh ipmi_poweroff 
> > ipmi_devintf loop snd_pcm snd_timer snd soundcore bnx2x psmouse 
> > snd_page_alloc serio_raw sd_mod crc_t10dif hpilo ipmi_si ipmi_msghandler 
> > container evdev crc32c pcspkr shpchp pci_hotplug libcrc32c power_meter mdio 
> > button processor ext3 jbd mbcache dm_mirror dm_region_hash dm_log 
> > dm_snapshot dm_mod usbhid hid cciss ehci_hcd uhci_hcd qla2xxx 
> > scsi_transport_fc usbcore nls_base tg3 libphy scsi_tgt scsi_mod thermal fan 
> > thermal_sys [last unloaded: scsi_wait_scan]
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447154] Pid: 25913, comm: kvm 
> > Not tainted 2.6.32-bpo.5-amd64 #1 ProLiant BL460c G6
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447157] RIP: 
> > 0010:[<ffffffffa02e209a>]  [<ffffffffa02e209a>] 
> > kvm_arch_vcpu_ioctl_run+0x785/0xa44 [kvm]
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447177] RSP: 
> > 0018:ffff88070065fd38  EFLAGS: 00000202
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447179] RAX: ffff88070065ffd8 
> > RBX: ffff8804154ba860 RCX: ffff8804154ba860
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447182] RDX: ffff8804154ba8b9 
> > RSI: ffff81004c818b10 RDI: ffff8100291a7d48
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447184] RBP: ffffffff8101166e 
> > R08: 0000000000000000 R09: 0000000000000000
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447187] R10: 00007f1b2a2af078 
> > R11: ffffffff802f3405 R12: 0000000000000001
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447189] R13: 00000000154ba8b8 
> > R14: ffff8804136ac000 R15: ffff8804154bbd48
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447192] FS:  
> > 0000000042cbb950(0000) GS:ffff88042e440000(0000) knlGS:0000000000000000
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447195] CS:  0010 DS: 002b ES: 
> > 002b CR0: 0000000080050033
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447197] CR2: 0000000000000008 
> > CR3: 0000000260c41000 CR4: 00000000000026e0
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447200] DR0: 0000000000000000 
> > DR1: 0000000000000000 DR2: 0000000000000000
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447202] DR3: 0000000000000000 
> > DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447205] Call Trace:
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447215]  [<ffffffffa02e2004>] ? 
> > kvm_arch_vcpu_ioctl_run+0x6ef/0xa44 [kvm]
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447224]  [<ffffffff8100f79c>] ? 
> > __switch_to+0x285/0x297
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447231]  [<ffffffffa02d49d1>] ? 
> > kvm_vcpu_ioctl+0xf1/0x4e6 [kvm]
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447237]  [<ffffffff810240da>] ? 
> > lapic_next_event+0x18/0x1d
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447245]  [<ffffffff8106fb77>] ? 
> > tick_dev_program_event+0x2d/0x95
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447251]  [<ffffffff81047e29>] ? 
> > finish_task_switch+0x3a/0xaf
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447258]  [<ffffffff810f9f5a>] ? 
> > vfs_ioctl+0x21/0x6c
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447261]  [<ffffffff810fa4a8>] ? 
> > do_vfs_ioctl+0x48d/0x4cb
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447268]  [<ffffffff81063fcd>] ? 
> > sys_timer_settime+0x233/0x283
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447272]  [<ffffffff810fa537>] ? 
> > sys_ioctl+0x51/0x70
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447275]  [<ffffffff81010b42>] ? 
> > system_call_fastpath+0x16/0x1b
> > Apr 19 10:18:32 hn-05 kernel: [   31.587025] bond0.13: received packet with 
> > own address as source address
> 
> Regards,
> Willy
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

-- 
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.:   +420 596 603 142
fax:    +420 596 621 273
mobil:  +420 777 093 799

www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@...uxbox.cz
-------------------------------------

Content of type "application/pgp-signature" skipped