linux-kernel - Re: Solid freezes with 2.6.25


Message-Id: <20080428092513.495378af.akpm@linux-foundation.org>
Date:	Mon, 28 Apr 2008 09:25:13 -0700
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	Gabor Gombas <gombasg@...aki.hu>
Cc:	linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...e.hu>,
	Thomas Gleixner <tglx@...utronix.de>,
	Bernhard Walle <bwalle@...e.de>
Subject: Re: Solid freezes with 2.6.25

On Mon, 28 Apr 2008 16:29:35 +0200 Gabor Gombas <gombasg@...aki.hu> wrote:

> Hi,
> 
> I'm seeing solid freezes with 2.6.25. 2.6.24.x works fine, 2.6.25 never
> had an uptime longer than 4-6 hours so far. netconsole captured the
> following:
> 
> NMI Watchdog detected LOCKUP on CPU 1
> CPU 1 
> Modules linked in: edd netconsole configfs i915 radeon drm rfcomm l2cap bluetooth xfrm_user xfrm4_tunnel tunnel4 ipcomp esp4 aead ah4 nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs ipt_ULOG microcode ipt_REJECT nf_conntrack_ipv4 xt_state nf_conntrack xt_tcpudp ipt_LOG xt_limit iptable_filter ip_tables x_tables deflate zlib_deflate zlib_inflate ctr twofish twofish_common camellia serpent blowfish des_generic cbc aes_x86_64 aes_generic xcbc sha256_generic sha1_generic md5 crypto_null af_key fuse dm_crypt crypto_blkcipher dm_snapshot dm_mirror dm_mod coretemp w83627ehf hwmon_vid snd_hda_intel snd_pcm 8250_pnp snd_timer 8250 sg snd 8139too serial_core video r8169 snd_page_alloc usbhid i2c_i801 sr_mod iTCO_wdt floppy cdrom [last unloaded: netconsole]
> Pid: 2535, comm: postgres Not tainted 2.6.25 #11
> RIP: 0010:[<ffffffff8021aa54>]  [<ffffffff8021aa54>] hpet_rtc_interrupt+0x11a/0x2fd
> RSP: 0000:ffff81012fc77ec8  EFLAGS: 00200097
> RAX: 0000000000000000 RBX: 0000000000200002 RCX: 0000000000000000
> RDX: 000000000000c6c6 RSI: 0000000000200002 RDI: ffffffff80655ef8
> RBP: 000000010011144c R08: ffffffffff5fc128 R09: 0000000000000000
> R10: 0000000000200046 R11: 0000000000000000 R12: 00000000000000a6
> R13: ffff81012fcf8800 R14: 0000000000000000 R15: 0000000000000000
> FS:  0000000000000000(0000) GS:ffff81012fc0f480(0063) knlGS:00000000f7f228e0
> CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> CR2: 00000000f1559000 CR3: 0000000128cd8000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process postgres (pid: 2535, threadinfo ffff810128d18000, task ffff81012cbb6930)
> Stack:  0000000000000000 0000000000000000 0000000000000000 0000000000000000
>  ffffffff00000000 0000000000000001 ffffffff806432c0 ffff81012fe25bc0
>  0000000000000000 0000000000000000 0000000000000008 ffffffff8025d6d0
> Call Trace:
>  <IRQ>  [<ffffffff8025d6d0>] ? handle_IRQ_event+0x25/0x53
>  [<ffffffff8025ec3a>] ? handle_edge_irq+0xdd/0x11c
>  [<ffffffff8020c0cc>] ? call_softirq+0x1c/0x28
>  [<ffffffff8020e26a>] ? do_IRQ+0xf1/0x15f
>  [<ffffffff8020b451>] ? ret_from_intr+0x0/0xa
>  <EOI> 
> 
> Code: a0 28 00 bf 0a 00 00 00 48 89 c3 e8 73 6b ff ff 48 89 de 41 88 c4 48 c7 c7 f8 5e 65 80 e8 14 a1 28 00 45 84 e4 78 04 eb 12 f3 90 <48> 8b 05 25 1e 3e 00 48 29 e8 48 83 f8 04 76 ee 48 c7 c7 f8 5e 
> ---[ end trace 8625c90c6582673f ]---
> Kernel panic - not syncing: Aiee, killing interrupt handler!
> 
> Also, I have these messages in syslog:
> 
> Apr 28 13:13:31 boogie kernel: rtc: lost 157 interrupts
> Apr 28 13:13:32 boogie kernel: rtc: lost 37 interrupts
> Apr 28 13:25:37 boogie kernel: rtc: lost 60 interrupts
> 
> More info about the machine is attached. I've also seen similar hangs with
> 2.6.25-rc6 on an nforce4/Athlon64 box but I'm reluctant to re-test there
> because RAID rebuild takes too long.

I don't see any loop in hpet_rtc_interrupt() which can lock up, so I assume
that for some reason we stop clearing the interrupt source and we
continuously re-enter the interrupt handler.
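
A toy user-space model of that hypothesis may make it easier to picture. It
is only an illustration and is not kernel code: ToyInterruptLine, dispatch(),
good_handler() and buggy_handler() are hypothetical names, and max_entries is
an assumed stand-in for the NMI watchdog that eventually reports the lockup.
The sketch assumes an interrupt request that stays pending until the handler
clears the source.

```python
class ToyInterruptLine:
    """Hypothetical interrupt source: stays pending until it is cleared."""

    def __init__(self):
        self.pending = False

    def raise_irq(self):
        self.pending = True

    def clear_source(self):
        self.pending = False


def dispatch(line, handler, max_entries=1000):
    """Re-enter handler for as long as the request is pending.

    Returns how many times the handler ran. max_entries is an assumed cap
    standing in for the watchdog; real hardware has no such limit.
    """
    entries = 0
    while line.pending and entries < max_entries:
        handler(line)
        entries += 1
    return entries


def good_handler(line):
    # Acknowledges the source, so the request goes away after one entry.
    line.clear_source()


def buggy_handler(line):
    # Does its work but never clears the source, so it is re-entered forever.
    pass


if __name__ == "__main__":
    line = ToyInterruptLine()
    line.raise_irq()
    print("good handler entries:", dispatch(line, good_handler))
    line.raise_irq()
    print("buggy handler entries:", dispatch(line, buggy_handler))
```

With the handler that clears the source, dispatch() returns after a single
entry; with the one that does not, it only stops when the artificial cap is
reached, which is the kind of behaviour the CPU would show until the watchdog
fires.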

I think this could also happen if someone runs
hpet_unregister_irq_handler() while the hpet is still active.

Ugly.  If it was sanely reproducible then you could perhaps bisect it, but
two hours makes that unfeasible :(

Suspicion would have to be directed at the 2.6.25 CONFIG_HPET_EMULATE_RTC
changes.

I think our best bet here would be to persuade someone who knows what's
going on in there to prepare a debugging patch for you to run with
(please).  See if we can find out what the code is doing at the time when
it freezes up.
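
While waiting for a debugging patch, the "rtc: lost N interrupts" syslog
lines quoted above could be tallied to see whether they cluster before a
freeze. The following is a minimal sketch under assumptions: the message
format is taken only from the three lines in the report, and
lost_rtc_interrupts() is a hypothetical helper, not an existing tool.

```python
import re

# Pattern assumed from the syslog lines quoted in the report.
LOST_RE = re.compile(r"rtc: lost (\d+) interrupts")


def lost_rtc_interrupts(lines):
    """Return the interrupt counts from 'rtc: lost N interrupts' lines."""
    counts = []
    for line in lines:
        match = LOST_RE.search(line)
        if match:
            counts.append(int(match.group(1)))
    return counts


if __name__ == "__main__":
    # Sample input copied from the report; other lines are ignored.
    sample = [
        "Apr 28 13:13:31 boogie kernel: rtc: lost 157 interrupts",
        "Apr 28 13:13:32 boogie kernel: rtc: lost 37 interrupts",
        "Apr 28 13:25:37 boogie kernel: rtc: lost 60 interrupts",
    ]
    counts = lost_rtc_interrupts(sample)
    print("messages:", len(counts), "total lost:", sum(counts))
```

Feeding it a whole log file (for example, the lines of an opened syslog)
gives a quick count per boot to compare against the time of each freeze.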
