linux-kernel - Re: Hangs and reboots under high loads, oops with DEBUG

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <46AE0F66.7000504@intel.com>
Date:	Mon, 30 Jul 2007 09:18:46 -0700
From:	"Kok, Auke" <auke-jan.h.kok@...el.com>
To:	Attila Nagy <bra@....hu>
CC:	linux-kernel@...r.kernel.org
Subject: Re: Hangs and reboots under high loads, oops with DEBUG_SHIRQ

Attila Nagy wrote:
> Hello,
> 
> I have four identical machines, based on Supermicro X7DBE motherboards.
> All of the machines have two Xeon 5130 CPUs, 16GB RAM and two add-on
> cards: an Areca 1261 SATA RAID HBA and a Qlogic QLA2342 fibre channel HBA.
> 
> I would like to use these as file servers (via FC), but during the 
> performance and
> reliabilty tests it turned out that the machines are very unreliable, 
> despite that they
> seemed to be OK hardware-wise (memtest and the usual stuff).
> 
> During the debugging of this (seemingly) high IO load related problem, I 
> have
> observed the following:
> - when MSI is enabled (the first iteration), the machines sometimes 
> "hang", but
> not the whole system, just the SCSI target subsystem (SCST), which makes
> heavy IO on the Areca (arcmsr) and the Qlogic (patched qla2xxx) HBAs
> - when MSI is disabled, I couldn't reproduce that hung up state, instead the
> machines sometimes throw an MCE (see below), but I couldn't find its cause
> - when MSI is disabled, and CONFIG_DEBUG_SHIRQ is enabled, the machines
> can't even boot normally, I get an oops instantly during the kernel 
> initialization
> - with MSI disabled sometimes the machines fail to respond, the ssh 
> sessions terminate
> and on the console I can't type for very long seconds. I have nearly all 
> debugging turned on,
> but can't see anything in the logs or on the console. The machine 
> recovers from this hang
> automatically. The whole thing seems like when a high (eg. network) 
> interrupt activity happens
> on a highly loaded machine, but I could observe this even after a fresh 
> boot, without anything
> (of course minus the standard stuff, sshd, and the others) running on 
> the machine.
> 
> The kernel is 2.6.21.5 (I've tried 2.6.18, the effects are the same), 
> running in 64 bit mode.

something is definately not happy on this system. There was a e1000 fix related 
to DEBUG_SHIRQ in 2.6.22, so I definately advise you to test 2.6.22.1 
immediately - however:

> The oops I get with MSI disabled and CONFIG_DEBUG_SHIRQ enabled:
> 
> [   92.681320] NET: Registered protocol family 17
> [   93.491658] Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
> [   93.557402]  [<0000000000000000>]
> [   93.626770] PGD 0
> [   93.651106] Oops: 0010 [1] SMP
> [   93.689170] CPU 1
> [   93.713506] Modules linked in:
> [   93.750322] Pid: 1, comm: swapper Not tainted 2.6.21.5 #1
> [   93.815011] RIP: 0010:[<0000000000000000>]  [<0000000000000000>]
> [   93.887187] RSP: 0018:ffff81042fc5dc68  EFLAGS: 00010002
> [   93.950836] RAX: ffff81042fbe6b70 RBX: 0000000000000202 RCX: ffff81042fbe6b70
> [   94.036323] RDX: ffffc20000040000 RSI: ffff81042f51cdf8 RDI: ffff81042fbe6800
> [   94.121812] RBP: ffff81042fc5dd10 R08: 0000000000000000 R09: ffff81042f4c0ea8
> [   94.207298] R10: 0000000000000000 R11: ffff81042fbe6800 R12: 00000000fffffff4
> [   94.292788] R13: ffff81042fbe6000 R14: 0000000000000001 R15: ffffffff80399450
> [   94.378275] FS:  0000000000000000(0000) GS:ffff81042fc694c8(0000) knlGS:0000000000000000
> [   94.475307] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [   94.544153] CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
> [   94.629643] Process swapper (pid: 1, threadinfo ffff81042fc5c000, task ffff81042fc58040)
> [   94.726673] Stack:  ffffffff80399559 ffff81042fc5dca0 ffffffff802121b2 ffff81042f4c0ea0
> [   94.823603]  ffff81042fc5dca0 ffffffff80221cd7 ffffffff802addb5 ffff81042fbe6800
> [   94.913042]  ffffffff8020c9bc ffff81042fbe6b70 0000000000000246 ffff81042fbe6b70
> [   95.000194] Call Trace:
> [   95.031814]  [<ffffffff80399559>] e1000_intr+0x109/0x590
> [   95.095461]  [<ffffffff802121b2>] poison_obj+0x42/0x60
> [   95.157027]  [<ffffffff80221cd7>] dbg_redzone1+0x17/0x30
> [   95.220676]  [<ffffffff802addb5>] request_irq+0x95/0x150
> [   95.284324]  [<ffffffff8020c9bc>] cache_alloc_debugcheck_after+0x17c/0x1c0
> [   95.366690]  [<ffffffff8020a43d>] kmem_cache_alloc+0xcd/0xf0
> [   95.434500]  [<ffffffff80399450>] e1000_intr+0x0/0x590
> [   95.496067]  [<ffffffff802ade00>] request_irq+0xe0/0x150
> [   95.559716]  [<ffffffff8039558c>] e1000_request_irq+0x3c/0x80
> [   95.628564]  [<ffffffff803985bc>] e1000_open+0x5c/0x100
> [   95.691172]  [<ffffffff8041d937>] dev_open+0x37/0x80
> [   95.750661]  [<ffffffff8041becd>] dev_change_flags+0x6d/0x150
> [   95.819508]  [<ffffffff80616565>] ip_auto_config+0x175/0xea0
> [   95.887317]  [<ffffffff80442f88>] tcp_set_default_congestion_control+0x18/0x70
> [   95.973947]  [<ffffffff80442fcf>] tcp_set_default_congestion_control+0x5f/0x70
> [   96.060582]  [<ffffffff80265236>] _spin_unlock+0x26/0x30
> [   96.124227]  [<ffffffff805f1754>] init+0x1a4/0x2b0
> [   96.181635]  [<ffffffff802a0e7b>] trace_hardirqs_on+0x14b/0x180
> [   96.252563]  [<ffffffff8025ff28>] child_rip+0xa/0x12
> [   96.312051]  [<ffffffff8026563b>] _spin_unlock_irq+0x2b/0x40
> [   96.379859]  [<ffffffff8025f63c>] restore_args+0x0/0x30
> [   96.442467]  [<ffffffff805f15b0>] init+0x0/0x2b0
> [   96.497795]  [<ffffffff8025ff1e>] child_rip+0x0/0x12
> [   96.557282]
> [   96.575170]
> [   96.575171] Code:  Bad RIP value.
> [   96.633203] RIP  [<0000000000000000>]
> [   96.677297]  RSP <ffff81042fc5dc68>
> [   96.719105] CR2: 0000000000000000
> [   96.758835] Kernel panic - not syncing: Attempted to kill init!
> 
> 
> MCE:
> [153103.918654] HARDWARE ERROR
> [153103.918655] CPU 1: Machine Check Exception:                5 Bank 0: 
> b200004010000400
> [153104.066037] RIP !INEXACT! 10:<ffffffff802569e6> {mwait_idle+0x46/0x60}
> [153104.145699] TSC 1167e915e93ce
> [153104.183554] This is not a software problem!
> [153104.234724] Run through mcelog --ascii to decode and contact your 
> hardware vendor
> [153104.325517]
> [153104.325518] HARDWARE ERROR
> [153104.325519] CPU 1: Machine Check Exception:                5 Bank 5: 
> b200221024080400
> [153104.472883] RIP !INEXACT! 10:<ffffffff802569e6> {mwait_idle+0x46/0x60}
> [153104.552546] TSC 1167e915e9ea8
> [153104.590402] This is not a software problem!
> [153104.641572] Run through mcelog --ascii to decode and contact your 
> hardware vendor
> [153104.732365] Kernel panic - not syncing: Machine check

this is serious problems that might not be resolved and be a symptom of a true 
hardware issue. Looking at the time it seems an issue on itself and unrelated to 
the e1000 debug_shirq fix.

> I've got exactly the same errors (only the TSC and the CPU value 
> changing) on all four machines,
> could this really be a hardware error?

yes

> full dmesg: http://people.fsn.hu/~bra/linux/x7dbe-20070730/with_debug_shirq
> dmesg with MSI enabled: 
> http://people.fsn.hu/~bra/linux/x7dbe-20070730/with_msi_and_debug_shirq
> kernel config: http://people.fsn.hu/~bra/linux/x7dbe-20070730/config
> 
> I've tried to disable all possible devices which consume interrupts,
> and placed the cards into various slots (the FC HBA is PCI-X, which the 
> Areca
> is PCI-E). Currently the arcmsr and (one of, it's a dual channel
> HBA) qla2xxx are on a shared IRQ.
> 
> Could you please help?
> Do you think this is related to the strange hang under high IO load, the
> occasional, complete "blackouts", where all ssh network sessions
> time out, but the machine recovers, and the MCEs?


Like I said, please try 2.6.22 which should have the e1000 issue fixed. The MCE 
looks real and a different problem

Auke
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/