Message-ID: <CAOgv-zi74c9DPm8rQc2zYE=xLyMCU18g845oT02TeP5oZsJaQA@mail.gmail.com>
Date: Mon, 27 Feb 2012 08:38:05 -0600
From: Gerard Saraber <gsaraber@...il.com>
To: Jan Kara <jack@...e.cz>
Cc: linux-kernel@...r.kernel.org, xfs@....sgi.com
Subject: Re: Soft lockup problem
Hi everyone,
It finally did it again. It took longer than I expected, and it
locked itself up so badly that I couldn't get into it to hit
ctrl+alt+sysrq+w.
I had turned on the debugging feature that automatically logs hung
tasks, and I've attached the log below. I hope it's helpful.
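For anyone who hits the same thing while the box still responds, here
is a rough sketch of how the blocked tasks could be listed from a
shell. It is not what produced the attached log; list_blocked is just
a name I made up, and the commented commands assume procps ps and a
kernel built with hung task detection.

```shell
#!/bin/sh
# Sketch only: keep the tasks in uninterruptible sleep (state D), which
# are the ones that inflate the load average while the box looks idle.
# list_blocked is a hypothetical helper name, not part of any tool.
list_blocked() {
    # Expects "PID STAT WCHAN COMMAND" lines on stdin and prints the
    # ones whose STAT column starts with D.
    awk '$2 ~ /^D/ { print }'
}

# Usage on a live system (assumes procps ps):
#   ps -eo pid,stat,wchan:32,comm | list_blocked
#
# As root, the kernel stacks of blocked tasks can be dumped with sysrq-w:
#   echo w > /proc/sysrq-trigger && dmesg | tail -n 200
#
# With hung task detection built in, the timeout can be read from:
#   cat /proc/sys/kernel/hung_task_timeout_secs
```

The filter only reads its input, so it is safe to run on a machine
that is already struggling.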
I was running 3.2.4 from kernel.org on a 4 core Xeon machine:
model name : Intel(R) Xeon(R) CPU 5140 @ 2.33GHz
6GB Ram
2x Intel 80003ES2LAN Gigabit Ethernet Controllers bonded together
2 LSI SAS controllers:
08:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E
PCI-Express Fusion-MPT SAS (rev 08)
0a:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS
2008 [Falcon] (rev 03)
16 drives in a mix of 2TB and 3TB, in 3 raid5 arrays combined
together with LVM into a 23TB volume formatted with XFS:
/dev/mapper/pool-main 23T 12T 11T 52%
The root partition is ext4 on an older SATA drive. I bring this up
because when I hit (on a whim) ctrl+sysrq+J, which is supposed to
unfreeze frozen filesystems, the console started dumping lots of
messages about attempting to unfreeze /dev/sda3 [my root partition],
so maybe there's a problem with my sda drive.
But I get no I/O or other errors in my logs at all. I monitor all
drives with smartd to head off any drive failures before they happen,
and it seems to think sda is fine.
Hopefully my attached log helps.
I appreciate any input. Also, please call me an idiot if I'm missing
something obvious.
-Gerard Saraber
On Tue, Feb 7, 2012 at 10:54 AM, Jan Kara <jack@...e.cz> wrote:
> On Tue 07-02-12 10:35:37, Gerard Saraber wrote:
>> On Mon, Feb 6, 2012 at 4:51 PM, Jan Kara <jack@...e.cz> wrote:
>> > On Mon 06-02-12 09:40:45, Gerard Saraber wrote:
>> >> Greetings everyone,
>> >> I've been having a bit of a problem since upgrading to the linux 3.x
>> >> series, I have a machine that we're using as a NAS that runs various
>> >> rsync processes (mostly at night), lately after a day or two, I will
>> >> come in in the morning to a load average of 49, but the machine not
>> >> really doing anything, when trying to run 'dstat' the command just
>> >> hung with no output at all. there were no errors in the logs, or even
>> >> anything that would vaguely point at anything I could work with.
>> >> So needing to get the machine back to work I attempted to reboot it
>> >> "shutdown -r now" on console... it gives a nice message saying it's
>> >> going to reboot, but nothing ever happens.. the only way to reboot it
>> >> is by using ctrl + alt + sysrq + b. after which the machine reboots
>> >> and the raid array comes back clean.
>> >>
>> >> I'm not sure how to troubleshoot this, any pointers would be appreciated.
>> >>
>> >> I'm compiling 3.2.4 at the moment and found a bunch of possibly useful
>> >> options in the kernel debugging section:
>> >> detect hard/soft lockups and detect hung tasks, maybe it'll give me
>> >> something more to go on.
>> >>
>> >> Some details about the machine:
>> >> Linux xenbox 3.2.2 #1 SMP Sun Jan 29 10:28:22 CST 2012 x86_64 Intel(R)
>> >> Xeon(R) CPU 5140 @ 2.33GHz GenuineIntel GNU/Linux
>> >> It has 3 software raid arrays (2 x 5 drives and 1 x 4 drives) LVM'ed
>> >> together into a 23TB XFS filesystem.
>> >> 6GB memory and a pair of Intel Gigabit ethernet controllers bonded together.
>> > Hmm, might be some deadlock in the filesystem. Adding XFS guys to CC.
>> > Can you run 'echo w >/proc/sysrq-trigger' and post output of dmesg here?
>> >
>> > Honza
>> > --
>> > Jan Kara <jack@...e.cz>
>> > SUSE Labs, CR
>>
>> Thanks for the quick reply,
>> the machine is running good at the moment so I'm not sure if the
>> output helps, but here it is:
>> [I'll also be sure to grab this log the next time it locks]
> Yeah. Sorry, I was not clear but I meant you should grab the traces when
> the machine locks up again...
> Honza
>
> --
> Jan Kara <jack@...e.cz>
> SUSE Labs, CR
View attachment "hlog.txt" of type "text/plain" (17943 bytes)