[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <alpine.DEB.1.10.0901311230510.4472@p34.internal.lan>
Date: Sat, 31 Jan 2009 12:37:39 -0500 (EST)
From: Justin Piszcz <jpiszcz@...idpixels.com>
To: linux-kernel@...r.kernel.org
cc: linux-raid@...r.kernel.org, linux-ide@...r.kernel.org,
Alan Piszcz <ap@...arrain.com>, xfs@....sgi.com
Subject: Re: 2.6.28: All I/O, D-state: 3ware
9650SE-16ML/Intel/965/MTRR/PAT?
I think it was a coincidence when I tried to login before, I tried it
again (after trying to re-produce the problem, which I had been able to
do) and this time the mouse worked but the keyboard did not..even running
echo b > sysrq-trigger did not reboot the server.. Had to resort to
manual method.
I am now on kernel 2.6.28.2 and will see if I can make it happen again.
Here are the relevant traces:
http://home.comcast.net/~jpiszcz/20090131/stacktrace-cpus.txt
http://home.comcast.net/~jpiszcz/20090131/sysrq-trigger-m.txt
http://home.comcast.net/~jpiszcz/20090131/sysrq-trigger-t.txt
http://home.comcast.net/~jpiszcz/20090131/sysrq-trigger-w.txt
http://home.comcast.net/~jpiszcz/20090131/sysrq-t.txt
http://home.comcast.net/~jpiszcz/20090131/xfs.txt
The stacktace for each CPU is interesting, see below, XFS related(?)
Can anyone offer any suggestions on what to look for or how to debug this
further?
$ df -i
/dev/sda1 2539035008 1571332 2537463676 1% /r1
Should I be mounting with -o inode64?
/dev/sda1 on /r1 type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
[2604823.699927] SysRq : Show backtrace of all active CPUs
[2604823.699993] CPU0:
[2604823.699996] ffffffff807d8f38 0000000000000046 ffff880228b516e8 0000000000000000
[2604823.700000] ffff8801f4f93588 000000000000000e ffff8801a8843e98 ffffffff8046ff72
[2604823.700000] ffff880228b516c0 ffffffff80272a8c ffff8801f4f935a8 ffff8801f4f93450
[2604823.700000] Call Trace:
[2604823.700000] <IRQ> [<ffffffff8046ff72>] showacpu+0x42/0x60
[2604823.700000] [<ffffffff80272a8c>] generic_smp_call_function_interrupt+0x4c/0xf0
[2604823.700000] [<ffffffff8023855f>] smp_call_function_interrupt+0x1f/0x40
[2604823.700000] [<ffffffff8022a3bb>] call_function_interrupt+0x6b/0x70
[2604823.700000] <EOI> [<ffffffff802790d6>] find_get_pages+0x56/0xf0
[2604823.700000] [<ffffffff802790b4>] find_get_pages+0x34/0xf0
[2604823.700000] [<ffffffff80281cd7>] pagevec_lookup+0x17/0x20
[2604823.700000] [<ffffffff803e8459>] xfs_cluster_write+0x99/0x170
[2604823.700000] [<ffffffff803e8bdb>] xfs_page_state_convert+0x52b/0x6d0
[2604823.700000] [<ffffffff803e9071>] xfs_vm_writepage+0x71/0x130
[2604823.700000] [<ffffffff802849c0>] shrink_page_list+0x600/0x760
[2604823.700000] [<ffffffff80284d9f>] shrink_list+0x27f/0x640
[2604823.700000] [<ffffffff8040c9e1>] __up_write+0x21/0x150
[2604823.700000] [<ffffffff8028095d>] determine_dirtyable_memory+0xd/0x20
[2604823.700000] [<ffffffff802809cd>] get_dirty_limits+0x1d/0x290
[2604823.700000] [<ffffffff803cd468>] xfs_iomap+0x188/0x310
[2604823.700000] [<ffffffff802853a6>] shrink_zone+0x246/0x340
[2604823.700000] [<ffffffff802861de>] try_to_free_pages+0x22e/0x390
[2604823.700000] [<ffffffff80283330>] isolate_pages_global+0x0/0x270
[2604823.700000] [<ffffffff8027f472>] __alloc_pages_internal+0x212/0x4e0
[2604823.700000] [<ffffffff80281644>] __do_page_cache_readahead+0xf4/0x240
[2604823.700000] [<ffffffff80281ae2>] ondemand_readahead+0x172/0x1f0
[2604823.700000] [<ffffffff8027af39>] generic_file_aio_read+0x329/0x620
[2604823.700000] [<ffffffff803f08da>] xfs_read+0x12a/0x280
[2604823.700000] [<ffffffff802a3ed3>] do_sync_read+0xe3/0x130
[2604823.700000] [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
[2604823.700000] [<ffffffff80608577>] thread_return+0x3d/0x5c6
[2604823.700000] [<ffffffff802a4aa8>] vfs_read+0xc8/0x170
[2604823.700000] [<ffffffff802a4c53>] sys_read+0x53/0xa0
[2604823.700000] [<ffffffff8022986b>] system_call_fastpath+0x16/0x1b
[2604823.700089] CPU2:
[2604823.700091] ffff88022b8d3f38 0000000000000046 ffff880228b516e8 0000000000000002
[2604823.700094] 00007fff4e576050 00007fff4e576048 00007fff4e57605c ffffffff8046ff72
[2604823.700097] ffff880228b516c0 ffffffff80272a8c 0000000000000000 ffff880100003eb0
[2604823.700099] Call Trace:
[2604823.700101] <IRQ> [<ffffffff8046ff72>] showacpu+0x42/0x60
[2604823.700105] [<ffffffff80272a8c>] generic_smp_call_function_interrupt+0x4c/0xf0
[2604823.700108] [<ffffffff8023855f>] smp_call_function_interrupt+0x1f/0x40
[2604823.700110] [<ffffffff8022a3bb>] call_function_interrupt+0x6b/0x70
[2604823.700111] <EOI> [<ffffffff8040f426>] copy_user_generic_string+0x36/0x40
[2604823.700116] [<ffffffff8025a7e2>] sys_rt_sigprocmask+0x62/0x120
[2604823.700119] [<ffffffff8022986b>] system_call_fastpath+0x16/0x1b
p34:~#
On Sat, 31 Jan 2009, Justin Piszcz wrote:
> I have opened a case with 3ware about this, but I do not believe the 3ware
> card is at fault, it appears something to do with X or the memory subsystem.
> This is not easily reproducible and does not occur often. Maybe once or twice
> per month, if that, hopefully someone can make out what is happening in the
> trace below?
>
> KERNEL:
> # uname -a
> Linux p34.internal.lan 2.6.28 #9 SMP Thu Jan 1 08:07:31 EST 2009 x86_64
> GNU/Linux
>
> PROBLEM:
> During I/O to the 3ware RAID (at least in the past 3-4 previous cases when it
> occured), the system stops all I/O to all processes that are have I/O to
> perform on the 3ware card. I did a lot of analysis which is shown at the
> bottom of this e-mail.
>
> ODDITY:
> After all of the diagnostics I did below(!), what ended up "bringing" the
> system back to life? I moved the mouse cursor on the local machine (I run X)
> and it uses the Intel graphics (Intel DG965WH motherboard), once I logged in,
> the ENTIRE system became responsive again, what the heck!? All of the
> processes in D-state were now freed! I have taken an echo t >
> /proc/sysrq-trigger, hopefully someone can shed some light on this? It
> almost seems like the memory/MTRR(?) is stomping the other memory that is
> used by the system?
>
> QUESTION: Why??
> At the bottom of this e-mail (and also the 3ware ticket):
> Request:- 31st January 2009 at 2:24
> Just as I was about to give up, I went to the machine, woke up the
> screensaver (moved mouse) and logged in and then EVERYTHING started working
> again??
>
> NOTE:
> After all of this, I re-ran the unrar, it worked fine. My guess is X is
> doing
> something very strange with the memory? I can try running no X for awhile
> and see if that gets rid of the problem? Is that the recommended course of
> action, are there other things I can try to debug this problem further? Has
> anyone ever seen anything like this?
>
> CONFIG:
> http://home.comcast.net/~jpiszcz/20090131/config-2.6.28.txt
>
> LOGS: (from Xorg.log)
> (II) intel(0): EDID vendor "DEL", prod id 40984
> (II) intel(0): Output VGA connected
> (II) intel(0): Output VGA using initial mode 1680x1050
> (II) intel(0): detected 512 kB GTT.
> (II) intel(0): detected 7676 kB stolen memory. <- Is this of interest?
> (==) intel(0): video overlay key set to 0x101fe
> (==) intel(0): Will not try to enable page flipping
> (==) intel(0): Triple buffering disabled
> (==) intel(0): Intel XvMC decoder disabled
> (==) intel(0): Using gamma correction (1.0, 1.0, 1.0)
> (**) intel(0): Display dimensions: (430, 270) mm
> (**) intel(0): DPI set to (99, 158)
>
> KERNEL OPTIONS: (of interest?):
> | | [*] MTRR (Memory Type Range Register) support | |
> | | [*] MTRR cleanup support | |
> | | (0) MTRR cleanup enable value (0-1) | |
> | | (1) MTRR cleanup spare reg num (0-7) | |
> | | [*] x86 PAT support | |
>
> Should I consider disabling MTRR, x86 PAT support or both? To see if the
> problem recurs? Are there any other kernel options I have enabled that could
> lead to this problem?
>
> LSPCI:
> 00:00.0 Host bridge: Intel Corporation 82P965/G965 Memory Controller Hub (rev
> 02)
> 00:01.0 PCI bridge: Intel Corporation 82P965/G965 PCI Express Root Port (rev
> 02)
> 00:02.0 VGA compatible controller: Intel Corporation 82G965 Integrated
> Graphics Controller (rev 02)
> 00:03.0 Communication controller: Intel Corporation 82P965/G965 HECI
> Controller (rev 02)
> 00:19.0 Ethernet controller: Intel Corporation 82566DC Gigabit Network
> Connection (rev 02)
> 00:1a.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI
> Controller #4 (rev 02)
> 00:1a.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI
> Controller #5 (rev 02)
> 00:1a.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI
> Controller #2 (rev 02)
> 00:1b.0 Audio device: Intel Corporation 82801H (ICH8 Family) HD Audio
> Controller (rev 02)
> 00:1c.0 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 1
> (rev 02)
> 00:1c.1 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 2
> (rev 02)
> 00:1c.2 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 3
> (rev 02)
> 00:1c.3 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 4
> (rev 02)
> 00:1c.4 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 5
> (rev 02)
> 00:1d.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI
> Controller #1 (rev 02)
> 00:1d.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI
> Controller #2 (rev 02)
> 00:1d.2 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI
> Controller #3 (rev 02)
> 00:1d.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI
> Controller #1 (rev 02)
> 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev f2)
> 00:1f.0 ISA bridge: Intel Corporation 82801HH (ICH8DH) LPC Interface
> Controller (rev 02)
> 00:1f.2 SATA controller: Intel Corporation 82801HR/HO/HH (ICH8R/DO/DH) 6 port
> SATA AHCI Controller (rev 02)
> 00:1f.3 SMBus: Intel Corporation 82801H (ICH8 Family) SMBus Controller (rev
> 02)
> 01:00.0 RAID bus controller: 3ware Inc 9650SE SATA-II RAID (rev 01)
> 03:00.0 IDE interface: Marvell Technology Group Ltd. 88SE6101 single-port
> PATA133 interface (rev b1)
> 07:00.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet
> Controller (rev 02)
> 07:01.0 Ethernet controller: Intel Corporation 82541GI Gigabit Ethernet
> Controller
>
> INTERRUPTS:
> # cat /proc/interrupts
> CPU0 CPU1 CPU2 CPU3
> 0: 8722 0 0 0 IO-APIC-edge timer
> 1: 2 0 0 0 IO-APIC-edge i8042
> 9: 0 0 0 0 IO-APIC-fasteoi acpi
> 12: 3 0 0 0 IO-APIC-edge i8042
> 16: 10253239 34561425 28715428 20097192 IO-APIC-fasteoi 3w-9xxx,
> uhci_hcd:usb3
> 18: 1532 38 0 0 IO-APIC-fasteoi
> ehci_hcd:usb1, uhci_hcd:usb7
> 19: 43319 66498 0 0 IO-APIC-fasteoi
> uhci_hcd:usb6
> 21: 1503630 282221348 34906481 251529888 IO-APIC-fasteoi
> uhci_hcd:usb4, eth1
> 22: 3041085 50898734 44446192 106189021 IO-APIC-fasteoi HDA
> Intel, eth2
> 23: 13630119 27422086 1544853 13244 IO-APIC-fasteoi
> ehci_hcd:usb2, uhci_hcd:usb5
> 375: 3059314 15607219 8335846 3335608 PCI-MSI-edge ahci
> 376: 2259672 275715231 18496746 107799784 PCI-MSI-edge eth0
> 377: 2653268 4958739 0 0 PCI-MSI-edge
> i915@pci:0000:00:02.0
> NMI: 0 0 0 0 Non-maskable interrupts
> LOC: 2603771749 2610047685 2594237486 2589369293 Local timer interrupts
> RES: 5511462 1894029 4550761 1766978 Rescheduling interrupts
> CAL: 15747 15686 15135 6535 Function call interrupts
> TLB: 749115 747241 962808 893832 TLB shootdowns
> TRM: 0 0 0 0 Thermal event interrupts
> THR: 0 0 0 0 Threshold APIC interrupts
> SPU: 0 0 0 0 Spurious interrupts
> ERR: 0
> MIS: 0
>
> DETAILS (warning! very long):
>
> Request:- 31st January 2009 at 1:45
> This problem has happened 3 or 4 times now, it is very strange, it
> occurs across several kernels, the one I am running now is (2.6.28). It
> occurs across two schedulers (so it is unlikely the kernel); noop and
> deadline.
>
> Essentially what happens is all I/O is blocked during decompression of a file
> and it stays that way:
>
> Extracting from backup.r81
>
> ... backup-20090101.img
> 87%
>
> Extracting from backup.r82
>
> ... backup-20090101.img
> 88%
>
> Extracting from backup.r83
>
> ... backup-20090101.img
> 89%
>
> Extracting from backup.r84
>
> ... backup-20090101.img
> 90%
>
> Here, it is *stuck* at 90%:
>
> PID USER PR NI VIRT SWAP RES SHR S %CPU %MEM TIME P COMMAND
> 348 user 20 0 18616 16m 1572 1108 R 100 0.0 21:59 0 unrar
>
> It sits at 100%, this has happened during VERIFYs and also when the system is
> just plain idle. It seems to be a very nasty bug because the process _never_
> seems to complete.
>
> When this bug occurs, it appears all I/O to the array is blocked:
> $ dd if=/dev/zero of=create_file bs=1M count=1024
>
> It just sits there, when I try to strace the process, I get nothing:
> $ strace -p 770
> Process 770 attached - interrupt to quit
>
> I tried some things:
>
> # /opt/3ware/9500/tw_cli /c0 set verify=5
> Setting Verify Rate on /c0 to 5 (Fastest I/O)...Done.
>
> # /opt/3ware/9500/tw_cli /c0 set verify=1
> Setting Verify Rate on /c0 to 1 (Fastest Verify)...Done.
>
> # /opt/3ware/9500/tw_cli /c0 set verify=3
> Setting Verify Rate on /c0 to 3 (Balanced Verify and I/O )...Done.
>
> No difference.
>
> In its current failure mode:
> //p34> info c0
>
> Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
> ------------------------------------------------------------------------------
> u0 RAID-6 VERIFYING - 91%(A) 64K 12107.1 ON ON u1
> SPARE OK - - - 931.505 - ON
>
>
> It is verifying, maybe it will 'unblock' the I/O when its done verifying?
>
> As you can see, the box is stagnant during this problem:
> $ vmstat 1
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 5 0 83104 47396 136 6586588 0 0 580 442 3 0 1 1 98
> 0
> 5 0 83104 47380 136 6586612 0 0 0 0 4143 129 0 25 75
> 0
> 5 0 83104 47380 136 6586612 0 0 0 0 4063 147 0 25 75
> 0
> 5 0 83104 47380 136 6586612 0 0 0 0 4052 139 0 25 75
> 0
> 5 0 83104 47256 136 6586612 0 0 0 0 4137 199 0 25 75
> 0
> 5 0 83104 47256 136 6586612 0 0 0 0 4916 160 0 25 75
> 0
>
> Here is an iostat on the device (/dev/sda):
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.00 0.00 25.00 0.00 0.00 75.00
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
> avgqu-sz await svctm %util
> sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.00 0.00 25.06 0.00 0.00 74.94
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
> avgqu-sz await svctm %util
> sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.00 0.00 25.00 0.00 0.00 75.00
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
> avgqu-sz await svctm %util
> sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.00 0.00 25.06 0.00 0.00 74.94
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
> avgqu-sz await svctm %util
> sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00
>
> I will let it sit for now and see if it "comes to" after the VERIFY is
> complete. In the interim I have attached the tw_cli /c0 show diag output,
> but that looks normal as well, at least the most recent output, this looks
> like it could be a deeper bug than just something on the surface.
>
> The 25% is probably 1 of the CPUs (quad core) pegged at 100% trying to get
> I/O and the controller is not allowing it!
>
> This problem was *NEVER* experienced when I was not using the 3ware card, it
> only happened after I began using this card in my machine.
>
> # /opt/3ware/9500/tw_cli /c0 show all
> /c0 Driver Version = 2.26.02.011
> /c0 Model = 9650SE-16ML
> /c0 Available Memory = 224MB
> /c0 Firmware Version = FE9X 4.06.00.004
> /c0 Bios Version = BE9X 4.05.00.015
> /c0 Boot Loader Version = BL9X 3.08.00.001
> /c0 Serial Number = [[ snip ]]
> /c0 PCB Version = Rev 032
> /c0 PCHIP Version = 2.00
> /c0 ACHIP Version = 1.90
> /c0 Number of Ports = 16
> /c0 Number of Drives = 16
> /c0 Number of Units = 2
> /c0 Total Optimal Units = 2
> /c0 Not Optimal Units = 0 /c0 JBOD Export Policy = off
> /c0 Disk Spinup Policy = 1
> /c0 Spinup Stagger Time Policy (sec) = 1
> /c0 Auto-Carving Policy = off
> /c0 Auto-Carving Size = 2048 GB
> /c0 Auto-Rebuild Policy = on
> /c0 Controller Bus Type = PCIe
> /c0 Controller Bus Width = 8 lanes
> /c0 Controller Bus Speed = 2.5 Gbps/lane
>
> Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
> ------------------------------------------------------------------------------
> u0 RAID-6 VERIFYING - 92%(A) 64K 12107.1 ON ON u1
> SPARE OK - - - 931.505 - ON
>
> Port Status Unit Size Blocks Serial
> ---------------------------------------------------------------
> p0 OK u0 931.51 GB 1953525168 WD-...
> p1 OK u0 931.51 GB 1953525168 WD-...
> p2 OK u0 931.51 GB 1953525168 WD-...
> p3 OK u0 931.51 GB 1953525168 WD-...
> p4 OK u0 931.51 GB 1953525168 WD-...
> p5 OK u0 931.51 GB 1953525168 WD-...
> p6 OK u0 931.51 GB 1953525168 WD-...
> p7 OK u0 931.51 GB 1953525168 WD-...
> p8 OK u0 931.51 GB 1953525168 WD-...
> p9 OK u0 931.51 GB 1953525168 WD-...
> p10 OK u0 931.51 GB 1953525168 WD-...
> p11 OK u0 931.51 GB 1953525168 WD-...
> p12 OK u0 931.51 GB 1953525168 WD-...
> p13 OK u0 931.51 GB 1953525168 WD-...
> p14 OK u0 931.51 GB 1953525168 WD-...
> p15 OK u1 931.51 GB 1953525168 WD-...
>
> Name OnlineState BBUReady Status Volt Temp Hours LastCapTest
> ---------------------------------------------------------------------------
> bbu On Yes OK OK OK 255 28-Dec-2008
>
> Once this is "over" then I can attach the tw_cli diag output right now its
> locking/blocking any I/O, I cannot even ssh to the host, even though the root
> partition/drives are not on the 3ware controller!!
>
> root 8694 0.0 0.0 0 0 ? D Jan29 2:21 [pdflush]
> root 2226 0.0 0.0 10100 404 ? Ds Jan01 0:14
> /sbin/syslogd -r
>
> Lots of processes in D-state.
>
>
> Request:- 31st January 2009 at 2:8
> Nope, its still LOCKED/blocked/etc:
>
> # /opt/3ware/9500/tw_cli info c0
>
> Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
> ------------------------------------------------------------------------------
> u0 RAID-6 OK - - 64K 12107.1 ON ON u1
> SPARE OK - - - 931.505 - ON
>
> Port Status Unit Size Blocks Serial
> ---------------------------------------------------------------
> p0 OK u0 931.51 GB 1953525168 WD-
> p1 OK u0 931.51 GB 1953525168 WD-
> p2 OK u0 931.51 GB 1953525168 WD-
> p3 OK u0 931.51 GB 1953525168 WD-
> p4 OK u0 931.51 GB 1953525168 WD-
> p5 OK u0 931.51 GB 1953525168 WD-
> p6 OK u0 931.51 GB 1953525168 WD-
> p7 OK u0 931.51 GB 1953525168 WD-
> p8 OK u0 931.51 GB 1953525168 WD-
> p9 OK u0 931.51 GB 1953525168 WD-
> p10 OK u0 931.51 GB 1953525168 WD-
> p11 OK u0 931.51 GB 1953525168 WD-
> p12 OK u0 931.51 GB 1953525168 WD-
> p13 OK u0 931.51 GB 1953525168 WD-
> p14 OK u0 931.51 GB 1953525168 WD-
> p15 OK u1 931.51 GB 1953525168 WD-
>
> Name OnlineState BBUReady Status Volt Temp Hours LastCapTest
> ---------------------------------------------------------------------------
> bbu On Yes OK OK OK 255 28-Dec-2008
>
> Completely blocking all I/O..
>
> p34:~# ps auxww | grep -i unrar
> user 348 99.6 0.0 18616 1572 pts/19 R+ 04:05 60:59 unrar x
> backup01.rar
> root 865 0.0 0.0 4220 724 pts/29 S+ 05:06 0:00 grep -i
> unrar
> p34:~# kill -9 348
> p34:~# p34:~# killall -9 dd
> p34:~#
>
> Still...... BLOCKED...
>
>
>
> Request:- 31st January 2009 at 2:13
> I was able to get a call trace, when I tried to write it to a file
> though, it was BLOCKED, so this is from a free ssh session I had open:
>
> [2578818.429166] Call Trace:
> [2578818.429166] [<ffffffff80264979>] autoremove_wake_function+0x9/0x30
> [2578818.429166] [<ffffffff8024597a>] __wake_up_common+0x5a/0x90
> [2578818.429166] [<ffffffff80608ced>] schedule_timeout+0xad/0xf0
> [2578818.429166] [<ffffffff80264a83>] prepare_to_wait_exclusive+0x23/0x80
> [2578818.429166] [<ffffffff805dcad6>] unix_wait_for_peer+0x86/0xd0
> [2578818.429166] [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
> [2578818.429166] [<ffffffff80575cd7>] memcpy_fromiovec+0x57/0x80
> [2578818.429166] [<ffffffff805dd170>] unix_dgram_sendmsg+0x450/0x5a0
> [2578818.429166] [<ffffffff8056d77b>] sock_sendmsg+0xcb/0x100
> [2578818.429166] [<ffffffff802af076>] do_path_lookup+0xd6/0x1a0
> [2578818.429166] [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
> [2578818.429166] [<ffffffff802af15d>] kern_path+0x1d/0x40
> [2578818.429166] [<ffffffff8025408e>] current_fs_time+0x1e/0x30
> [2578818.429166] [<ffffffff802baa9d>] mnt_want_write+0x2d/0xa0
> [2578818.429166] [<ffffffff802bb637>] mntput_no_expire+0x27/0x140
> [2578818.429166] [<ffffffff805dccc2>] unix_find_other+0x1a2/0x200
> [2578818.429166] [<ffffffff8056cd62>] sockfd_lookup_light+0x22/0x90
> [2578818.429166] [<ffffffff8056db89>] sys_sendto+0x119/0x180
> [2578818.429166] [<ffffffff805dd5ba>] unix_dgram_connect+0xaa/0x250
> [2578818.429166] [<ffffffff8056d046>] sys_connect+0x86/0xe0
> [2578818.429166] [<ffffffff8022986b>] system_call_fastpath+0x16/0x1b
> [2578818.429166] cron S ffff88022aae7bc8 0 861 3044
> [2578818.429166] ffff8801005b9b88 0000000000000086 00007f68d6c77000
> ffffe2000453ab70
> [2578818.429166] ffffffff807cd000 ffff8801005b5260 ffff880211f8f160
> ffff8801005b5490
> [2578818.429166] 000000032abe1b60 ffff8801005b9d18 ffff8801005b5490
> ffffffff8028b45b
> [2578818.429166] Call Trace:
> [2578818.429166] [<ffffffff8028b45b>] handle_mm_fault+0x2bb/0x7a0
> [2578818.429166] [<ffffffff80608ced>] schedule_timeout+0xad/0xf0
> [2578818.429166] [<ffffffff80264a83>] prepare_to_wait_exclusive+0x23/0x80
> [2578818.429166] [<ffffffff805dcad6>] unix_wait_for_peer+0x86/0xd0
> [2578818.429166] [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
> [2578818.429166] [<ffffffff80575cd7>] memcpy_fromiovec+0x57/0x80
> [2578818.429166] [<ffffffff805dd170>] unix_dgram_sendmsg+0x450/0x5a0
> [2578818.429166] [<ffffffff8056d77b>] sock_sendmsg+0xcb/0x100
> [2578818.429166] [<ffffffff802af076>] do_path_lookup+0xd6/0x1a0
> [2578818.429166] [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
> [2578818.429166] [<ffffffff802af15d>] kern_path+0x1d/0x40
> [2578818.429166] [<ffffffff8025408e>] current_fs_time+0x1e/0x30
> [2578818.429166] [<ffffffff802baa9d>] mnt_want_write+0x2d/0xa0
> [2578818.429166] [<ffffffff802bb637>] mntput_no_expire+0x27/0x140
> [2578818.429166] [<ffffffff805dccc2>] unix_find_other+0x1a2/0x200
> [2578818.429166] [<ffffffff8056cd62>] sockfd_lookup_light+0x22/0x90
> [2578818.429166] [<ffffffff8056db89>] sys_sendto+0x119/0x180
> [2578818.429166] [<ffffffff805dd5ba>] unix_dgram_connect+0xaa/0x250
> [2578818.429166] [<ffffffff8056d046>] sys_connect+0x86/0xe0
> [2578818.429166] [<ffffffff8022986b>] system_call_fastpath+0x16/0x1b
> [2578818.429166] cron S 00000000ffffffff 0 862 3044
> [2578818.429166] ffff8801005bbb88 0000000000000086 00007f68d6c77000
> ffffe2000453ab70
> [2578818.429166] ffffffff807cd000 ffff8801005b4cf0 ffff88022b8950e0
> ffff8801005b4f20
> [2578818.429166] 0000000129928420 0000000199ae5567 ffff8801005b4f20
> ffffffff8028b45b
> [2578818.429166] Call Trace:
> [2578818.429166] [<ffffffff8028b45b>] handle_mm_fault+0x2bb/0x7a0
> [2578818.429166] [<ffffffff80608ced>] schedule_timeout+0xad/0xf0
> [2578818.429166] [<ffffffff80264a83>] prepare_to_wait_exclusive+0x23/0x80
> [2578818.429166] [<ffffffff805dcad6>] unix_wait_for_peer+0x86/0xd0
> [2578818.429166] [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
> [2578818.429166] [<ffffffff80575cd7>] memcpy_fromiovec+0x57/0x80
> [2578818.429166] [<ffffffff805dd170>] unix_dgram_sendmsg+0x450/0x5a0
> [2578818.429166] [<ffffffff8056d77b>] sock_sendmsg+0xcb/0x100
> [2578818.429166] [<ffffffff802af076>] do_path_lookup+0xd6/0x1a0
> [2578818.429166] [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
> [2578818.429166] [<ffffffff802af15d>] kern_path+0x1d/0x40
> [2578818.429166] [<ffffffff8025408e>] current_fs_time+0x1e/0x30
> [2578818.429166] [<ffffffff802baa9d>] mnt_want_write+0x2d/0xa0
> [2578818.429166] [<ffffffff802bb637>] mntput_no_expire+0x27/0x140
> [2578818.429166] [<ffffffff805dccc2>] unix_find_other+0x1a2/0x200
> [2578818.429166] [<ffffffff8056cd62>] sockfd_lookup_light+0x22/0x90
> [2578818.429166] [<ffffffff8056db89>] sys_sendto+0x119/0x180
> [2578818.429166] [<ffffffff805dd5ba>] unix_dgram_connect+0xaa/0x250
> [2578818.429166] [<ffffffff8056d046>] sys_connect+0x86/0xe0
> [2578818.429166] [<ffffffff8022986b>] system_call_fastpath+0x16/0x1b
>
>
> Request:- 31st January 2009 at 2:16
> These are the 'optimizations' I use on this array:
>
> echo 128 > /sys/block/$i/queue/max_sectors_kb # 128 is default
> echo 512 > /sys/block/$i/queue/nr_requests # 128 is default
> # echo 254 > /sys/block/$i/device/queue_depth # 254 is default
> echo deadline > /sys/block/$i/queue/scheduler # distribution dependent
> blockdev --setra 16384 /dev/$i # set readahead
>
> In the past I had been able to kill the process(es) involved or they died
> eventually, however, this does not seem to be the case this time, I still
> cannot get my system back!
>
> Request:- 31st January 2009 at 2:20
> Here is more of the echo t > sysrq-trigger:
>
> [2578818.429166] Call Trace:
> [2578818.429166] [<ffffffff805de205>] unix_stream_sendmsg+0x3b5/0x3d0
> [2578818.429166] [<ffffffff80608ced>] schedule_timeout+0xad/0xf0
> [2578818.429166] [<ffffffff80264b03>] prepare_to_wait+0x23/0x80
> [2578818.429166] [<ffffffff805ddc92>] unix_stream_recvmsg+0x432/0x5f0
> [2578818.429166] [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
> [2578818.429166] [<ffffffff802aeb6c>] __link_path_walk+0xdbc/0xf20
> [2578818.429166] [<ffffffff8056c60f>] sock_aio_read+0x14f/0x160
> [2578818.429166] [<ffffffff802a3ed3>] do_sync_read+0xe3/0x130
> [2578818.429166] [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
> [2578818.429166] [<ffffffff802a4b3e>] vfs_read+0x15e/0x170
> [2578818.429166] [<ffffffff802a4c53>] sys_read+0x53/0xa0
> [2578818.429166] [<ffffffff8022986b>] system_call_fastpath+0x16/0x1b
> [2578818.429166] sshd S 00000000ffffffff 0 32669 32667
> [2578818.429166] ffff8801c4b15998 0000000000000086 ffffffff80755008
> ffffffff8040c2c0
> [2578818.429166] ffffffff807cd000 ffff88022b968700 ffff88022b8f2bb0
> ffff88022b968930
> [2578818.429166] 0000000328057c18 0000000199b05a5c ffff88022b968930
> ffff8801c4b159a8
> [2578818.429166] Call Trace:
> [2578818.429166] [<ffffffff8040c2c0>] rb_insert_color+0x100/0x130
> [2578818.429166] [<ffffffff8026800f>] hrtimer_start_range_ns+0xdf/0x1b0
> [2578818.429166] [<ffffffff80267eb9>] hrtimer_try_to_cancel+0x39/0x80
> [2578818.429166] [<ffffffff80609415>] schedule_hrtimeout_range+0xb5/0x150
> [2578818.429166] [<ffffffff802678e0>] hrtimer_wakeup+0x0/0x30
> [2578818.429166] [<ffffffff802b35d9>] do_select+0x539/0x630
> [2578818.429166] [<ffffffff802b3d30>] __pollwait+0x0/0x130
> [2578818.429166] [<ffffffff80249e30>] default_wake_function+0x0/0x10
> [2578818.429166] [<ffffffff80249e30>] default_wake_function+0x0/0x10
> [2578818.429166] [<ffffffff80249e30>] default_wake_function+0x0/0x10
> [2578818.429166] [<ffffffff80249e30>] default_wake_function+0x0/0x10
> [2578818.429166] [<ffffffff80575f57>] memcpy_toiovec+0x57/0x80
> [2578818.429166] [<ffffffff8060a529>] _spin_lock_bh+0x9/0x20
> [2578818.429166] [<ffffffff8056fcc3>] release_sock+0x13/0xa0
> [2578818.429166] [<ffffffff805ab7dc>] tcp_recvmsg+0x32c/0x940
> [2578818.429166] [<ffffffff8056f110>] sock_common_recvmsg+0x30/0x50
> [2578818.429166] [<ffffffff8056c60f>] sock_aio_read+0x14f/0x160
> [2578818.429166] [<ffffffff802b38cf>] core_sys_select+0x1ff/0x310
> [2578818.429166] [<ffffffff80240bb0>] do_page_fault+0x2f0/0x980
> [2578818.429166] [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
> [2578818.429166] [<ffffffff80249e30>] default_wake_function+0x0/0x10
> [2578818.429166] [<ffffffff8026aa98>] getnstimeofday+0x48/0xc0
> [2578818.429166] [<ffffffff802308b9>] read_tsc+0x9/0x20
> [2578818.429166] [<ffffffff8026aa98>] getnstimeofday+0x48/0xc0
> [2578818.429166] [<ffffffff802b3c71>] sys_select+0x51/0x110
> [2578818.429166] [<ffffffff8022986b>] system_call_fastpath+0x16/0x1b
> [2578818.429166] bash S 00000000ffffffff 0 32670 32669
> [2578818.429166] ffff88022607de88 0000000000000082 ffff8800b773cc60
> 00000000006c4e5c
> [2578818.429166] ffffffff807cd000 ffff8801006a8410 ffffffff8071a340
> ffff8801006a8640
> [2578818.429166] 00000000006a8410 000000019977482f ffff8801006a8640
> ffffffff807258c8
> [2578818.429166] Call Trace:
> [2578818.429166] [<ffffffff8045be2b>] tty_check_change+0x11b/0x130
> [2578818.429166] [<ffffffff8025191e>] session_of_pgrp+0xe/0x50
> [2578818.429166] [<ffffffff8045c6f2>] tty_ioctl+0x8b2/0x920
> [2578818.429166] [<ffffffff80252872>] do_wait+0x272/0x350
> [2578818.429166] [<ffffffff80249e30>] default_wake_function+0x0/0x10
> [2578818.429166] [<ffffffff802529e5>] sys_wait4+0x95/0xf0
> [2578818.429166] [<ffffffff8022986b>] system_call_fastpath+0x16/0x1b
> [2578818.429166] sr S 00000000ffffffff 0 32677 32670
> [2578818.429166] ffff88008f50fe88 0000000000000086 ffff88010ccebbb8
> 00000000006c4e5c
> [2578818.429166] ffffffff807cd000 ffff880148a6f260 ffffffff8071a340
> ffff880148a6f490
> [2578818.429166] 0000000048a6f260 000000019977482f ffff880148a6f490
> ffffffff807258c8
> [2578818.429166] Call Trace:
> [2578818.429166] [<ffffffff80252872>] do_wait+0x272/0x350
> [2578818.429166] [<ffffffff80249e30>] default_wake_function+0x0/0x10
> [2578818.429166] [<ffffffff802529e5>] sys_wait4+0x95/0xf0
> [2578818.429166] [<ffffffff8022986b>] system_call_fastpath+0x16/0x1b
> [2578818.429166] screen S 00000000ffffffff 0 32683 32677
> [2578818.429166] ffff88020a9e5f68 0000000000000086 ffff88022ba8f160
> ffff88020a9e5f28
> [2578818.429166] ffffffff807cd000 ffff88022ba8f160 ffff88022b8950e0
> ffff88022ba8f390
> [2578818.429166] 000000015afa2f00 0000000199b0f5e0 ffff88022ba8f390
> 00007fff5afa5be2
> [2578818.429166] Call Trace:
> [2578818.429166] [<ffffffff80253955>] alarm_setitimer+0x35/0x70
> [2578818.429166] [<ffffffff8025a2e9>] sys_pause+0x19/0x30
> [2578818.429166] [<ffffffff8022986b>] system_call_fastpath+0x16/0x1b
> [2578818.429166] sshd S 00000000ffffffff 0 32684 10156
> [2578818.429166] ffff880198c4bbe8 0000000000000082 ffff880228808000
> ffff880198c4bb68
> [2578818.429166] ffffffff807cd000 ffff880100552090 ffff88022b8f2bb0
> ffff8801005522c0
> [2578818.429166] 0000000398c4bb68 000000019977506f ffff8801005522c0
> ffff8800058e33c0
> [2578818.429166] Call Trace:
> [2578818.429166] [<ffffffff805de205>] unix_stream_sendmsg+0x3b5/0x3d0
> [2578818.429166] [<ffffffff80608ced>] schedule_timeout+0xad/0xf0
> [2578818.429166] [<ffffffff80264b03>] prepare_to_wait+0x23/0x80
> [2578818.429166] [<ffffffff805ddc92>] unix_stream_recvmsg+0x432/0x5f0
> [2578818.429166] [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
> [2578818.429166] [<ffffffff802aeb6c>] __link_path_walk+0xdbc/0xf20
> [2578818.429166] [<ffffffff8056c60f>] sock_aio_read+0x14f/0x160
> [2578818.429166] [<ffffffff802a3ed3>] do_sync_read+0xe3/0x130
> [2578818.429166] [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
> [2578818.429166] [<ffffffff802a4b3e>] vfs_read+0x15e/0x170
> [2578818.429166] [<ffffffff802a4c53>] sys_read+0x53/0xa0
> [2578818.429166] [<ffffffff8022986b>] system_call_fastpath+0x16/0x1b
> [2578818.429166] sshd S 00000000ffffffff 0 32686 32684
> [2578818.429166] ffff880040dff998 0000000000000086 ffff880040dff9a8
> ffffffff8040c250
> [2578818.429166] ffffffff807cd000 ffff8801e96411e0 ffff88022b8f2bb0
> ffff8801e9641410
> [2578818.429166] 0000000328057c18 0000000199b0a381 ffff8801e9641410
> ffff880040dff9a8
> [2578818.429166] Call Trace:
> [2578818.429166] [<ffffffff8040c250>] rb_insert_color+0x90/0x130
> [2578818.429166] [<ffffffff8026800f>] hrtimer_start_range_ns+0xdf/0x1b0
> [2578818.429166] [<ffffffff80267eb9>] hrtimer_try_to_cancel+0x39/0x80
> [2578818.429166] [<ffffffff80609415>] schedule_hrtimeout_range+0xb5/0x150
> [2578818.429166] [<ffffffff802678e0>] hrtimer_wakeup+0x0/0x30
> [2578818.429166] [<ffffffff802b35d9>] do_select+0x539/0x630
> [2578818.429166] [<ffffffff802b3d30>] __pollwait+0x0/0x130
> [2578818.429166] [<ffffffff80249e30>] default_wake_function+0x0/0x10
> [2578818.429166] [<ffffffff80249e30>] default_wake_function+0x0/0x10
> [2578818.429166] [<ffffffff80249e30>] default_wake_function+0x0/0x10
> [2578818.429166] [<ffffffff80249e30>] default_wake_function+0x0/0x10
> [2578818.429166] [<ffffffff80575f57>] memcpy_toiovec+0x57/0x80
> [2578818.429166] [<ffffffff8060a529>] _spin_lock_bh+0x9/0x20
> [2578818.429166] [<ffffffff8056fcc3>] release_sock+0x13/0xa0
> [2578818.429166] [<ffffffff805ab7dc>] tcp_recvmsg+0x32c/0x940
> [2578818.429166] [<ffffffff8056f110>] sock_common_recvmsg+0x30/0x50
> [2578818.429166] [<ffffffff8056c60f>] sock_aio_read+0x14f/0x160
> [2578818.429166] [<ffffffff802b38cf>] core_sys_select+0x1ff/0x310
> [2578818.429166] [<ffffffff8045e7a1>] n_tty_read+0x371/0x870
> [2578818.429166] [<ffffffff80264970>] autoremove_wake_function+0x0/0x30
> [2578818.429166] [<ffffffff80249e30>] default_wake_fun
>
> Request:- 31st January 2009 at 2:24
> Just as I was about to give up, I went to the machine, woke up the
> screensaver (moved mouse) and logged in and then EVERYTHING started working
> again??
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists