Date:	Sun, 22 Oct 2006 15:19:20 +0200
From:	Andi Kleen <ak@...e.de>
To:	Robert Hancock <hancockr@...w.ca>
Cc:	linux-kernel@...r.kernel.org, linux-ide@...r.kernel.org
Subject: Re: [PATCH] sata_nv ADMA/NCQ support for nForce4 (updated) II

On Sunday 22 October 2006 09:09, Robert Hancock wrote:
> Andi Kleen wrote:
> > Andi Kleen <ak@...e.de> writes:
> > 
> >> I tested it on an NF4-Professional system with 8GB RAM and a single 
> >> SATA disk. It first did fine in LTP and some other tests,
> >> but during a bonnie++ run it eventually blocked, with all
> >> IO hanging forever. No output either. I did a full backtrace
> >> and it just showed the processes waiting for an IO wakeup.
> > 
> > Hmm, to follow myself up: after a few more minutes the machine recovered
> > and I could log in again (overall, the stall lasted at least 5 minutes,
> > though).
> > 
> > Not sure whom to blame; the IO driver might actually be innocent,
> > and this may just be one of the usual known-but-unfixed IO starvation
> > problems.
> > 
> > -Andi
> 
> Hmm.. The system hanging up for 5 minutes 

I've actually seen this before on different systems. Under some IO loads,
writes can be starved for that long, and they block the calling process.
Normally it only happened when a very slow IO device (like slow USB
storage) was involved.
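
For context, the mechanism that usually does the blocking is dirty-page
throttling: the task doing the writes is made to wait until writeback to
the backing device makes progress. A rough sketch of the idea (the helper
names here are made up for illustration; only blk_congestion_wait() is a
real 2.6.x interface, and this is not the actual mm/page-writeback.c code):

	/*
	 * Sketch only: a task that keeps dirtying pages is throttled
	 * until enough dirty pages have been written back to the
	 * device.  With a very slow device this wait can last a long
	 * time.  over_dirty_threshold() and kick_writeback() are
	 * illustrative names, not real kernel functions.
	 */
	static void throttle_dirtying_task(struct backing_dev_info *bdi)
	{
		while (over_dirty_threshold(bdi)) {
			kick_writeback(bdi);			/* start async writeout */
			blk_congestion_wait(WRITE, HZ / 10);	/* wait for the device */
		}
	}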

e.g., a typical trace from the hang:

sshd          D ffff810001072b20     0 11554   3381         11556 11127 (NOTLB)
 ffff810114dffb08 0000000000000086 5353535353535353 5353535353535353
 5353535353535353 000000000000057e ffff81014b344af0 ffff81014b404770
 000001ede41c7140 ffff81014b344cc8 5353535300000001 5353535353535353
Call Trace:
 [<ffffffff802e8cc2>] start_this_handle+0x2f4/0x37b
 [<ffffffff802e8e16>] journal_start+0xcd/0x105
 [<ffff81014b11e800>]
DWARF2 unwinder stuck at 0xffff81014b11e800
Leftover inexact backtrace:
 [<ffffffff802da5f5>] ext3_dirty_inode+0x28/0x7b
 [<ffffffff80291bbb>] __mark_inode_dirty+0x2c/0x17d
 [<ffffffff80256611>] do_generic_mapping_read+0x3b0/0x3c2
 [<ffffffff80255415>] file_read_actor+0x0/0xd6
 [<ffffffff80256d8b>] generic_file_aio_read+0x164/0x1b8
 [<ffffffff80278774>] do_sync_read+0xc9/0x10c
 [<ffffffff80241ecc>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80531aef>] cond_resched+0x34/0x3b
 [<ffffffff80258f27>] __alloc_pages+0x5e/0x2ae
 [<ffffffff8027365a>] cache_alloc_refill+0xf1/0x1f8
 [<ffffffff80278ade>] vfs_read+0xa8/0x14e
 [<ffffffff8027bbbd>] kernel_read+0x38/0x4c
 [<ffffffff8027d6d4>] do_execve+0x105/0x1f9
 [<ffffffff80207bc9>] sys_execve+0x33/0x8b
 [<ffffffff80209857>] stub_execve+0x67/0xb0


I've got quite a lot of processes in journal_start -> start_this_handle.
I suppose they're waiting for the transaction to finish.
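
Roughly what that wait looks like: start_this_handle() puts each caller to
sleep, uninterruptibly, until the running transaction is no longer locked
down for commit. A simplified sketch modelled on jbd (not the literal
source):

	/*
	 * Sketch of the wait in jbd's start_this_handle(): while the
	 * running transaction is locked down for commit, anyone asking
	 * for a new handle sleeps in state D, which is exactly how the
	 * tasks show up in the sysrq-T dump above.
	 */
	while (journal->j_running_transaction &&
	       journal->j_running_transaction->t_state == T_LOCKED) {
		DEFINE_WAIT(wait);

		prepare_to_wait(&journal->j_wait_transaction_locked,
				&wait, TASK_UNINTERRUPTIBLE);
		spin_unlock(&journal->j_state_lock);
		schedule();		/* woken when the commit finishes */
		spin_lock(&journal->j_state_lock);
		finish_wait(&journal->j_wait_transaction_locked, &wait);
	}

So if the commit is itself stuck behind the starved writes, every sleeper
here stalls with it, which would match the pile-up.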

> and then recovering seems  
> rather odd; as far as I know, the timeouts in libata are all quite a bit 
> shorter than that. Was there anything unusual in dmesg? 

Nothing in dmesg except for the sysrqs I triggered.
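
(For reference, a dump like the one above comes from sysrq-t, i.e.
Alt-SysRq-T on the console or "echo t > /proc/sysrq-trigger", which prints
the state and backtrace of every task.)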

> If the IO  
> commands weren't completing at the driver level, then I would expect the 
> error handling to kick in in some fashion...

And print something, yes.

For now, I assume it wasn't the driver.

Thanks for your work.

-Andi

 
