linux-kernel - Re: Linux 2.6.25-rc4

Message-Id: <200803061438.22205.bzolnier@gmail.com>
Date:	Thu, 6 Mar 2008 14:38:21 +0100
From:	Bartlomiej Zolnierkiewicz <bzolnier@...il.com>
To:	Jens Axboe <jens.axboe@...cle.com>
Cc:	Ingo Molnar <mingo@...e.hu>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	"Rafael J. Wysocki" <rjw@...k.pl>,
	Anders Eriksson <aeriksson@...tmail.fm>
Subject: Re: Linux 2.6.25-rc4

On Thursday 06 March 2008, Jens Axboe wrote:
> On Thu, Mar 06 2008, Ingo Molnar wrote:
> > 
> > * Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> > 
> > > In particular, the block layer changes should hopefully have sorted 
> > > themselves out, and CD burning etc hopefully works for people again. 
> > 
> > hm, tonight's randconfig bootrun produced a failing (soft-hung) kernel 
> > after about 120 iterations - and the log i captured _seems_ to indicate 
> > some block IO (or libata) completion weirdness.
> > 
> > unfortunately, it's not readily reproducible, and i triggered it with 
> > about 100 sched.git and 300 x86.git patches applied. BUT, virtually the 
> > same 100+300 patches queue produced a successful 1000+ randconfig 
> > testrun over the last weekend so i'm reasonably sure the regression is 
> > new and came in via upstream. Also, the config is UP (and it's a rather 
> > simple config in other aspects as well), so this must be something 
> > rather fundamental, not an SMP race.
> > 
> > I just spent about an hour trying to figure out a pattern but the bug 
> > just doesnt reproduce after 20 bootup attempts with the same bzImage. 
> > When it hung then it hung for hours, so the condition is permanent.
> > 
> > I've attached the bootup log which includes the SysRq-T output and the 
> > config. The hang seems to occur because an rc.sysinit task is not coming 
> > back from io_schedule():
> > 
> > rc.sysinit    D f75bcc24     0  1922   1893
> >        f761c810 00000086 f75bcd38 f75bcc24 1954bff5 00000015 f7746000 f761c974 
> >        f761c974 f7c17698 c180e7a8 f7747cc4 00000000 f7747ccc c180e7a8 c097bff7 
> >        c01a3acb c097c27d c01a3aa0 f7872a90 00000002 c01a3aa0 f7747e48 c097c2fc 
> > Call Trace:
> >  [<c097bff7>] io_schedule+0x37/0x70
> >  [<c01a3acb>] sync_buffer+0x2b/0x30
> >  [<c097c27d>] __wait_on_bit+0x4d/0x80
> >  [<c01a3aa0>] sync_buffer+0x0/0x30
> >  [<c01a3aa0>] sync_buffer+0x0/0x30
> >  [<c097c2fc>] out_of_line_wait_on_bit+0x4c/0x60
> >  [<c0142340>] wake_bit_function+0x0/0x40
> >  [<c01a3a51>] __wait_on_buffer+0x21/0x30
> >  [<c0209915>] ext3_bread+0x55/0x70
> >  [<c020cff8>] ext3_find_entry+0x258/0x660
> >  [<c03a0026>] avc_has_perm+0x46/0x50
> >  [<c03a0d14>] inode_has_perm+0x44/0x80
> >  [<c020de69>] ext3_lookup+0x29/0xa0
> >  [<c0189f90>] do_lookup+0x130/0x180
> >  [<c018b540>] __link_path_walk+0x340/0xd50
> >  [<c03a0d14>] inode_has_perm+0x44/0x80
> >  [<c018bf8a>] link_path_walk+0x3a/0xa0
> >  [<c016feb4>] __do_fault+0x1a4/0x3d0
> >  [<c018c1b7>] do_path_lookup+0x77/0x210
> >  [<c018cb57>] __user_walk_fd+0x27/0x40
> >  [<c01860d5>] vfs_stat_fd+0x15/0x40
> >  [<c016feb4>] __do_fault+0x1a4/0x3d0
> >  [<c01861ef>] sys_stat64+0xf/0x30
> >  [<c0125a5d>] do_page_fault+0x2ad/0x670
> >  [<c03db6cc>] trace_hardirqs_on_thunk+0xc/0x10
> >  [<c0115a5f>] sysenter_past_esp+0x5f/0x90
> >  =======================
> 
> Sorry, I have _no_ ideas on what this could be. We haven't really had
> any related changes in the block layer over that short a time frame. It
> could of course have been introduced earlier, since it seems to be quite
> elusive.
> 
> Presumably any hw issues would get noticed (like missing interrupt) and
> trigger the error handler, so it looks like this IO is still stuck in
> the queue somewhere. That mainly points the finger at AS, but given that
> you cannot reproduce I'm not sure how best to proceed with this...

The problem looks very similar to the one recently reported by Anders:

http://lkml.org/lkml/2008/2/22/239

"Trying out 2.6.25-rc2 smartd always causes my box to hang. I can switch 
vt:s and the keyboard seems to work.

Using sysrq-e I noticed a callpath open -> ext3 -> journals -> sync_buffer -> 
io_schedule -> generic_unplug_device.
I'd guess the open stems from smartd. Removing smartd from the startup, I'm 
now using rc2 fine..."
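
As a rough aid for comparing the two reports, here is a minimal sketch in
Python. It is not taken from either report, and the helper names
frame_symbols and shared_symbols are made up for illustration. It pulls the
function names out of a SysRq-T style call trace and checks which entries of
the other report's call path also appear in it:

```python
import re

# Matches SysRq-T style frames such as " [<c097bff7>] io_schedule+0x37/0x70".
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def frame_symbols(trace_text):
    """Return the function names in a call trace, in order, without repeats."""
    seen = []
    for match in FRAME_RE.finditer(trace_text):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen


def shared_symbols(trace_text, reported_path):
    """Return the names from reported_path that also appear in the trace."""
    symbols = set(frame_symbols(trace_text))
    return [name for name in reported_path if name in symbols]


# A few frames copied from the rc.sysinit trace quoted above.
ingo_trace = """
 [<c097bff7>] io_schedule+0x37/0x70
 [<c01a3acb>] sync_buffer+0x2b/0x30
 [<c097c27d>] __wait_on_bit+0x4d/0x80
 [<c01a3a51>] __wait_on_buffer+0x21/0x30
 [<c0209915>] ext3_bread+0x55/0x70
"""

# Call path as described in Anders' report (kernel function names only).
anders_path = ["sync_buffer", "io_schedule", "generic_unplug_device"]

print(shared_symbols(ingo_trace, anders_path))
```

Overlapping names only suggest that both tasks are stuck waiting on a buffer
in the same way; they do not show that the root cause is the same.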

Initially we thought that it was an IDE regression, but after some testing
and further investigation it seems that the IDE changes just made it more
likely to occur (yesterday it turned out that a kernel with the "guilty"
IDE commit sometimes boots fine).  The issue is easily reproducible.

PS We've already verified that it is neither PREEMPT-related nor a compiler bug.

Thanks,
Bart