Date:	Sun, 16 Mar 2008 21:59:46 +0300
From:	Andrey Borzenkov <arvidjaar@...l.ru>
To:	Jens Axboe <jens.axboe@...cle.com>, mingo@...e.hu,
	torvalds@...ux-foundation.org, linux-kernel@...r.kernel.org
Subject: Re: Linux 2.6.25-rc4

Jens Axboe wrote:

> On Thu, Mar 06 2008, Ingo Molnar wrote:
>> 
>> * Linus Torvalds <torvalds@...ux-foundation.org> wrote:
>> 
>> > In particular, the block layer changes should hopefully have sorted
>> > themselves out, and CD burning etc hopefully works for people again.
>> 
>> hm, tonight's randconfig bootrun produced a failing (soft-hung) kernel
>> after about 120 iterations - and the log I captured _seems_ to indicate
>> some block IO (or libata) completion weirdness.
>> 
>> unfortunately, it's not readily reproducible, and I triggered it with
>> about 100 sched.git and 300 x86.git patches applied. BUT, virtually the
>> same 100+300 patch queue produced a successful 1000+ randconfig
>> testrun over the last weekend, so I'm reasonably sure the regression is
>> new and came in via upstream. Also, the config is UP (and it's a rather
>> simple config in other aspects as well), so this must be something
>> rather fundamental, not an SMP race.
>> 
>> I just spent about an hour trying to figure out a pattern, but the bug
>> just doesn't reproduce after 20 bootup attempts with the same bzImage.
>> When it hung, it hung for hours, so the condition is permanent.
>> 
>> I've attached the bootup log which includes the SysRq-T output and the
>> config. The hang seems to occur because an rc.sysinit task is not coming
>> back from io_schedule():
>> 
>> rc.sysinit    D f75bcc24     0  1922   1893
>>        f761c810 00000086 f75bcd38 f75bcc24 1954bff5 00000015 f7746000
>>        f761c974 f761c974 f7c17698 c180e7a8 f7747cc4 00000000 f7747ccc
>>        c180e7a8 c097bff7 c01a3acb c097c27d c01a3aa0 f7872a90 00000002
>>        c01a3aa0 f7747e48 c097c2fc
>> Call Trace:
>>  [<c097bff7>] io_schedule+0x37/0x70
>>  [<c01a3acb>] sync_buffer+0x2b/0x30
>>  [<c097c27d>] __wait_on_bit+0x4d/0x80
>>  [<c01a3aa0>] sync_buffer+0x0/0x30
>>  [<c01a3aa0>] sync_buffer+0x0/0x30
>>  [<c097c2fc>] out_of_line_wait_on_bit+0x4c/0x60
>>  [<c0142340>] wake_bit_function+0x0/0x40
>>  [<c01a3a51>] __wait_on_buffer+0x21/0x30
>>  [<c0209915>] ext3_bread+0x55/0x70
>>  [<c020cff8>] ext3_find_entry+0x258/0x660
>>  [<c03a0026>] avc_has_perm+0x46/0x50
>>  [<c03a0d14>] inode_has_perm+0x44/0x80
>>  [<c020de69>] ext3_lookup+0x29/0xa0
>>  [<c0189f90>] do_lookup+0x130/0x180
>>  [<c018b540>] __link_path_walk+0x340/0xd50
>>  [<c03a0d14>] inode_has_perm+0x44/0x80
>>  [<c018bf8a>] link_path_walk+0x3a/0xa0
>>  [<c016feb4>] __do_fault+0x1a4/0x3d0
>>  [<c018c1b7>] do_path_lookup+0x77/0x210
>>  [<c018cb57>] __user_walk_fd+0x27/0x40
>>  [<c01860d5>] vfs_stat_fd+0x15/0x40
>>  [<c016feb4>] __do_fault+0x1a4/0x3d0
>>  [<c01861ef>] sys_stat64+0xf/0x30
>>  [<c0125a5d>] do_page_fault+0x2ad/0x670
>>  [<c03db6cc>] trace_hardirqs_on_thunk+0xc/0x10
>>  [<c0115a5f>] sysenter_past_esp+0x5f/0x90
>>  =======================
> 
> Sorry, I have _no_ idea what this could be. We haven't really had
> any related changes in the block layer over that short a time frame. It
> could of course have been introduced earlier, since it seems to be quite
> elusive.
> 
> Presumably any hw issues would get noticed (like a missing interrupt) and
> trigger the error handler, so it looks like this IO is still stuck in
> the queue somewhere. That mainly points the finger at AS, but given that
> you cannot reproduce it, I'm not sure how best to proceed with this...
> 

Sorry to jump in, but I just realized that this looks very similar to a
problem I reported back in the 2.6.22 timeframe; the full thread, titled

[possible regression] 2.6.22 reiserfs/libata sporadically hangs on resume
from hibernation

is here: http://marc.info/?t=118317982600002&r=1&w=2

and the stacks I managed to capture
(http://marc.info/?l=linux-kernel&m=118317970501209&w=2) end in exactly the
same __wait_on_buffer as the one Ingo posted.
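
For reference, the path both traces end in is roughly the following. This is a
from-memory paraphrase of the 2.6.x fs/buffer.c wait logic, so the details may
not match 2.6.25-rc4 line for line, but the shape should be right:

/*
 * Rough sketch (not a verbatim copy of fs/buffer.c) of how
 * __wait_on_buffer() ends up sleeping in io_schedule().
 */
#include <linux/buffer_head.h>
#include <linux/blkdev.h>
#include <linux/sched.h>
#include <linux/wait.h>

/* Action callback handed to wait_on_bit(): kick the queue, then sleep. */
static int sync_buffer(void *word)
{
	struct buffer_head *bh = container_of(word, struct buffer_head, b_state);
	struct block_device *bd;

	smp_mb();
	bd = bh->b_bdev;
	if (bd)
		blk_run_address_space(bd->bd_inode->i_mapping);
	io_schedule();		/* uninterruptible sleep -> the 'D' state in the trace */
	return 0;
}

void __wait_on_buffer(struct buffer_head *bh)
{
	/*
	 * Sleep until BH_Lock is cleared.  Only the I/O completion path
	 * (end_buffer_*() -> unlock_buffer()) clears the bit and wakes the
	 * waiter, so a completion that never arrives leaves the task here
	 * for good.
	 */
	wait_on_bit(&bh->b_state, BH_Lock, sync_buffer, TASK_UNINTERRUPTIBLE);
}

So whatever goes wrong, the read submitted underneath ext3_bread() (or
reiserfs in my case) never signals completion, which would fit the suspicion
that the request is still sitting in the queue somewhere.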

Maybe it rings a bell for someone. Oh, and the last time it happened I was
using IDE, not libata. It looks like both libata and reiserfs were red
herrings after all.

-andrey

