Message-ID: <AANLkTi=45J+=ytj=NC4bYTVJFRakNQQiFd35K7mrf+Gq@mail.gmail.com>
Date: Sat, 25 Sep 2010 15:08:11 +0200
From: Torsten Kaiser <just.for.lkml@...glemail.com>
To: Christoph Hellwig <hch@...radead.org>
Cc: Dave Chinner <david@...morbit.com>,
Yang Ruirui <ruirui.r.yang@...to.com>,
Andrew Morton <akpm@...ux-foundation.org>, xfs@....sgi.com,
linux-kernel@...r.kernel.org, Alex Elder <aelder@....com>,
Tejun Heo <tj@...nel.org>
Subject: Re: -mm: xfs lockdep warning
[adding Tejun Heo, because I'm partly blaming his workqueues]
On Mon, Sep 20, 2010 at 9:13 PM, Christoph Hellwig <hch@...radead.org> wrote:
> On Fri, Sep 17, 2010 at 10:52:27AM +1000, Dave Chinner wrote:
>> Christoph, this implies an inode that has been marked for reclaim
>> that has not passed through xfs_fs_evict_inode() after being
>> initialised. If it went through the eviction process, the iolock
>> would have been re-initialised to a different context. Can you think
>> of any path that can get here without going through ->evict? I can't
>> off the top of my head...
>
> I think this could happen if the init_inode_always during
> re-initialization of an inode in reclaim fails in iget. I have a patch
> to add that I'll run through xfsqa. It should only happen very rarely.
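To make sure I understand the suspected path: on a cache hit in xfs_iget, an inode that is parked for reclaim gets re-initialised via inode_init_always() (which I take "init_inode_always" to mean), and if that re-initialisation fails, the inode has to be handed back to reclaim, otherwise it is left half-initialised in the cache and the next lookup trips over its stale lock state. A toy userspace model of that invariant (all names here are mine; this is neither the kernel code nor Christoph's patch):

	#include <stdbool.h>
	#include <stdio.h>

	enum istate { I_ACTIVE, I_RECLAIMABLE };

	struct toy_inode {
		enum istate	state;
		bool		lock_class_valid;	/* stands in for the iolock lockdep class */
	};

	/* stands in for inode_init_always(); may fail under memory pressure */
	static int toy_init_always(struct toy_inode *ip, bool simulate_enomem)
	{
		if (simulate_enomem)
			return -1;			/* -ENOMEM in the real code */
		ip->lock_class_valid = true;
		return 0;
	}

	/* cache-hit path of a toy iget */
	static int toy_iget_cache_hit(struct toy_inode *ip, bool simulate_enomem)
	{
		if (ip->state != I_RECLAIMABLE)
			return 0;			/* normal cache hit, nothing to redo */

		ip->lock_class_valid = false;		/* torn down on the way into reclaim */
		if (toy_init_always(ip, simulate_enomem)) {
			/*
			 * The crucial step: return the inode to the reclaim
			 * list instead of leaving it half-initialised.
			 */
			ip->state = I_RECLAIMABLE;
			return -1;
		}
		ip->state = I_ACTIVE;
		return 0;
	}

	int main(void)
	{
		struct toy_inode ip = { .state = I_RECLAIMABLE, .lock_class_valid = false };

		if (toy_iget_cache_hit(&ip, true))
			printf("re-init failed, inode handed back to reclaim: %s\n",
			       ip.state == I_RECLAIMABLE ? "yes" : "no");
		return 0;
	}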
I had the same lockdep report in mainline 2.6.36-rc5 when I enabled
lockdep to try to debug a hang under high load and some memory
pressure.
Do you have a patch I could try to get rid of this lockdep report, to
see if there is another one lurking behind it?
As for the hang: Today I wanted to install updates on my Gentoo system,
and that included an update of koffice. Because I allow emerge
(Gentoo's package manager) to process 4 packages in parallel and
specify -j5 for each of the compiles, emerge decided to build 4
koffice programs in parallel, which resulted in a load of ~18 on my
4-core system. With all these gcc processes running and the compiles
happening on a tmpfs, this also resulted in some memory pressure.
(But not in a swap storm.)
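For reference, the knobs involved are roughly these (a sketch of the relevant make.conf lines, quoted from memory, not my exact file):

	# /etc/portage/make.conf
	EMERGE_DEFAULT_OPTS="--jobs=4"   # let emerge build 4 packages at once
	MAKEOPTS="-j5"                   # 5 make jobs per package

4 packages times 5 make jobs is up to 20 concurrent compiler processes, which matches the load of ~18.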
The first time I got this hang, I was not able to capture useful
information; the second time it happened, I got a report from the hung
task timeout:
3 processes were blocked in:
[ 1203.056532] [<ffffffff810801c0>] ? sync_page+0x0/0x50
[ 1203.061836] [<ffffffff8156078d>] ? io_schedule+0x3d/0x60
[ 1203.067390] [<ffffffff810801fd>] ? sync_page+0x3d/0x50
probably because of these:
[ 1202.872424] INFO: task plasma-desktop:3169 blocked for more than 120 seconds.
[ 1202.879883] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1202.888118] ffff88011ef860b0 0000000000000086 000000011ef860b0 ffff880114faff48
[ 1202.895936] ffff8800071cf1c0 0000000000012780 ffff880114faffd8 0000000000012780
[ 1202.903777] ffff880114faffd8 ffff88011ef86310 ffff880114faffd8 ffff88011ef86308
[ 1202.911586] Call Trace:
[ 1202.914140] [<ffffffff815612c1>] ? __mutex_lock_slowpath+0xe1/0x160
[ 1202.920693] [<ffffffff810d1e57>] ? __d_lookup+0x97/0x120
[ 1202.926272] [<ffffffff81560d8a>] ? mutex_lock+0x1a/0x40
[ 1202.931783] [<ffffffff810c9dbf>] ? do_lookup+0x10f/0x190
[ 1202.937426] [<ffffffff810c7e24>] ? acl_permission_check+0x54/0xb0
[ 1202.943898] [<ffffffff810ca8d9>] ? link_path_walk+0x4c9/0x9a0
[ 1202.950028] [<ffffffff810d65f4>] ? mnt_want_write+0x34/0x70
[ 1202.955947] [<ffffffff810caecf>] ? path_walk+0x5f/0xe0
[ 1202.961416] [<ffffffff810caf9b>] ? do_path_lookup+0x4b/0x50
[ 1202.967325] [<ffffffff810cbc35>] ? user_path_at+0x55/0xb0
[ 1202.973020] [<ffffffff8102e860>] ? __dequeue_entity+0x40/0x50
[ 1202.979054] [<ffffffff8107bcf4>] ? perf_event_task_sched_out+0x44/0x260
[ 1202.985995] [<ffffffff810c2e46>] ? vfs_fstatat+0x36/0x80
[ 1202.991600] [<ffffffff815602ac>] ? schedule+0x24c/0x6f0
[ 1202.997157] [<ffffffff810c2f8f>] ? sys_newstat+0x1f/0x50
[ 1203.002784] [<ffffffff81003035>] ? device_not_available+0x15/0x20
[ 1203.009204] [<ffffffff8100246b>] ? system_call_fastpath+0x16/0x1b
[ 1202.400046] INFO: task kworker/u:8:845 blocked for more than 120 seconds.
[ 1202.407128] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1202.415372] ffff88007ffa8b60 0000000000000046 ffff88007f74fe98 0000000000000000
[ 1202.423229] ffff8800071cf770 0000000000012780 ffff8800070e9fd8 0000000000012780
[ 1202.431054] ffff8800070e9fd8 ffff88007ffa8dc0 ffff8800070e9fd8 ffff88007ffa8db8
[ 1202.438898] Call Trace:
[ 1202.441495] [<ffffffff814361ed>] ? md_write_start+0x9d/0x190
[ 1202.447536] [<ffffffff81052e70>] ? autoremove_wake_function+0x0/0x30
[ 1202.454311] [<ffffffff810925ca>] ? zone_nr_free_pages+0x9a/0xb0
[ 1202.460639] [<ffffffff8142c423>] ? make_request+0x23/0x800
[ 1202.466496] [<ffffffff811d90ea>] ? blkcipher_walk_done+0x8a/0x230
[ 1202.472990] [<ffffffff811df4c4>] ? crypto_cbc_encrypt+0xe4/0x180
[ 1202.479327] [<ffffffff8102a150>] ? twofish_encrypt+0x0/0x10
[ 1202.485276] [<ffffffff81431ac3>] ? md_make_request+0xc3/0x220
[ 1202.491412] [<ffffffff811eafca>] ? generic_make_request+0x18a/0x330
[ 1202.498096] [<ffffffff814484fc>] ? crypt_convert+0x25c/0x310
[ 1202.504090] [<ffffffff81448881>] ? kcryptd_crypt+0x2d1/0x3f0
[ 1202.510031] [<ffffffff814485b0>] ? kcryptd_crypt+0x0/0x3f0
[ 1202.516081] [<ffffffff8104d2eb>] ? process_one_work+0xfb/0x370
[ 1202.522123] [<ffffffff8104ee4c>] ? worker_thread+0x16c/0x360
[ 1202.527935] [<ffffffff8104ece0>] ? worker_thread+0x0/0x360
[ 1202.533541] [<ffffffff8104ece0>] ? worker_thread+0x0/0x360
[ 1202.539131] [<ffffffff810529e6>] ? kthread+0x96/0xa0
[ 1202.544245] [<ffffffff81003194>] ? kernel_thread_helper+0x4/0x10
[ 1202.550363] [<ffffffff81052950>] ? kthread+0x0/0xa0
[ 1202.555346] [<ffffffff81003190>] ? kernel_thread_helper+0x0/0x10
After that I enabled lockdep and retried the same update: this time no
hang, only the same lockdep report that Yang Ruirui had.
So I'm currently at a loss as to how to continue from here. The XFS
false positive drowns out any other lockdep problems, but XFS has had
some hang problems with the new workqueues. (Personally I saw no other
hangs with earlier 2.6.36-rcs.)
As seen in the call trace from the kworker, there are more workqueues
involved; that's why I added Tejun to the CC. My root filesystem is XFS
on dm-crypt on an md/raid1, so it might have been something in that stack.
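To make the workqueue angle concrete: with the new implementation, all work items, including dm-crypt's kcryptd as seen above, run on shared kworker threads, and as far as I understand only queues created with a rescuer are guaranteed to make forward progress when no new worker can be forked under memory pressure. A minimal sketch of what I mean, assuming the 2.6.36 alloc_workqueue() API (WQ_RESCUER is the 2.6.36 flag name, later renamed WQ_MEM_RECLAIM; this fragment is illustrative, not a complete module or a patch):

	#include <linux/errno.h>
	#include <linux/workqueue.h>

	static struct workqueue_struct *io_wq;

	static int example_setup(void)
	{
		/*
		 * A queue whose work items sit in the IO/reclaim path needs a
		 * rescuer thread; otherwise those items can wait indefinitely
		 * for a kworker while the existing kworkers are themselves
		 * blocked on IO, as in the trace above.
		 */
		io_wq = alloc_workqueue("example_io", WQ_RESCUER, 1);
		if (!io_wq)
			return -ENOMEM;
		return 0;
	}

I don't know whether any of the queues in this stack actually lack that flag; the sketch is only to state the failure mode I'm worried about.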
If you need more information, just ask and I will try to provide it.
Torsten