linux-kernel - Re: [2.6.26-rc7] shrink_icache from pagefault locking (nee: nfsd hangs for a few sec)...

Message-ID: <20080623002415.GB21597@csn.ul.ie>
Date:	Mon, 23 Jun 2008 01:24:15 +0100
From:	Mel Gorman <mel@....ul.ie>
To:	Daniel J Blueman <daniel.blueman@...il.com>,
	Christoph Lameter <clameter@....com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Alexander Beregalov <a.beregalov@...il.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>, xfs@....sgi.com
Subject: Re: [2.6.26-rc7] shrink_icache from pagefault locking (nee: nfsd hangs for a few sec)...

On (23/06/08 08:19), Dave Chinner didst pronounce:
> [added xfs@....sgi.com to cc]
> 
> On Sun, Jun 22, 2008 at 10:58:56AM +0100, Daniel J Blueman wrote:
> > I'm seeing a similar issue [2] to what was recently reported [1] by
> > Alexander, but with another workload involving XFS and memory
> > pressure.
> > 
> > SLUB allocator is in use and config is at http://quora.org/config-client-debug .
> > 
> > Let me know if you'd like more details/vmlinux objdump etc.
> > 
> > Thanks,
> >  Daniel
> > 
> > --- [1]
> > 
> > http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/e673c9173d45a735/db9213ef39e4e11c
> > 
> > --- [2]
> > 
> > =======================================================
> > [ INFO: possible circular locking dependency detected ]
> > 2.6.26-rc7-210c #2
> > -------------------------------------------------------
> > AutopanoPro/4470 is trying to acquire lock:
> >  (iprune_mutex){--..}, at: [<ffffffff802d94fd>] shrink_icache_memory+0x7d/0x290
> > 
> > but task is already holding lock:
> >  (&mm->mmap_sem){----}, at: [<ffffffff805e3e15>] do_page_fault+0x255/0x890
> > 
> > which lock already depends on the new lock.
> > 
> > 
> > the existing dependency chain (in reverse order) is:
> > 
> > -> #2 (&mm->mmap_sem){----}:
> >       [<ffffffff80278f4d>] __lock_acquire+0xbdd/0x1020
> >       [<ffffffff802793f5>] lock_acquire+0x65/0x90
> >       [<ffffffff805df5ab>] down_read+0x3b/0x70
> >       [<ffffffff805e3e3c>] do_page_fault+0x27c/0x890
> >       [<ffffffff805e16cd>] error_exit+0x0/0xa9
> >       [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > -> #1 (&(&ip->i_iolock)->mr_lock){----}:
> >       [<ffffffff80278f4d>] __lock_acquire+0xbdd/0x1020
> >       [<ffffffff802793f5>] lock_acquire+0x65/0x90
> >       [<ffffffff8026d746>] down_write_nested+0x46/0x80
> >       [<ffffffff8039df29>] xfs_ilock+0x99/0xa0
> >       [<ffffffff8039e0cf>] xfs_ireclaim+0x3f/0x90
> >       [<ffffffff803ba889>] xfs_finish_reclaim+0x59/0x1a0
> >       [<ffffffff803bc199>] xfs_reclaim+0x109/0x110
> >       [<ffffffff803c9541>] xfs_fs_clear_inode+0xe1/0x110
> >       [<ffffffff802d906d>] clear_inode+0x7d/0x110
> >       [<ffffffff802d93aa>] dispose_list+0x2a/0x100
> >       [<ffffffff802d96af>] shrink_icache_memory+0x22f/0x290
> >       [<ffffffff8029d868>] shrink_slab+0x168/0x1d0
> >       [<ffffffff8029e0b6>] kswapd+0x3b6/0x560
> >       [<ffffffff8026921d>] kthread+0x4d/0x80
> >       [<ffffffff80227428>] child_rip+0xa/0x12
> >       [<ffffffffffffffff>] 0xffffffffffffffff
> 
> You may as well ignore anything involving this path in XFS until
> lockdep gets fixed. The kswapd reclaim path is inverted over the
> synchronous reclaim path that is xfs_ilock -> run out of memory ->
> prune_icache and then potentially another -> xfs_ilock.
> 

In that case, have you any theory as to why this circular dependency is
being reported now but wasn't before 2.6.26-rc1? I'm beginning to wonder
if the bisection fingering the zonelist modification is just a
co-incidence.

Also, do you think the stalls were happening before but just not being noticed?

> In this case, XFS can *never* deadlock because the second xfs_ilock
> is on a different, unreferenced, unlocked inode, but without turning
> off lockdep there is nothing in XFS that can be done to prevent
> this warning.
> 
> There is a similar bug in the VM w.r.t the mmap_sem in that the
> mmap_sem is held across a call to put_filp() which can result in
> inversions between the xfs_ilock and mmap_sem.
> 
> Neither of these cases can be solved by changing XFS - lockdep
> needs to be made aware of paths that can invert normal locking
> order (like prune_icache) so it doesn't give false positives
> like this.
> 
> > -> #0 (iprune_mutex){--..}:
> >       [<ffffffff80278db7>] __lock_acquire+0xa47/0x1020
> >       [<ffffffff802793f5>] lock_acquire+0x65/0x90
> >       [<ffffffff805dedd5>] mutex_lock_nested+0xb5/0x300
> >       [<ffffffff802d94fd>] shrink_icache_memory+0x7d/0x290
> >       [<ffffffff8029d868>] shrink_slab+0x168/0x1d0
> >       [<ffffffff8029db38>] try_to_free_pages+0x268/0x3a0
> >       [<ffffffff802979d6>] __alloc_pages_internal+0x206/0x4b0
> >       [<ffffffff80297c89>] __alloc_pages_nodemask+0x9/0x10
> >       [<ffffffff802b2bc2>] alloc_page_vma+0x72/0x1b0
> >       [<ffffffff802a3642>] handle_mm_fault+0x462/0x7b0
> >       [<ffffffff805e3ecc>] do_page_fault+0x30c/0x890
> >       [<ffffffff805e16cd>] error_exit+0x0/0xa9
> >       [<ffffffffffffffff>] 0xffffffffffffffff
> 
> This case is different in that it's complaining about mmap_sem vs
> iprune_mutex, so I think that we can pretty much ignore the XFS side
> of things here - the problem is higher level code....
> 
> >  [<ffffffff8029db38>] try_to_free_pages+0x268/0x3a0
> >  [<ffffffff8029c240>] ? isolate_pages_global+0x0/0x40
> >  [<ffffffff802979d6>] __alloc_pages_internal+0x206/0x4b0
> >  [<ffffffff80297c89>] __alloc_pages_nodemask+0x9/0x10
> >  [<ffffffff802b2bc2>] alloc_page_vma+0x72/0x1b0
> >  [<ffffffff802a3642>] handle_mm_fault+0x462/0x7b0
> 
> FWIW, should page allocation in a page fault be allowed to recurse
> into the filesystem? If I follow the spaghetti of inline and
> compiler inlined functions correctly, this is a GFP_HIGHUSER_MOVABLE
> allocation, right? Should we be allowing shrink_icache_memory()
> to be called at all in the page fault path?
> 

Well, the page fault path is able to go to sleep and can enter direct
reclaim under low memory situations. Right now, I'm failing to see why a
page fault should not be allowed to reclaim pages in use by a
filesystem. It was allowed before, so the question is still why the
circular lock warning appears now but didn't before.

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@...morbit.com
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab