Message-ID: <20080511231002.GN103491721@sgi.com>
Date: Mon, 12 May 2008 09:10:02 +1000
From: David Chinner <dgc@....com>
To: Kamalesh Babulal <kamalesh@...ux.vnet.ibm.com>
Cc: pvp-lsts@...ru.acad.bg,
Alexander Beregalov <a.beregalov@...il.com>,
kernel-testers@...r.kernel.org,
kernel list <linux-kernel@...r.kernel.org>,
Ingo Molnar <mingo@...e.hu>, peterz@...radead.org,
xfs@....sgi.com, David Chinner <dgc@....com>
Subject: Re: 2.6.26-rc1: possible circular locking dependency with xfs filesystem
On Sun, May 11, 2008 at 09:18:07AM +0530, Kamalesh Babulal wrote:
> Kamalesh Babulal wrote:
> > Adding the cc to kernel-list, Ingo Molnar and Peter Zijlstra
> >
> > Alexander Beregalov wrote:
> >> [ INFO: possible circular locking dependency detected ]
> >> 2.6.26-rc1-00279-g28a4acb #13
> >> -------------------------------------------------------
> >> nfsd/3087 is trying to acquire lock:
> >> (iprune_mutex){--..}, at: [<c016f947>] shrink_icache_memory+0x38/0x19b
> >>
> >> but task is already holding lock:
> >> (&(&ip->i_iolock)->mr_lock){----}, at: [<c0210b83>] xfs_ilock+0xa2/0xd6
> >>
> >> which lock already depends on the new lock.
> >>
> >>
> >> the existing dependency chain (in reverse order) is:
> >>
> >> -> #1 (&(&ip->i_iolock)->mr_lock){----}:
> >> [<c01352e6>] __lock_acquire+0xa0c/0xbc6
> >> [<c013550a>] lock_acquire+0x6a/0x86
> >> [<c012c39a>] down_write_nested+0x33/0x6a
> >> [<c0210b5c>] xfs_ilock+0x7b/0xd6
> >> [<c0210cd5>] xfs_ireclaim+0x1d/0x59
> >> [<c022edfe>] xfs_finish_reclaim+0x173/0x195
> >> [<c0230fa3>] xfs_reclaim+0xb3/0x138
> >> [<c023b4cb>] xfs_fs_clear_inode+0x55/0x8e
> >> [<c016f60b>] clear_inode+0x83/0xd2
> >> [<c016f88a>] dispose_list+0x3c/0xc1
> >> [<c016fa82>] shrink_icache_memory+0x173/0x19b
> >> [<c014a68d>] shrink_slab+0xda/0x14e
> >> [<c014a8e5>] try_to_free_pages+0x1e4/0x2a2
> >> [<c0146997>] __alloc_pages_internal+0x23a/0x39d
> >> [<c0146b11>] __alloc_pages+0xa/0xc
> >> [<c01483b2>] __do_page_cache_readahead+0xaa/0x16a
> >> [<c01484bc>] force_page_cache_readahead+0x4a/0x74
> >> [<c014c9b0>] sys_madvise+0x308/0x400
> >> [<c0102b25>] sysenter_past_esp+0x6a/0xb1
> >> [<ffffffff>] 0xffffffff
> >>
> >> -> #0 (iprune_mutex){--..}:
> >> [<c0135203>] __lock_acquire+0x929/0xbc6
> >> [<c013550a>] lock_acquire+0x6a/0x86
> >> [<c0356a6f>] mutex_lock_nested+0xb4/0x226
> >> [<c016f947>] shrink_icache_memory+0x38/0x19b
> >> [<c014a68d>] shrink_slab+0xda/0x14e
> >> [<c014a8e5>] try_to_free_pages+0x1e4/0x2a2
> >> [<c0146997>] __alloc_pages_internal+0x23a/0x39d
> >> [<c0146b11>] __alloc_pages+0xa/0xc
> >> [<c01483b2>] __do_page_cache_readahead+0xaa/0x16a
> >> [<c014866c>] ondemand_readahead+0x119/0x127
> >> [<c01486cc>] page_cache_async_readahead+0x52/0x5d
> >> [<c0178e46>] generic_file_splice_read+0x290/0x4a8
> >> [<c0239f06>] xfs_splice_read+0x4b/0x78
> >> [<c0237713>] xfs_file_splice_read+0x24/0x29
> >> [<c0178182>] do_splice_to+0x45/0x63
> >> [<c01783f6>] splice_direct_to_actor+0xab/0x150
> >> [<c01ce8e1>] nfsd_vfs_read+0x1ed/0x2d0
> >> [<c01ced50>] nfsd_read+0x82/0x99
> >> [<c01d42bc>] nfsd3_proc_read+0xdf/0x12a
> >> [<c01cb40b>] nfsd_dispatch+0xcf/0x19e
> >> [<c033f484>] svc_process+0x3b3/0x68b
> >> [<c01cb939>] nfsd+0x168/0x26b
> >> [<c0103747>] kernel_thread_helper+0x7/0x10
> >> [<ffffffff>] 0xffffffff
Oh, yeah, that. Direct inode reclaim through memory pressure.
Effectively memory reclaim inverts locking order w.r.t. iprune_mutex
when it recurses into the filesystem. False positive - can never
cause a deadlock on XFS. Can't be solved from the XFS side of things
without effectively turning off lockdep checking for xfs inode
locking.
The fix needs to be made on the lockdep side, via iprune_mutex annotations here....
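Roughly, in userspace terms, the ordering lockdep is tripping over is the
classic A->B vs B->A pattern. A minimal sketch with pthread mutexes standing
in for iprune_mutex and the XFS inode iolock (purely illustrative - not
kernel code, all function names made up):

/*
 * Userspace analogue of the two dependency chains above.
 * "iprune" stands in for iprune_mutex, "iolock" for the XFS
 * inode iolock (ip->i_iolock).
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t iprune = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t iolock = PTHREAD_MUTEX_INITIALIZER;

/* Chain #1: reclaim holds iprune_mutex, then takes the inode lock to
 * tear the inode down (shrink_icache_memory -> dispose_list ->
 * clear_inode -> xfs_ilock). */
static void prune_icache_path(void)
{
	pthread_mutex_lock(&iprune);
	pthread_mutex_lock(&iolock);		/* iprune -> iolock */
	pthread_mutex_unlock(&iolock);
	pthread_mutex_unlock(&iprune);
}

/* Chain #0: nfsd already holds the inode iolock, allocates memory for
 * readahead, and direct reclaim recurses into shrink_icache_memory,
 * which takes iprune_mutex. */
static void splice_read_path(void)
{
	pthread_mutex_lock(&iolock);
	pthread_mutex_lock(&iprune);		/* iolock -> iprune */
	pthread_mutex_unlock(&iprune);
	pthread_mutex_unlock(&iolock);
}

int main(void)
{
	/* Exercised serially here - the point is only the ordering.
	 * In XFS, reclaim can never be pruning the inode whose iolock
	 * the reader holds, so the reported cycle is a false positive. */
	prune_icache_path();
	splice_read_path();
	printf("both lock orderings exercised\n");
	return 0;
}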
> May 9 02:16:46 nomad64 kernel: [42951853.992965] the existing dependency chain (in reverse order) is:
> May 9 02:16:46 nomad64 kernel: [42951853.992967]
> May 9 02:16:46 nomad64 kernel: [42951853.992968] -> #1 (&(&ip->i_iolock)->mr_lock){----}:
> May 9 02:16:46 nomad64 kernel: [42951853.992974] [<ffffffff80261d72>] __lock_acquire+0xf92/0x1080
> May 9 02:16:46 nomad64 kernel: [42951853.992989] [<ffffffff80261f02>] lock_acquire+0xa2/0xd0
> May 9 02:16:46 nomad64 kernel: [42951853.993002] [<ffffffff80255556>] down_write_nested+0x46/0x80
> May 9 02:16:46 nomad64 kernel: [42951853.993018] [<ffffffff80387fb9>] xfs_ilock+0x99/0xa0
> May 9 02:16:46 nomad64 kernel: [42951853.993034] [<ffffffff803a5117>] xfs_free_eofblocks+0x1c7/0x250
> May 9 02:16:46 nomad64 kernel: [42951853.993049] [<ffffffff803a8a26>] xfs_release+0x186/0x1d0
> May 9 02:16:46 nomad64 kernel: [42951853.993062] [<ffffffff803aeeb0>] xfs_file_release+0x10/0x20
> May 9 02:16:46 nomad64 kernel: [42951853.993076] [<ffffffff802a01cc>] __fput+0xcc/0x1c0
> May 9 02:16:46 nomad64 kernel: [42951853.993091] [<ffffffff802a05e6>] fput+0x16/0x20
> May 9 02:16:46 nomad64 kernel: [42951853.993105] [<ffffffff8028865a>] remove_vma+0x4a/0x80
> May 9 02:16:46 nomad64 kernel: [42951853.993120] [<ffffffff802894e1>] do_munmap+0x281/0x2e0
> May 9 02:16:46 nomad64 kernel: [42951853.993134] [<ffffffff8028958b>] sys_munmap+0x4b/0x70
> May 9 02:16:46 nomad64 kernel: [42951853.993148] [<ffffffff8020b62b>] system_call_after_swapgs+0x7b/0x80
> May 9 02:16:46 nomad64 kernel: [42951853.993161] [<ffffffffffffffff>] 0xffffffffffffffff
hmmmm. Sounds like:
    fd = open()
    addr = mmap(fd)
    close(fd)
    .....
    munmap(addr);
But yes, XFS takes locks in ->release which means.....
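i.e. something along the lines of this minimal compilable version of that
sequence (the path is made up), where the final fput() - and hence ->release -
runs from inside munmap() with mmap_sem already held:

/*
 * Userspace sketch of the suspected sequence; the path is made up.
 * After close(), the vma holds the last reference to the struct file,
 * so ->release (xfs_file_release) runs from the final fput() inside
 * munmap(), i.e. underneath mmap_sem.
 */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/xfs/somefile", O_RDONLY);
	if (fd < 0)
		return 1;

	void *addr = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		return 1;

	close(fd);		/* vma still pins the struct file */
	/* ..... */
	munmap(addr, 4096);	/* remove_vma -> fput -> ->release */
	return 0;
}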
> May 9 02:16:46 nomad64 kernel: [42951853.993293] Call Trace:
> May 9 02:16:46 nomad64 kernel: [42951853.993297] [<ffffffff8025f2b3>] print_circular_bug_tail+0x83/0x90
> May 9 02:16:46 nomad64 kernel: [42951853.993302] [<ffffffff80261b90>] __lock_acquire+0xdb0/0x1080
> May 9 02:16:46 nomad64 kernel: [42951853.993306] [<ffffffff80222bbd>] ? do_page_fault+0xdd/0x890
> May 9 02:16:46 nomad64 kernel: [42951853.993310] [<ffffffff80261f02>] lock_acquire+0xa2/0xd0
> May 9 02:16:46 nomad64 kernel: [42951853.993313] [<ffffffff80222bbd>] ? do_page_fault+0xdd/0x890
> May 9 02:16:46 nomad64 kernel: [42951853.993317] [<ffffffff806b887b>] down_read+0x3b/0x70
> May 9 02:16:46 nomad64 kernel: [42951853.993320] [<ffffffff80222bbd>] do_page_fault+0xdd/0x890
> May 9 02:16:46 nomad64 kernel: [42951853.993324] [<ffffffff806ba5dd>] error_exit+0x0/0xa9
> May 9 02:16:46 nomad64 kernel: [42951853.993328] [<ffffffff802739b6>] ? file_read_actor+0x46/0x1b0
> May 9 02:16:46 nomad64 kernel: [42951853.993331] [<ffffffff806ba3d6>] ? _read_unlock_irq+0x36/0x60
> May 9 02:16:46 nomad64 kernel: [42951853.993335] [<ffffffff80275dbc>] ? generic_file_aio_read+0x2cc/0x5d0
> May 9 02:16:46 nomad64 kernel: [42951853.993339] [<ffffffff8025ddb9>] ? get_lock_stats+0x19/0x70
> May 9 02:16:46 nomad64 kernel: [42951853.993343] [<ffffffff803b2769>] ? xfs_read+0x139/0x220
> May 9 02:16:46 nomad64 kernel: [42951853.993347] [<ffffffff803af06d>] ? xfs_file_aio_read+0x4d/0x60
> May 9 02:16:46 nomad64 kernel: [42951853.993350] [<ffffffff8029eeb1>] ? do_sync_read+0xf1/0x130
> May 9 02:16:46 nomad64 kernel: [42951853.993354] [<ffffffff802516e0>] ? autoremove_wake_function+0x0/0x40
> May 9 02:16:46 nomad64 kernel: [42951853.993358] [<ffffffff8026089a>] ? trace_hardirqs_on+0xda/0x170
> May 9 02:16:46 nomad64 kernel: [42951853.993361] [<ffffffff80272e45>] ? __rcu_read_unlock+0xb5/0xc0
> May 9 02:16:46 nomad64 kernel: [42951853.993365] [<ffffffff8026089a>] ? trace_hardirqs_on+0xda/0x170
> May 9 02:16:46 nomad64 kernel: [42951853.993369] [<ffffffff803c4381>] ? security_file_permission+0x11/0x20
> May 9 02:16:46 nomad64 kernel: [42951853.993374] [<ffffffff8029f794>] ? vfs_read+0xc4/0x160
> May 9 02:16:46 nomad64 kernel: [42951853.993377] [<ffffffff8029fc30>] ? sys_read+0x50/0x90
> May 9 02:16:46 nomad64 kernel: [42951853.993380] [<ffffffff8020b62b>] ? system_call_after_swapgs+0x7b/0x80
Oh, joy - a page fault during a read() call triggers lock order
inversions on mmap_sem. I don't think this can deadlock
(can't be page faulting in a vma that is being torn down), but
it's clear from the last trace that the VM has an mmap_sem
inversion problem with ->release vs ->read and page faults...
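FWIW, reproducing that last chain should only need a read() into a buffer
that hasn't been faulted in yet - something like this (userspace sketch
only, path made up):

#define _GNU_SOURCE	/* for MAP_ANONYMOUS on stricter toolchains */
/*
 * read() copies into a freshly mmap()ed, never-touched buffer, so the
 * copy to userspace faults while the filesystem read path already holds
 * its own locks, and the fault handler then takes mmap_sem underneath
 * them.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/xfs/somefile", O_RDONLY);
	if (fd < 0)
		return 1;

	/* Anonymous buffer that has never been touched - the first
	 * access (by the kernel copying data out) will page fault. */
	char *buf = mmap(NULL, 65536, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	ssize_t n = read(fd, buf, 65536);	/* fault taken mid-read */
	printf("read %zd bytes\n", n);

	munmap(buf, 65536);
	close(fd);
	return 0;
}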
Basically what we are seeing here in both cases is that the VM is
calling inode ->release or ->clear_inode methods with different
high-level locks held. If the filesystem has to take the same locks in
these methods as it does in, say, ->read (like XFS does), then we
are guaranteed to get reports like this. AFAICT there's nothing we
can do from the filesystem perspective to prevent false positives like
this from being reported....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group