linux-kernel - Re: shared/298 lockdep splat?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170920222256.GR18948@dastard>
Date:   Thu, 21 Sep 2017 08:22:56 +1000
From:   Dave Chinner <david@...morbit.com>
To:     "Darrick J. Wong" <darrick.wong@...cle.com>
Cc:     xfs <linux-xfs@...r.kernel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Byungchul Park <byungchul.park@....com>,
        linux-kernel@...r.kernel.org
Subject: Re: shared/298 lockdep splat?

[cc lkml, PeterZ and Byungchul]

On Wed, Sep 20, 2017 at 02:10:42PM -0700, Darrick J. Wong wrote:
> Hello list,
> 
> Yesterday I tried setting up qemu 2.10 with some (fake) nvdimms backed
> by an on-disk file.  Midway through a -g auto xfstests run, shared/298
> produced the attached dmesg spew.  I'll try to have a look later, but
> in the meantime I'm doing the 'complain to list, see if anyone bites'
> thing. :)
> 
> The kernel is 4.14-rc1 without any patches applied.

tl;dr: it's a false positive because it's conflating the independent
locking/completion contexts of upper and lower filesystems when
the loop device is in use.

I've mangled the report in replying - Darrick posted it here, too:

https://pastebin.com/GshLpH7x

> ======================================================
> WARNING: possible circular locking dependency detected
> 4.14.0-rc1-fixes #1 Tainted: G        W      
> ------------------------------------------------------
> loop0/31693 is trying to acquire lock:
>  (&(&ip->i_mmaplock)->mr_lock){++++}, at: [<ffffffffa00f1b0c>] xfs_ilock+0x23c/0x330 [xfs]
> 
> but now in release context of a crosslock acquired at the following:
>  ((complete)&ret.event){+.+.}, at: [<ffffffff81326c1f>] submit_bio_wait+0x7f/0xb0
> 
> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #2 ((complete)&ret.event){+.+.}:
>        lock_acquire+0xab/0x200
>        wait_for_completion_io+0x4e/0x1a0
>        submit_bio_wait+0x7f/0xb0
>        blkdev_issue_zeroout+0x71/0xa0
>        xfs_bmapi_convert_unwritten+0x11f/0x1d0 [xfs]
>        xfs_bmapi_write+0x374/0x11f0 [xfs]
>        xfs_iomap_write_direct+0x2ac/0x430 [xfs]
>        xfs_file_iomap_begin+0x20d/0xd50 [xfs]
>        iomap_apply+0x43/0xe0
>        dax_iomap_rw+0x89/0xf0
>        xfs_file_dax_write+0xcc/0x220 [xfs]
>        xfs_file_write_iter+0xf0/0x130 [xfs]
>        __vfs_write+0xd9/0x150
>        vfs_write+0xc8/0x1c0
>        SyS_write+0x45/0xa0
>        entry_SYSCALL_64_fastpath+0x1f/0xbe

This is on the lower filesystem - it's a write() syscall on a DAX
enable filesystem. Locking in this case is:

Context			function		lock
Lower filesystem	xfs_file_dax_write	XFS_IOLOCK_EXCL
Lower filesystem	xfs_file_iomap_begin	XFS_ILOCK_EXCL
Lower filesystem/
   lower blkdev		submit_bio_wait		wait for dax block
						device IO completion

Which gives: IOLOCK -> &xfs_nondir_ilock_class -> IO completion

In the lower filesystem.

> -> #1 (&xfs_nondir_ilock_class){++++}:
>        lock_acquire+0xab/0x200
>        down_write_nested+0x4a/0xb0
>        xfs_ilock+0x263/0x330 [xfs]
>        xfs_setattr_size+0x152/0x370 [xfs]
>        xfs_vn_setattr+0x6b/0x90 [xfs]
>        notify_change+0x27d/0x3f0
>        do_truncate+0x5b/0x90
>        path_openat+0x237/0xa90
>        do_filp_open+0x8a/0xf0
>        do_sys_open+0x11c/0x1f0
>        entry_SYSCALL_64_fastpath+0x1f/0xbe

Context			function		lock
Lower filesystem	xfs_vn_setattr		XFS_IOLOCK_EXCL
Lower filesystem	xfs_vn_setattr		XFS_MMAPLOCK_EXCL
Lower filesystem	xfs_setattr_size	XFS_ILOCK_EXCL

Which gives: IOLOCK -> MMAPLOCK -> &xfs_nondir_ilock_class

i.e.: IOLOCK -> MMAPLOCK -> &xfs_nondir_ilock_class -> IO completion

In the lower filesystem.

> 
> -> #0 (&(&ip->i_mmaplock)->mr_lock){++++}:
>        up_write+0x1c/0x40
>        xfs_iunlock+0x1d0/0x310 [xfs]
>        xfs_file_fallocate+0x8a/0x310 [xfs]
>        loop_queue_work+0xb7/0x8d0
>        kthread_worker_fn+0xb9/0x1f0

Context			function		lock
Lower filesystem	xfs_file_fallocate	XFS_IOLOCK_EXCL
Lower filesystem	xfs_file_fallocate	XFS_MMAPLOCK_EXCL
[Lower filesystem	whatever falloc calls	XFS_ILOCK_EXCL]
Upper filesystem/
   Upper blkd		loop_queue_work		blk_mq_complete_request

This is clearer when the reported stack is looked at:

> lock_commit_crosslock+0x3e9/0x5c0
> complete+0x24/0x60
> blk_update_request+0xc2/0x3e0
> blk_mq_end_request+0x18/0x80
> __blk_mq_complete_request+0x9f/0x170
> loop_queue_work+0x51/0x8d0
> kthread_worker_fn+0xb9/0x1f0
> kthread+0x148/0x180
> ret_from_fork+0x2a/0x40

So this is reporting as:

> Chain exists of:
>   &(&ip->i_mmaplock)->mr_lock --> &xfs_nondir_ilock_class --> (complete)&ret.event
>
>  Possible unsafe locking scenario by crosslock:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&xfs_nondir_ilock_class);
>   lock((complete)&ret.event);
>                                lock(&(&ip->i_mmaplock)->mr_lock);
>                                unlock((complete)&ret.event);
>
>                *** DEADLOCK ***

Which effectively means:

	IO completion -> XFS_MMAPLOCK_EXCL .....

The thing is, this IO completion has nothing to do with the lower
filesystem - it's the IO completion for the filesystem on the loop
device (the upper filesystem) and is not in any way related to the
IO completion from the dax device the lower filesystem is waiting
on.

IOWs, this is a false positive.

Peter, this is the sort of false positive I mentioned were likely to
occur without some serious work to annotate the IO stack to prevent
them.  We can nest multiple layers of IO completions and locking in
the IO stack via things like loop and RAID devices.  They can be
nested to arbitrary depths, too (e.g. loop on fs on loop on fs on
dm-raid on n * (loop on fs) on bdev) so this new completion lockdep
checking is going to be a source of false positives until there is
an effective (and simple!) way of providing context based completion
annotations to avoid them...

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com