Date:   Wed, 23 Aug 2017 10:56:54 +0900
From:   Byungchul Park <byungchul.park@....com>
To:     Dave Chinner <david@...morbit.com>
Cc:     Peter Zijlstra <peterz@...radead.org>, mingo@...nel.org,
        linux-kernel@...r.kernel.org, kernel-team@....com,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Tejun Heo <tj@...nel.org>
Subject: Re: [PATCH v3 1/3] lockdep: Make LOCKDEP_CROSSRELEASE configs all
 part of PROVE_LOCKING

On Tue, Aug 22, 2017 at 03:46:03PM +1000, Dave Chinner wrote:
> On Mon, Aug 21, 2017 at 05:46:00PM +0200, Peter Zijlstra wrote:
> > 
> > 
> > Booting the very latest -tip on my test machine gets me the below splat.
> > 
> > Dave, TJ, FYI, lockdep grew annotations for completions; it remembers
> > which locks were taken before we complete() and checks none of those are
> > held while we wait_for_completion().
> > 
> > That is, something like:
> > 
> > 					mutex_lock(&A);
> > 					mutex_unlock(&A);
> > 					complete(&C);
> > 
> > 	mutex_lock(&A);
> > 	wait_for_completion(&C);
> 
> Ok, that seems reasonable...
> 
> > Is now considered a problem. Note that in order for C to link to A it
> > needs to have observed the complete() thread acquiring it after a
> > wait_for_completion(), something like:
> > 
> > 	wait_for_completion(&C)
> > 					mutex_lock(&A);
> > 					mutex_unlock(&A);
> > 					complete(&C);
> 
> Sure.
> 
> > 
> > That is, only locks observed taken between C's wait_for_completion() and
> > its complete() are considered.
> > 
> > Now given the above observance rule and the fact that the below report
> > is from the complete() side, the thing that happened appears to be:
> > 
> > 
> > 	lockdep_map_acquire(&work->lockdep_map)
> > 	down_write(&A)
> > 
> > 			down_write(&A)
> > 			wait_for_completion(&C)
> > 
> > 					lockdep_map_acquire(&work->lockdep_map);
> > 					complete(&C)
> > 
> > Which lockdep then puked over because both sides saw the same work
> > class.
> 
> Let me get this straight, first. If we:
> 
> 	1. take a lock in a work queue context; and
> 	2. in a separate context, hold that lock while we wait for a
> 	completion; and
> 	3. Run the completion from the original workqueue where we
> 	/once/ held that lock
> 
> lockdep will barf?

Do you mean the following?

For example:

A work in a workqueue
---------------------
mutex_lock(&A)

A task
------
mutex_lock(&A)
wait_for_completion(&B)

Another work in the workqueue
-----------------------------
complete(&B)

Is that the scenario you mean?

In this case, lockdep (including crossrelease) does not barf, as long as
all manual lockdep_map acquisitions are used correctly.
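
To make that concrete, here is a minimal, self-contained sketch of the three
contexts above, written as a hypothetical kernel module (all names are
illustrative; this is not XFS code). Whether lockdep complains about it
depends on how the work and workqueue lockdep_maps are classed, which is
exactly the point under discussion:

#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/completion.h>
#include <linux/workqueue.h>

static DEFINE_MUTEX(A);
static DECLARE_COMPLETION(B);

/* "A work in a workqueue": takes and releases A. */
static void work1_fn(struct work_struct *work)
{
	mutex_lock(&A);
	mutex_unlock(&A);
}

/* "Another work in the workqueue": signals the waiter. */
static void work2_fn(struct work_struct *work)
{
	complete(&B);
}

static DECLARE_WORK(work1, work1_fn);
static DECLARE_WORK(work2, work2_fn);

/* "A task": holds A while waiting for B.  No real deadlock here, since
 * system_wq runs the two works on concurrent workers. */
static int __init demo_init(void)
{
	schedule_work(&work1);

	mutex_lock(&A);
	schedule_work(&work2);
	wait_for_completion(&B);
	mutex_unlock(&A);

	return 0;
}

static void __exit demo_exit(void)
{
	flush_scheduled_work();
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");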

> IIUC, that's a problem because XFS does this all over the place.
> Let me pull the trace apart in reverse order to explain why XFS is
> going to throw endless false positives on this:
> 
> > [   39.159708] -> #0 ((&ioend->io_work)){+.+.}:
> > [   39.166126]        process_one_work+0x244/0x6b0
> > [   39.171184]        worker_thread+0x48/0x3f0
> > [   39.175845]        kthread+0x147/0x180
> > [   39.180020]        ret_from_fork+0x2a/0x40
> > [   39.184593]        0xffffffffffffffff
> 
> That's a data IO completion, which will take an inode lock if we
> have to run a transaction to update inode size or convert an
> unwritten extent.
> 
> > [   39.100611] -> #1 (&xfs_nondir_ilock_class){++++}:
> > [   39.107612]        __lock_acquire+0x10a1/0x1100
> > [   39.112660]        lock_acquire+0xea/0x1f0
> > [   39.117224]        down_write_nested+0x26/0x60
> > [   39.122184]        xfs_ilock+0x166/0x220
> > [   39.126563]        __xfs_setfilesize+0x30/0x200
> > [   39.131612]        xfs_setfilesize_ioend+0x7f/0xb0
> > [   39.136958]        xfs_end_io+0x49/0xf0
> > [   39.141240]        process_one_work+0x273/0x6b0
> > [   39.146288]        worker_thread+0x48/0x3f0
> > [   39.150960]        kthread+0x147/0x180
> > [   39.155146]        ret_from_fork+0x2a/0x40
> 
> That's the data IO completion locking the inode inside a transaction
> to update the inode size inside a workqueue.
> 
> 
> > [   38.962397] -> #2 ((complete)&bp->b_iowait){+.+.}:
> > [   38.969401]        __lock_acquire+0x10a1/0x1100
> > [   38.974460]        lock_acquire+0xea/0x1f0
> > [   38.979036]        wait_for_completion+0x3b/0x130
> > [   38.984285]        xfs_buf_submit_wait+0xb2/0x590
> > [   38.989535]        _xfs_buf_read+0x23/0x30
> > [   38.994108]        xfs_buf_read_map+0x14f/0x320
> > [   38.999169]        xfs_trans_read_buf_map+0x170/0x5d0
> > [   39.004812]        xfs_read_agf+0x97/0x1d0
> > [   39.009386]        xfs_alloc_read_agf+0x60/0x240
> > [   39.014541]        xfs_alloc_fix_freelist+0x32a/0x3d0
> > [   39.020180]        xfs_free_extent_fix_freelist+0x6b/0xa0
> > [   39.026206]        xfs_free_extent+0x48/0x120
> > [   39.031068]        xfs_trans_free_extent+0x57/0x200
> > [   39.036512]        xfs_extent_free_finish_item+0x26/0x40
> > [   39.042442]        xfs_defer_finish+0x174/0x770
> > [   39.047494]        xfs_itruncate_extents+0x126/0x470
> > [   39.053035]        xfs_setattr_size+0x2a1/0x310
> > [   39.058093]        xfs_vn_setattr_size+0x57/0x160
> > [   39.063342]        xfs_vn_setattr+0x50/0x70
> > [   39.068015]        notify_change+0x2ee/0x420
> > [   39.072785]        do_truncate+0x5d/0x90
> > [   39.077165]        path_openat+0xab2/0xc00
> > [   39.081737]        do_filp_open+0x8a/0xf0
> > [   39.086213]        do_sys_open+0x123/0x200
> > [   39.090787]        SyS_open+0x1e/0x20
> > [   39.094876]        entry_SYSCALL_64_fastpath+0x23/0xc2
> 
> And that's waiting for a metadata read IO to complete during a
> truncate transaction. This has no direct connection to the inode at
> all - it's an allocation group header that we have to lock to do
> block allocation. The completion it is waiting on doesn't even run
> through the same workqueue as the ioends - ioends go through
> mp->m_data_workqueue or mp->m_unwritten_workqueue, metadata buffers
> go through mp->m_buf_workqueue or mp->m_log_workqueue.

In this case, the lockdep_map class for the former and the one for the
latter must be different from each other. Not only the classes for the
workqueues but also the classes for the works must be different; currently
they can be the same, which is wrong IMHO. That is what we should fix.

Or, if this behaviour is intended by the workqueue code, then the xfs code
should be fixed so that different work items, not only different
workqueues, are used for different kinds of work.
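
As a rough illustration of what I mean (all identifiers below are made up,
not actual XFS code), keeping the two kinds of completion work on separate
alloc_workqueue() and INIT_WORK() call sites should, as far as I
understand, give them separate lockdep classes under CONFIG_LOCKDEP, so a
dependency recorded against one does not implicate the other:

#include <linux/errno.h>
#include <linux/workqueue.h>

/* Hypothetical per-mount structure; names are illustrative only. */
struct my_mount {
	struct workqueue_struct *data_wq;	/* data IO completions */
	struct workqueue_struct *buf_wq;	/* metadata buffer completions */
};

static void data_end_work(struct work_struct *work)
{
	/* end-of-data-IO processing would go here */
}

static void buf_end_work(struct work_struct *work)
{
	/* metadata buffer completion processing would go here */
}

static int my_mount_init_workqueues(struct my_mount *mp)
{
	/* Each alloc_workqueue() call site gets its own lockdep class. */
	mp->data_wq = alloc_workqueue("my-data", WQ_MEM_RECLAIM, 0);
	mp->buf_wq = alloc_workqueue("my-buf", WQ_MEM_RECLAIM, 0);
	if (!mp->data_wq || !mp->buf_wq) {
		if (mp->data_wq)
			destroy_workqueue(mp->data_wq);
		if (mp->buf_wq)
			destroy_workqueue(mp->buf_wq);
		return -ENOMEM;
	}
	return 0;
}

/* Separate INIT_WORK() call sites => separate work lockdep classes. */
static void queue_data_end(struct my_mount *mp, struct work_struct *work)
{
	INIT_WORK(work, data_end_work);
	queue_work(mp->data_wq, work);
}

static void queue_buf_end(struct my_mount *mp, struct work_struct *work)
{
	INIT_WORK(work, buf_end_work);
	queue_work(mp->buf_wq, work);
}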

> So from that perspective, an ioend blocked on an inode lock should
> not ever prevent metadata buffer completions from being run. Hence
> I'd call this a false positive at best, though I think it really
> indicates a bug in the new lockdep code because it isn't

IMHO, it's not a bug in lockdep but in the lockdep_map_acquire() users.

> discriminating between different workqueue contexts properly.

We have to choose one of the following:

   1. Make the lock classes of works in different workqueues different
   2. Make xfs use different works and workqueues for different kinds of
      work

IMHO, we should choose 1 (a rough sketch of what that might mean follows
below), but I'm not sure, since I'm not familiar with the workqueue code.
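
Purely as an illustration of option 1 (the names keyed_wq and
keyed_queue_work are hypothetical, not an existing workqueue API), the idea
would be to re-key a work's lockdep_map with a per-workqueue lock_class_key
when it is queued, so that works queued on different workqueues fall into
different lock classes regardless of their INIT_WORK() call site:

#include <linux/workqueue.h>
#include <linux/lockdep.h>

/* Hypothetical wrapper: a workqueue plus its own key for its work items. */
struct keyed_wq {
	struct workqueue_struct	*wq;
	struct lock_class_key	work_key;	/* one work class per workqueue */
	const char		*name;
};

static bool keyed_queue_work(struct keyed_wq *kwq, struct work_struct *work,
			     work_func_t fn)
{
	INIT_WORK(work, fn);
#ifdef CONFIG_LOCKDEP
	/*
	 * Re-key this work's lockdep_map to the workqueue's class.
	 * Note: lockdep expects lock_class_key objects to live in static
	 * storage, so keyed_wq instances must be static as below.
	 */
	lockdep_init_map(&work->lockdep_map, kwq->name, &kwq->work_key, 0);
#endif
	return queue_work(kwq->wq, work);
}

/* Usage: the wrappers (and thus the keys) live in static storage;
 * each .wq would be set up with alloc_workqueue() at init time. */
static struct keyed_wq data_kwq = { .name = "data-works" };
static struct keyed_wq buf_kwq = { .name = "buf-works" };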

> Even if I ignore the fact that buffer completions are run on
> different workqueues, there seems to be a bigger problem with this
> sort of completion checking.

It does not work the way you are worried it does.
