linux-kernel - Re: [PATCH v3 1/3] lockdep: Make LOCKDEP_CROSSRELEASE configs all part of PROVE

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170822090603.vufl5ncoequ5ag6x@hirez.programming.kicks-ass.net>
Date:   Tue, 22 Aug 2017 11:06:03 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Dave Chinner <david@...morbit.com>
Cc:     Byungchul Park <byungchul.park@....com>, mingo@...nel.org,
        linux-kernel@...r.kernel.org, kernel-team@....com,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Tejun Heo <tj@...nel.org>, Boqun Feng <boqun.feng@...il.com>
Subject: Re: [PATCH v3 1/3] lockdep: Make LOCKDEP_CROSSRELEASE configs all
 part of PROVE_LOCKING

On Tue, Aug 22, 2017 at 03:46:03PM +1000, Dave Chinner wrote:
> On Mon, Aug 21, 2017 at 05:46:00PM +0200, Peter Zijlstra wrote:

> Let me get this straight, first. If we:
> 
> 	1. take a lock in a work queue context; and
> 	2. in a separate context, hold that lock while we wait for a
> 	completion; and
> 	3. Run the completion from the original workqueue where we
> 	/once/ held that lock
> 
> lockdep will barf?
> 
> IIUC, that's a problem because XFS does this all over the place.
> Let me pull the trace apart in reverse order to explain why XFS is
> going to throw endless false positives on this:

Yeah, and I agree that's not right or desired. Just trying to figure out
how to go about fixing that.

Because, with possible exception for the single threaded workqueues,
there is no actual deadlock there unless its the very same work
instance. Its the classic instance vs class thing.

And splitting up work classes is going to be a nightmare too.

> > [   39.159708] -> #0 ((&ioend->io_work)){+.+.}:
> > [   39.166126]        process_one_work+0x244/0x6b0
> > [   39.171184]        worker_thread+0x48/0x3f0
> > [   39.175845]        kthread+0x147/0x180
> > [   39.180020]        ret_from_fork+0x2a/0x40
> > [   39.184593]        0xffffffffffffffff
> 
> That's a data IO completion, which will take an inode lock if we
> have to run a transaction to update inode size or convert an
> unwritten extent.
> 
> > [   39.100611] -> #1 (&xfs_nondir_ilock_class){++++}:
> > [   39.107612]        __lock_acquire+0x10a1/0x1100
> > [   39.112660]        lock_acquire+0xea/0x1f0
> > [   39.117224]        down_write_nested+0x26/0x60
> > [   39.122184]        xfs_ilock+0x166/0x220
> > [   39.126563]        __xfs_setfilesize+0x30/0x200
> > [   39.131612]        xfs_setfilesize_ioend+0x7f/0xb0
> > [   39.136958]        xfs_end_io+0x49/0xf0
> > [   39.141240]        process_one_work+0x273/0x6b0
> > [   39.146288]        worker_thread+0x48/0x3f0
> > [   39.150960]        kthread+0x147/0x180
> > [   39.155146]        ret_from_fork+0x2a/0x40
> 
> That's the data IO completion locking the inode inside a transaction
> to update the inode size inside a workqueue.
> 
> 
> > [   38.962397] -> #2 ((complete)&bp->b_iowait){+.+.}:
> > [   38.969401]        __lock_acquire+0x10a1/0x1100
> > [   38.974460]        lock_acquire+0xea/0x1f0
> > [   38.979036]        wait_for_completion+0x3b/0x130
> > [   38.984285]        xfs_buf_submit_wait+0xb2/0x590
> > [   38.989535]        _xfs_buf_read+0x23/0x30
> > [   38.994108]        xfs_buf_read_map+0x14f/0x320
> > [   38.999169]        xfs_trans_read_buf_map+0x170/0x5d0
> > [   39.004812]        xfs_read_agf+0x97/0x1d0
> > [   39.009386]        xfs_alloc_read_agf+0x60/0x240
> > [   39.014541]        xfs_alloc_fix_freelist+0x32a/0x3d0
> > [   39.020180]        xfs_free_extent_fix_freelist+0x6b/0xa0
> > [   39.026206]        xfs_free_extent+0x48/0x120
> > [   39.031068]        xfs_trans_free_extent+0x57/0x200
> > [   39.036512]        xfs_extent_free_finish_item+0x26/0x40
> > [   39.042442]        xfs_defer_finish+0x174/0x770
> > [   39.047494]        xfs_itruncate_extents+0x126/0x470
> > [   39.053035]        xfs_setattr_size+0x2a1/0x310
> > [   39.058093]        xfs_vn_setattr_size+0x57/0x160
> > [   39.063342]        xfs_vn_setattr+0x50/0x70
> > [   39.068015]        notify_change+0x2ee/0x420
> > [   39.072785]        do_truncate+0x5d/0x90
> > [   39.077165]        path_openat+0xab2/0xc00
> > [   39.081737]        do_filp_open+0x8a/0xf0
> > [   39.086213]        do_sys_open+0x123/0x200
> > [   39.090787]        SyS_open+0x1e/0x20
> > [   39.094876]        entry_SYSCALL_64_fastpath+0x23/0xc2
> 
> And that's waiting for a metadata read IO to complete during a
> truncate transaction. This has no direct connection to the inode at
> all - it's a allocation group header that we have to lock to do
> block allocation. The completion it is waiting on doesn't even run
> through the same workqueue as the ioends - ioends go through
> mp->m_data_workqueue or mp->m_unwritten_workqueue, metadata buffers
> go through mp->m_buf_workqueue or mp->m_log_workqueue.
> 
> So from that perspective, an ioend blocked on an inode lock should
> not ever prevent metadata buffer completions from being run. Hence
> I'd call this a false positive at best, though I think it really
> indicates a bug in the new lockdep code because it isn't
> discriminating between different workqueue contexts properly.

Agreed. The interaction with workqueues is buggered.

> Even if I ignore the fact that buffer completions are run on
> different workqueues, there seems to be a bigger problem with this
> sort of completion checking.
> 
> That is, the trace looks plausible because we are definitely hold an
> inode locked deep inside a truncate operation where the completion
> if flagged.  Indeed, some transactions that would flag like this
> could be holding up to 5 inodes locked and have tens of other
> metadata objects locked. There are potentially tens (maybe even
> hundreds) of different paths into this IO wait point, and all have
> different combinations of objects locked when it triggers. So
> there's massive scope for potential deadlocks....
> 
> .... and so we must have some way of avoiding this whole class of
> problems that lockdep is unaware of.


So I did the below little hack, which basically wipes the entire lock
history when we start a work and thereby disregards/looses the
dependency on the work 'lock'.

It makes my test box able to boot and build a kernel on XFS, so while I
see what you're saying (I think), it doesn't appear to instantly show.

Should I run xfstests or something to further verify things are OK? Does
that need a scratch partition (I keep forgetting how to run that stuff
:/).

---
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 66011c9f5df3..de91cdce9460 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -4756,10 +4756,14 @@ void crossrelease_hist_start(enum xhlock_context_t c)
 {
 	struct task_struct *cur = current;
 
-	if (cur->xhlocks) {
-		cur->xhlock_idx_hist[c] = cur->xhlock_idx;
-		cur->hist_id_save[c] = cur->hist_id;
-	}
+	if (!cur->xhlocks)
+		return;
+
+	if (c == XHLOCK_PROC)
+		invalidate_xhlock(&xhlock(cur->xhlock_idx));
+
+	cur->xhlock_idx_hist[c] = cur->xhlock_idx;
+	cur->hist_id_save[c] = cur->hist_id;
 }
 
 void crossrelease_hist_end(enum xhlock_context_t c)