linux-kernel - Re: frequent softlockups with 3.10rc6.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130628035825.GC29338@dastard>
Date:	Fri, 28 Jun 2013 13:58:25 +1000
From:	Dave Chinner <david@...morbit.com>
To:	Dave Jones <davej@...hat.com>, Oleg Nesterov <oleg@...hat.com>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	"Eric W. Biederman" <ebiederm@...ssion.com>,
	Andrey Vagin <avagin@...nvz.org>,
	Steven Rostedt <rostedt@...dmis.org>
Subject: Re: frequent softlockups with 3.10rc6.

On Fri, Jun 28, 2013 at 11:13:01AM +1000, Dave Chinner wrote:
> On Thu, Jun 27, 2013 at 11:21:51AM -0400, Dave Jones wrote:
> > On Thu, Jun 27, 2013 at 10:52:18PM +1000, Dave Chinner wrote:
> >  
> >  
> >  > > Yup, that's about three of orders of magnitude faster on this
> >  > > workload....
> >  > > 
> >  > > Lightly smoke tested patch below - it passed the first round of
> >  > > XFS data integrity tests in xfstests, so it's not completely
> >  > > busted...
> >  > 
> >  > And now with even less smoke out that the first version. This one
> >  > gets though a full xfstests run...
> > 
> > :sadface:
> > 
> > [  567.680836] ======================================================
> > [  567.681582] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
> > [  567.682389] 3.10.0-rc7+ #9 Not tainted
> > [  567.682862] ------------------------------------------------------
> > [  567.683607] trinity-child2/8665 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> > [  567.684464]  (&sb->s_type->i_lock_key#3){+.+...}, at: [<ffffffff811d74e5>] sync_inodes_sb+0x225/0x3b0
> > [  567.685632] 
> > and this task is already holding:
> > [  567.686334]  (&(&wb->wb_list_lock)->rlock){..-...}, at: [<ffffffff811d7451>] sync_inodes_sb+0x191/0x3b0
> > [  567.687506] which would create a new lock dependency:
> > [  567.688115]  (&(&wb->wb_list_lock)->rlock){..-...} -> (&sb->s_type->i_lock_key#3){+.+...}
> 
> .....
> 
> > other info that might help us debug this:
> > 
> > [  567.750396]  Possible interrupt unsafe locking scenario:
> > 
> > [  567.752062]        CPU0                    CPU1
> > [  567.753025]        ----                    ----
> > [  567.753981]   lock(&sb->s_type->i_lock_key#3);
> > [  567.754969]                                local_irq_disable();
> > [  567.756085]                                lock(&(&wb->wb_list_lock)->rlock);
> > [  567.757368]                                lock(&sb->s_type->i_lock_key#3);
> > [  567.758642]   <Interrupt>
> > [  567.759370]     lock(&(&wb->wb_list_lock)->rlock);
> 
> Oh, that's easy enough to fix. It's just changing the wait_sb_inodes
> loop to use a spin_trylock(&inode->i_lock), moving the inode to
> the end of the sync list, dropping all locks and starting again...

New version below, went through xfstests with lockdep enabled this
time....

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com

writeback: store inodes under writeback on a separate list

From: Dave Chinner <dchinner@...hat.com>

When there are lots of cached inodes, a sync(2) operation walks all
of them to try to find which ones are under writeback and wait for
IO completion on them. Run enough load, and this caused catastrophic
lock contention on the inode_sb_list_lock.

Try to fix this problem by tracking inodes under data IO and wait
specifically for only those inodes that haven't completed their data
IO in wait_sb_inodes().

This is a bit hacky and messy, but demonstrates one method of
solving this problem....

XXX: This will catch data IO - do we need to catch actual inode
writeback (i.e. the metadata) here? I'm pretty sure we don't need to
because the existing code just calls filemap_fdatawait() and that
doesn't wait for the inode metadata writeback to occur....

[v3 - avoid deadlock due to interrupt while holding inode->i_lock]

[v2 - needs spin_lock_irq variants in wait_sb_inodes.
    - move freeing inodes back to primary list, we don't wait for
      them
    - take mapping lock in wait_sb_inodes when requeuing.]

Signed-off-by: Dave Chinner <dchinner@...hat.com>
---
 fs/fs-writeback.c           |   70 +++++++++++++++++++++++++++++++++++++------
 fs/inode.c                  |    1 +
 include/linux/backing-dev.h |    3 ++
 include/linux/fs.h          |    3 +-
 mm/backing-dev.c            |    2 ++
 mm/page-writeback.c         |   22 ++++++++++++++
 6 files changed, 91 insertions(+), 10 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3be5718..589c40b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1208,7 +1208,10 @@ EXPORT_SYMBOL(__mark_inode_dirty);
 
 static void wait_sb_inodes(struct super_block *sb)
 {
-	struct inode *inode, *old_inode = NULL;
+	struct backing_dev_info *bdi = sb->s_bdi;
+	struct inode *old_inode = NULL;
+	unsigned long flags;
+	LIST_HEAD(sync_list);
 
 	/*
 	 * We need to be protected against the filesystem going from
@@ -1216,7 +1219,6 @@ static void wait_sb_inodes(struct super_block *sb)
 	 */
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	spin_lock(&inode_sb_list_lock);
 
 	/*
 	 * Data integrity sync. Must wait for all pages under writeback,
@@ -1224,19 +1226,58 @@ static void wait_sb_inodes(struct super_block *sb)
 	 * call, but which had writeout started before we write it out.
 	 * In which case, the inode may not be on the dirty list, but
 	 * we still have to wait for that writeout.
+	 *
+	 * To avoid syncing inodes put under IO after we have started here,
+	 * splice the io list to a temporary list head and walk that. Newly
+	 * dirtied inodes will go onto the primary list so we won't wait for
+	 * them.
+	 *
+	 * Inodes that have pages under writeback after we've finished the wait
+	 * may or may not be on the primary list. They had pages under put IOi
+	 * after
+	 * we started our wait, so we need to make sure the next sync or IO
+	 * completion treats them correctly, Move them back to the primary list
+	 * and restart the walk.
+	 *
+	 * Inodes that are clean after we have waited for them don't belong
+	 * on any list, and the cleaning of them should have removed them from
+	 * the temporary list. Check this is true, and restart the walk.
 	 */
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	spin_lock_irqsave(&bdi->wb.wb_list_lock, flags);
+	list_splice_init(&bdi->wb.b_wb, &sync_list);
+
+	while (!list_empty(&sync_list)) {
+		struct inode *inode = list_first_entry(&sync_list, struct inode,
+						       i_io_list);
 		struct address_space *mapping = inode->i_mapping;
 
-		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (mapping->nrpages == 0)) {
+		/*
+		 * we are nesting the inode->i_lock inside a IRQ disabled
+		 * section here, so there's the possibility that we could have
+		 * a lock inversion due to an interrupt while holding the
+		 * inode->i_lock elsewhere. This is the only place we take the
+		 * inode->i_lock inside the wb_list_lock, so we need to use a
+		 * trylock to avoid a deadlock. If we fail to get the lock,
+		 * the only way to make progress is to also drop the
+		 * wb_list_lock so the interrupt trying to get it can make
+		 * progress.
+		 */
+		if (!spin_trylock(&inode->i_lock)) {
+			list_move(&inode->i_io_list, &bdi->wb.b_wb);
+			spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);
+			cpu_relax();
+			spin_lock_irqsave(&bdi->wb.wb_list_lock, flags);
+			continue;
+		}
+
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+			list_move(&inode->i_io_list, &bdi->wb.b_wb);
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_sb_list_lock);
+		spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);
 
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
@@ -1253,9 +1294,20 @@ static void wait_sb_inodes(struct super_block *sb)
 
 		cond_resched();
 
-		spin_lock(&inode_sb_list_lock);
+		/*
+		 * the inode has been written back now, so check whether we
+		 * still have pages under IO and move it back to the primary
+		 * list if necessary
+		 */
+		spin_lock_irqsave(&mapping->tree_lock, flags);
+		spin_lock(&bdi->wb.wb_list_lock);
+		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+			WARN_ON(list_empty(&inode->i_io_list));
+			list_move(&inode->i_io_list, &bdi->wb.b_wb);
+		}
+		spin_unlock(&mapping->tree_lock);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);
 	iput(old_inode);
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index 00d5fc3..f25c1ca 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -364,6 +364,7 @@ void inode_init_once(struct inode *inode)
 	INIT_HLIST_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_LIST_HEAD(&inode->i_wb_list);
+	INIT_LIST_HEAD(&inode->i_io_list);
 	INIT_LIST_HEAD(&inode->i_lru);
 	address_space_init_once(&inode->i_data);
 	i_size_ordered_init(inode);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c388155..4a6283c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -59,6 +59,9 @@ struct bdi_writeback {
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
 	spinlock_t list_lock;		/* protects the b_* lists */
+
+	spinlock_t wb_list_lock;	/* writeback list lock */
+	struct list_head b_wb;		/* under writeback, for wait_sb_inodes */
 };
 
 struct backing_dev_info {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 63cac31..7861017 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -573,7 +573,8 @@ struct inode {
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
 
 	struct hlist_node	i_hash;
-	struct list_head	i_wb_list;	/* backing dev IO list */
+	struct list_head	i_wb_list;	/* backing dev WB list */
+	struct list_head	i_io_list;	/* backing dev IO list */
 	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
 	union {
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 5025174..896b8f5 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -421,7 +421,9 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
 	INIT_LIST_HEAD(&wb->b_dirty);
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
+	INIT_LIST_HEAD(&wb->b_wb);
 	spin_lock_init(&wb->list_lock);
+	spin_lock_init(&wb->wb_list_lock);
 	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
 }
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4514ad7..4c411fe 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2238,6 +2238,15 @@ int test_clear_page_writeback(struct page *page)
 				__dec_bdi_stat(bdi, BDI_WRITEBACK);
 				__bdi_writeout_inc(bdi);
 			}
+			if (mapping->host &&
+			    !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+				struct inode *inode = mapping->host;
+
+				WARN_ON(list_empty(&inode->i_io_list));
+				spin_lock(&bdi->wb.wb_list_lock);
+				list_del_init(&inode->i_io_list);
+				spin_unlock(&bdi->wb.wb_list_lock);
+			}
 		}
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
@@ -2262,11 +2271,24 @@ int test_set_page_writeback(struct page *page)
 		spin_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestSetPageWriteback(page);
 		if (!ret) {
+			bool on_wblist;
+
+			/* __swap_writepage comes through here */
+			on_wblist = mapping_tagged(mapping,
+						   PAGECACHE_TAG_WRITEBACK);
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
 			if (bdi_cap_account_writeback(bdi))
 				__inc_bdi_stat(bdi, BDI_WRITEBACK);
+			if (!on_wblist && mapping->host) {
+				struct inode *inode = mapping->host;
+
+				WARN_ON(!list_empty(&inode->i_io_list));
+				spin_lock(&bdi->wb.wb_list_lock);
+				list_add_tail(&inode->i_io_list, &bdi->wb.b_wb);
+				spin_unlock(&bdi->wb.wb_list_lock);
+			}
 		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/