linux-kernel - Re: frequent softlockups with 3.10rc6.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130628102819.GA4725@quack.suse.cz>
Date:	Fri, 28 Jun 2013 12:28:19 +0200
From:	Jan Kara <jack@...e.cz>
To:	Dave Chinner <david@...morbit.com>
Cc:	Dave Jones <davej@...hat.com>, Oleg Nesterov <oleg@...hat.com>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	"Eric W. Biederman" <ebiederm@...ssion.com>,
	Andrey Vagin <avagin@...nvz.org>,
	Steven Rostedt <rostedt@...dmis.org>
Subject: Re: frequent softlockups with 3.10rc6.

On Fri 28-06-13 13:58:25, Dave Chinner wrote:
> On Fri, Jun 28, 2013 at 11:13:01AM +1000, Dave Chinner wrote:
> > On Thu, Jun 27, 2013 at 11:21:51AM -0400, Dave Jones wrote:
> > > On Thu, Jun 27, 2013 at 10:52:18PM +1000, Dave Chinner wrote:
> > >  
> > >  
> > >  > > Yup, that's about three of orders of magnitude faster on this
> > >  > > workload....
> > >  > > 
> > >  > > Lightly smoke tested patch below - it passed the first round of
> > >  > > XFS data integrity tests in xfstests, so it's not completely
> > >  > > busted...
> > >  > 
> > >  > And now with even less smoke out that the first version. This one
> > >  > gets though a full xfstests run...
> > > 
> > > :sadface:
> > > 
> > > [  567.680836] ======================================================
> > > [  567.681582] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
> > > [  567.682389] 3.10.0-rc7+ #9 Not tainted
> > > [  567.682862] ------------------------------------------------------
> > > [  567.683607] trinity-child2/8665 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> > > [  567.684464]  (&sb->s_type->i_lock_key#3){+.+...}, at: [<ffffffff811d74e5>] sync_inodes_sb+0x225/0x3b0
> > > [  567.685632] 
> > > and this task is already holding:
> > > [  567.686334]  (&(&wb->wb_list_lock)->rlock){..-...}, at: [<ffffffff811d7451>] sync_inodes_sb+0x191/0x3b0
> > > [  567.687506] which would create a new lock dependency:
> > > [  567.688115]  (&(&wb->wb_list_lock)->rlock){..-...} -> (&sb->s_type->i_lock_key#3){+.+...}
> > 
> > .....
> > 
> > > other info that might help us debug this:
> > > 
> > > [  567.750396]  Possible interrupt unsafe locking scenario:
> > > 
> > > [  567.752062]        CPU0                    CPU1
> > > [  567.753025]        ----                    ----
> > > [  567.753981]   lock(&sb->s_type->i_lock_key#3);
> > > [  567.754969]                                local_irq_disable();
> > > [  567.756085]                                lock(&(&wb->wb_list_lock)->rlock);
> > > [  567.757368]                                lock(&sb->s_type->i_lock_key#3);
> > > [  567.758642]   <Interrupt>
> > > [  567.759370]     lock(&(&wb->wb_list_lock)->rlock);
> > 
> > Oh, that's easy enough to fix. It's just changing the wait_sb_inodes
> > loop to use a spin_trylock(&inode->i_lock), moving the inode to
> > the end of the sync list, dropping all locks and starting again...
> 
> New version below, went through xfstests with lockdep enabled this
> time....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@...morbit.com
> 
> writeback: store inodes under writeback on a separate list
> 
> From: Dave Chinner <dchinner@...hat.com>
> 
> When there are lots of cached inodes, a sync(2) operation walks all
> of them to try to find which ones are under writeback and wait for
> IO completion on them. Run enough load, and this caused catastrophic
> lock contention on the inode_sb_list_lock.
> 
> Try to fix this problem by tracking inodes under data IO and wait
> specifically for only those inodes that haven't completed their data
> IO in wait_sb_inodes().
> 
> This is a bit hacky and messy, but demonstrates one method of
> solving this problem....
> 
> XXX: This will catch data IO - do we need to catch actual inode
> writeback (i.e. the metadata) here? I'm pretty sure we don't need to
> because the existing code just calls filemap_fdatawait() and that
> doesn't wait for the inode metadata writeback to occur....
> 
> [v3 - avoid deadlock due to interrupt while holding inode->i_lock]
> 
> [v2 - needs spin_lock_irq variants in wait_sb_inodes.
>     - move freeing inodes back to primary list, we don't wait for
>       them
>     - take mapping lock in wait_sb_inodes when requeuing.]
> 
> Signed-off-by: Dave Chinner <dchinner@...hat.com>
> ---
>  fs/fs-writeback.c           |   70 +++++++++++++++++++++++++++++++++++++------
>  fs/inode.c                  |    1 +
>  include/linux/backing-dev.h |    3 ++
>  include/linux/fs.h          |    3 +-
>  mm/backing-dev.c            |    2 ++
>  mm/page-writeback.c         |   22 ++++++++++++++
>  6 files changed, 91 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 3be5718..589c40b 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1208,7 +1208,10 @@ EXPORT_SYMBOL(__mark_inode_dirty);
>  
>  static void wait_sb_inodes(struct super_block *sb)
>  {
> -	struct inode *inode, *old_inode = NULL;
> +	struct backing_dev_info *bdi = sb->s_bdi;
> +	struct inode *old_inode = NULL;
> +	unsigned long flags;
> +	LIST_HEAD(sync_list);
>  
>  	/*
>  	 * We need to be protected against the filesystem going from
> @@ -1216,7 +1219,6 @@ static void wait_sb_inodes(struct super_block *sb)
>  	 */
>  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>  
> -	spin_lock(&inode_sb_list_lock);
>  
>  	/*
>  	 * Data integrity sync. Must wait for all pages under writeback,
> @@ -1224,19 +1226,58 @@ static void wait_sb_inodes(struct super_block *sb)
>  	 * call, but which had writeout started before we write it out.
>  	 * In which case, the inode may not be on the dirty list, but
>  	 * we still have to wait for that writeout.
> +	 *
> +	 * To avoid syncing inodes put under IO after we have started here,
> +	 * splice the io list to a temporary list head and walk that. Newly
> +	 * dirtied inodes will go onto the primary list so we won't wait for
> +	 * them.
> +	 *
> +	 * Inodes that have pages under writeback after we've finished the wait
> +	 * may or may not be on the primary list. They had pages under put IOi
> +	 * after
> +	 * we started our wait, so we need to make sure the next sync or IO
> +	 * completion treats them correctly, Move them back to the primary list
> +	 * and restart the walk.
> +	 *
> +	 * Inodes that are clean after we have waited for them don't belong
> +	 * on any list, and the cleaning of them should have removed them from
> +	 * the temporary list. Check this is true, and restart the walk.
>  	 */
> -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +	spin_lock_irqsave(&bdi->wb.wb_list_lock, flags);
> +	list_splice_init(&bdi->wb.b_wb, &sync_list);
> +
> +	while (!list_empty(&sync_list)) {
> +		struct inode *inode = list_first_entry(&sync_list, struct inode,
> +						       i_io_list);
>  		struct address_space *mapping = inode->i_mapping;
>  
> -		spin_lock(&inode->i_lock);
> -		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> -		    (mapping->nrpages == 0)) {
> +		/*
> +		 * we are nesting the inode->i_lock inside a IRQ disabled
> +		 * section here, so there's the possibility that we could have
> +		 * a lock inversion due to an interrupt while holding the
> +		 * inode->i_lock elsewhere. This is the only place we take the
> +		 * inode->i_lock inside the wb_list_lock, so we need to use a
> +		 * trylock to avoid a deadlock. If we fail to get the lock,
> +		 * the only way to make progress is to also drop the
> +		 * wb_list_lock so the interrupt trying to get it can make
> +		 * progress.
> +		 */
> +		if (!spin_trylock(&inode->i_lock)) {
> +			list_move(&inode->i_io_list, &bdi->wb.b_wb);
> +			spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);
> +			cpu_relax();
> +			spin_lock_irqsave(&bdi->wb.wb_list_lock, flags);
> +			continue;
> +		}
> +
> +		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> +			list_move(&inode->i_io_list, &bdi->wb.b_wb);
>  			spin_unlock(&inode->i_lock);
>  			continue;
>  		}
  Ugh, the locking looks ugly. Plus the list handling is buggy because the
first wait_sb_inodes() invocation will move all inodes to its private
sync_list so if there's another wait_sb_inodes() invocation racing with it,
it won't wait properly for all the inodes it should.

Won't it be easier to remove inodes from b_wb list (btw, I'd slightly
prefer name b_writeback) lazily instead of from
test_clear_page_writeback()? I mean we would remove inodes from b_wb list
only in wait_sb_inodes() or when inodes get reclaimed from memory. That way
we save some work in test_clear_page_writeback() which is a fast path and
defer it to sync which isn't that performance critical. Also we would avoid
that ugly games with irq safe spinlocks.

>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> -		spin_unlock(&inode_sb_list_lock);
> +		spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);
>  
>  		/*
>  		 * We hold a reference to 'inode' so it couldn't have been
> @@ -1253,9 +1294,20 @@ static void wait_sb_inodes(struct super_block *sb)
>  
>  		cond_resched();
>  
> -		spin_lock(&inode_sb_list_lock);
> +		/*
> +		 * the inode has been written back now, so check whether we
> +		 * still have pages under IO and move it back to the primary
> +		 * list if necessary
> +		 */
> +		spin_lock_irqsave(&mapping->tree_lock, flags);
> +		spin_lock(&bdi->wb.wb_list_lock);
> +		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> +			WARN_ON(list_empty(&inode->i_io_list));
> +			list_move(&inode->i_io_list, &bdi->wb.b_wb);
> +		}
> +		spin_unlock(&mapping->tree_lock);
>  	}
> -	spin_unlock(&inode_sb_list_lock);
> +	spin_unlock_irqrestore(&bdi->wb.wb_list_lock, flags);
>  	iput(old_inode);
>  }
>  
> diff --git a/fs/inode.c b/fs/inode.c
> index 00d5fc3..f25c1ca 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -364,6 +364,7 @@ void inode_init_once(struct inode *inode)
>  	INIT_HLIST_NODE(&inode->i_hash);
>  	INIT_LIST_HEAD(&inode->i_devices);
>  	INIT_LIST_HEAD(&inode->i_wb_list);
> +	INIT_LIST_HEAD(&inode->i_io_list);
>  	INIT_LIST_HEAD(&inode->i_lru);
>  	address_space_init_once(&inode->i_data);
>  	i_size_ordered_init(inode);
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index c388155..4a6283c 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -59,6 +59,9 @@ struct bdi_writeback {
>  	struct list_head b_io;		/* parked for writeback */
>  	struct list_head b_more_io;	/* parked for more writeback */
>  	spinlock_t list_lock;		/* protects the b_* lists */
> +
> +	spinlock_t wb_list_lock;	/* writeback list lock */
> +	struct list_head b_wb;		/* under writeback, for wait_sb_inodes */
>  };
>  
>  struct backing_dev_info {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 63cac31..7861017 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -573,7 +573,8 @@ struct inode {
>  	unsigned long		dirtied_when;	/* jiffies of first dirtying */
>  
>  	struct hlist_node	i_hash;
> -	struct list_head	i_wb_list;	/* backing dev IO list */
> +	struct list_head	i_wb_list;	/* backing dev WB list */
> +	struct list_head	i_io_list;	/* backing dev IO list */
>  	struct list_head	i_lru;		/* inode LRU list */
>  	struct list_head	i_sb_list;
>  	union {
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 5025174..896b8f5 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -421,7 +421,9 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
>  	INIT_LIST_HEAD(&wb->b_dirty);
>  	INIT_LIST_HEAD(&wb->b_io);
>  	INIT_LIST_HEAD(&wb->b_more_io);
> +	INIT_LIST_HEAD(&wb->b_wb);
>  	spin_lock_init(&wb->list_lock);
> +	spin_lock_init(&wb->wb_list_lock);
>  	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
>  }
>  
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 4514ad7..4c411fe 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2238,6 +2238,15 @@ int test_clear_page_writeback(struct page *page)
>  				__dec_bdi_stat(bdi, BDI_WRITEBACK);
>  				__bdi_writeout_inc(bdi);
>  			}
> +			if (mapping->host &&
> +			    !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> +				struct inode *inode = mapping->host;
> +
> +				WARN_ON(list_empty(&inode->i_io_list));
> +				spin_lock(&bdi->wb.wb_list_lock);
> +				list_del_init(&inode->i_io_list);
> +				spin_unlock(&bdi->wb.wb_list_lock);
> +			}
>  		}
>  		spin_unlock_irqrestore(&mapping->tree_lock, flags);
>  	} else {
> @@ -2262,11 +2271,24 @@ int test_set_page_writeback(struct page *page)
>  		spin_lock_irqsave(&mapping->tree_lock, flags);
>  		ret = TestSetPageWriteback(page);
>  		if (!ret) {
> +			bool on_wblist;
> +
> +			/* __swap_writepage comes through here */
> +			on_wblist = mapping_tagged(mapping,
> +						   PAGECACHE_TAG_WRITEBACK);
>  			radix_tree_tag_set(&mapping->page_tree,
>  						page_index(page),
>  						PAGECACHE_TAG_WRITEBACK);
>  			if (bdi_cap_account_writeback(bdi))
>  				__inc_bdi_stat(bdi, BDI_WRITEBACK);
> +			if (!on_wblist && mapping->host) {
> +				struct inode *inode = mapping->host;
> +
> +				WARN_ON(!list_empty(&inode->i_io_list));
> +				spin_lock(&bdi->wb.wb_list_lock);
> +				list_add_tail(&inode->i_io_list, &bdi->wb.b_wb);
> +				spin_unlock(&bdi->wb.wb_list_lock);
> +			}
>  		}
>  		if (!PageDirty(page))
>  			radix_tree_tag_clear(&mapping->page_tree,
  I'm somewhat uneasy about this. Writeback code generally uses
inode_to_bdi() function to get from the mapping to backing_dev_info (which
uses inode->i_sb->s_bdi except for inodes on blockdev_superblock). That isn't
always the same as inode->i_mapping->backing_dev_info used here although
I now fail to remember a case where inode->i_mapping->backing_dev_info
would be a wrong bdi to use for sync purposes.

								Honza
-- 
Jan Kara <jack@...e.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/