linux-kernel - Re: frequent softlockups with 3.10rc6.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Fri, 28 Jun 2013 11:49:15 +0200
From:	Jan Kara <jack@...e.cz>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Dave Chinner <david@...morbit.com>,
	Al Viro <viro@...iv.linux.org.uk>, Jan Kara <jack@...e.cz>,
	Dave Jones <davej@...hat.com>, Oleg Nesterov <oleg@...hat.com>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	"Eric W. Biederman" <ebiederm@...ssion.com>,
	Andrey Vagin <avagin@...nvz.org>,
	Steven Rostedt <rostedt@...dmis.org>
Subject: Re: frequent softlockups with 3.10rc6.

On Thu 27-06-13 19:59:50, Linus Torvalds wrote:
> On Thu, Jun 27, 2013 at 5:54 PM, Dave Chinner <david@...morbit.com> wrote:
> > On Thu, Jun 27, 2013 at 04:54:53PM -1000, Linus Torvalds wrote:
> >>
> >> So what made it all start happening now? I don't recall us having had
> >> these kinds of issues before..
> >
> > Not sure - it's a sudden surprise for me, too. Then again, I haven't
> > been looking at sync from a performance or lock contention point of
> > view any time recently.  The algorithm that wait_sb_inodes() is
> > effectively unchanged since at least 2009, so it's probably a case
> > of it having been protected from contention by some external factor
> > we've fixed/removed recently.  Perhaps the bdi-flusher thread
> > replacement in -rc1 has changed the timing sufficiently that it no
> > longer serialises concurrent sync calls as much....
> >
> > However, the inode_sb_list_lock is known to be a badly contended
> > lock from a create/unlink fastpath for XFS, so it's not like this sort
> > of thing is completely unexpected.
> 
> That whole inode_sb_list_lock seems moronic. Why isn't it a per-sb
> one? No, that won't fix all problems, but it might at least help a
> *bit*.
> 
> Also, looking some more now at that wait_sb_inodes logic, I have to
> say that if the problem is primarily the inode->i_lock, then that's
> just crazy. We normally shouldn't even *need* that lock, since we
> could do a totally unlocked iget() as long as the count is non-zero.
> 
> And no, I don't think really need the i_lock for checking
> "mapping->nrpages == 0" or the magical "inode is being freed" bits
> either. Or at least we could easily do some of this optimistically for
> the common cases.
> 
> So instead of doing
> 
>                 struct address_space *mapping = inode->i_mapping;
> 
>                 spin_lock(&inode->i_lock);
>                 if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
>                     (mapping->nrpages == 0)) {
>                         spin_unlock(&inode->i_lock);
>                         continue;
>                 }
>                 __iget(inode);
>                 spin_unlock(&inode->i_lock);
> 
> I really think we could do that without getting the inode lock at
> *all* in the common case.
> 
> I'm attaching a pretty trivial patch, which may obviously be trivially
> totally flawed. I have not tested this in any way, but half the new
> lines are comments about why it's doing what it is doing.  And I
> really think that it should make the "actually take the inode lock" be
> something quite rare.
> 
> And quite frankly, I'd much rather get *rid* of crazy i_lock accesses,
> than try to be clever and use a whole different list at this point.
> Not that I disagree that it wouldn't be much nicer to use a separate
> list in the long run, but for a short-term solution I'd much rather
> keep the old logic and just tweak it to be much more usable..
> 
> Hmm? Al? Jan? Comments?
  Yeah, the patch looks good to me so if it helps Dave with his softlockups
I also think it's a safer alternative than Dave's patch for 3.10. BTW, one
suggestion for improvement below:

 fs/fs-writeback.c | 58 +++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 48 insertions(+), 10 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3be57189efd5..3dcc8b202a40 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1206,6 +1206,52 @@ out_unlock_inode:
 }
 EXPORT_SYMBOL(__mark_inode_dirty);
 
+/*
+ * Do we want to get the inode for writeback?
+ */
+static int get_inode_for_writeback(struct inode *inode)
+{
+	struct address_space *mapping = inode->i_mapping;
+
+	/*
+	 * It's a data integrity sync, but we don't care about
+	 * racing with new pages - we're about data integrity
+	 * of things in the past, not the future
+	 */
+	if (!ACCESS_ONCE(mapping->nrpages))
+		return 0;
  I think we can change the above condition to:
	if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
		return 0;

  That should make us skip most of the inodes in the case Dave Chinner was
testing.

								Honza
+
+	/* Similar logic wrt the I_NEW bit */
+	if (ACCESS_ONCE(inode->i_state) & I_NEW)
+		return 0;
+
+	/*
+	 * When the inode count goes down to zero, the
+	 * I_WILL_FREE and I_FREEING bits might get set.
+	 * But not before.
+	 *
+	 * So if we get this, we know those bits are
+	 * clear, and the inode is still interesting.
+	 */
+	if (atomic_inc_not_zero(&inode->i_count))
+		return 1;
+
+	/*
+	 * Slow path never happens normally, since any
+	 * active inode will be referenced by a dentry
+	 * and thus caught above
+	 */
+	spin_lock(&inode->i_lock);
+	if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+	    (mapping->nrpages == 0)) {
+		spin_unlock(&inode->i_lock);
+		return 0;
+	}
+	__iget(inode);
+	spin_unlock(&inode->i_lock);
+	return 1;
+}
+
 static void wait_sb_inodes(struct super_block *sb)
 {
 	struct inode *inode, *old_inode = NULL;
@@ -1226,16 +1272,8 @@ static void wait_sb_inodes(struct super_block *sb)
 	 * we still have to wait for that writeout.
 	 */
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		struct address_space *mapping = inode->i_mapping;
-
-		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (mapping->nrpages == 0)) {
-			spin_unlock(&inode->i_lock);
+		if (!get_inode_for_writeback(inode))
 			continue;
-		}
-		__iget(inode);
-		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_sb_list_lock);
 
 		/*
@@ -1249,7 +1287,7 @@ static void wait_sb_inodes(struct super_block *sb)
 		iput(old_inode);
 		old_inode = inode;
 
-		filemap_fdatawait(mapping);
+		filemap_fdatawait(inode->i_mapping);
 
 		cond_resched();
 


-- 
Jan Kara <jack@...e.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/