Date:	Mon, 2 Feb 2009 12:36:40 -0500 (EST)
From:	Mikulas Patocka <mpatocka@...hat.com>
To:	Dave Chinner <david@...morbit.com>
cc:	Christoph Hellwig <hch@...radead.org>, xfs@....sgi.com,
	linux-kernel@...r.kernel.org
Subject: Re: spurious -ENOSPC on XFS



On Sun, 1 Feb 2009, Dave Chinner wrote:

> On Thu, Jan 29, 2009 at 11:39:00AM -0500, Mikulas Patocka wrote:
> > On Sat, 24 Jan 2009, Dave Chinner wrote:
> > > On Fri, Jan 23, 2009 at 03:14:30PM -0500, Mikulas Patocka wrote:
> > > > If I placed
> > > > xfs_sync_inodes(ip->i_mount, SYNC_DELWRI);
> > > > xfs_sync_inodes(ip->i_mount, SYNC_DELWRI | SYNC_IOWAIT);
> > > > directly to xfs_flush_device, I got lock dependency warning (though not a 
> > > > real deadlock).
> > > 
> > > Same reason memory reclaim gives lockdep warnings on XFS - we're
> > > recursing into operations that take inode locks while we currently
> > > hold an inode lock.  However, it shouldn't deadlock, because
> > > we should never try to take the iolock on an inode on which we
> > > currently hold it.
> > > 
> > > > So synchronous flushing definitely needs some thinking and lock 
> > > > rearchitecting.
> > > 
> > > No, not at all. At most the grabbing of the iolock in
> > > xfs_sync_inodes_ag() needs to become a trylock....
> > 
> > You are wrong, the comments in the code are right. It really
> > deadlocks if it is modified to use synchronous wait for the end of
> > the flush and if the flush is done with xfs_sync_inodes:
> > 
> > xfssyncd      D 0000000000606808     0  4819      2
> > Call Trace:
> >  [0000000000606788] rwsem_down_failed_common+0x1ac/0x1d8
> >  [0000000000606808] rwsem_down_read_failed+0x24/0x34
> >  [0000000000606848] __down_read+0x30/0x40
> >  [0000000000468a64] down_read_nested+0x48/0x58
> >  [00000000100e6cc8] xfs_ilock+0x48/0x8c [xfs]
> >  [000000001011018c] xfs_sync_inodes_ag+0x17c/0x2dc [xfs]
> >  [000000001011034c] xfs_sync_inodes+0x60/0xc4 [xfs]
> >  [00000000101103c4] xfs_flush_device_work+0x14/0x2c [xfs]
> >  [000000001010ff1c] xfssyncd+0x110/0x174 [xfs]
> >  [000000000046556c] kthread+0x54/0x88
> >  [000000000042b2a0] kernel_thread+0x3c/0x54
> >  [00000000004653b8] kthreadd+0xac/0x160
> 
> So it is stuck:
> 
> 127                 /*
> 128                  * If we have to flush data or wait for I/O completion
> 129                  * we need to hold the iolock.
> 130                  */
> 131                 if ((flags & SYNC_DELWRI) && VN_DIRTY(inode)) {
> 132     >>>>>>>>            xfs_ilock(ip, XFS_IOLOCK_SHARED);
> 133                         lock_flags |= XFS_IOLOCK_SHARED;
> 134                         error = xfs_flush_pages(ip, 0, -1, fflag, FI_NONE);
> 135                         if (flags & SYNC_IOWAIT)
> 136                                 xfs_ioend_wait(ip);
> 137                 }
> 
> Given that we are stuck on the iolock in xfs_sync_inodes_ag(), I
> suspect you should re-read my comments above about "lock
> re-architecting" ;).

No, if you turn it into a trylock, it will randomly fail whenever any 
other process happens to be holding the lock for whatever reason. So you 
will still get spurious -ENOSPC errors, although with lower probability.

> If you make the xfs_ilock() there xfs_ilock_nowait() and avoid data
> writeback if we don't get the lock the deadlock goes away, right?

But that premature ENOSPC possibility will still exist.

The only correct solution that I see is to drop this flushing code 
entirely (because it's unfixable), catch -ENOSPC just before you are 
about to return it to the Linux VFS, flush at that point (you are not 
holding any locks there), and retry.

There seem to be more bugs with forgetting to flush delayed allocations 
--- you should flush delayed allocations after *every* failed allocation 
attempt, but the code flushes them only after a failed *delayed* 
allocation attempt --- i.e. non-delayed allocators, such as 
xfs_iomap_write_direct (and maybe other places), will still return 
-ENOSPC without attempting to flush other inodes with delayed 
allocations.

The syncing should really be moved to the topmost place.

Mikulas

> BTW, can you post the patch you are working on?
> 
> Cheers,
> 
> Dave.

This is what I tried: I turned that 500ms wait into a completion:


Use xfs_sync_inodes instead of sync_blockdev; sync_blockdev writes dirty 
metadata buffers but doesn't touch inodes or pages at all.

Also, change that 500ms delay to a wait for completion, so that the
behavior is not dependent on magic timeout values. XFS developers insist
that it can't deadlock (I hope they're right).

Signed-off-by: Mikulas Patocka <mpatocka@...hat.com>

---
 fs/xfs/linux-2.6/xfs_sync.c |   25 +++++++++++++++++++------
 fs/xfs/linux-2.6/xfs_sync.h |    1 +
 2 files changed, 20 insertions(+), 6 deletions(-)

Index: linux-2.6.29-rc3-devel/fs/xfs/linux-2.6/xfs_sync.c
===================================================================
--- linux-2.6.29-rc3-devel.orig/fs/xfs/linux-2.6/xfs_sync.c	2009-01-29 15:49:25.000000000 +0100
+++ linux-2.6.29-rc3-devel/fs/xfs/linux-2.6/xfs_sync.c	2009-01-29 16:57:53.000000000 +0100
@@ -389,12 +389,15 @@ xfs_quiesce_attr(
  * - It saves on stack space, which is tight in certain situations
  * - It can be used (with care) as a mechanism to avoid deadlocks.
  * Flushing while allocating in a full filesystem requires both.
+ *
+ * Dave Chinner claims that the deadlock can't happen. -- mikulas
  */
 STATIC void
 xfs_syncd_queue_work(
 	struct xfs_mount *mp,
 	void		*data,
-	void		(*syncer)(struct xfs_mount *, void *))
+	void		(*syncer)(struct xfs_mount *, void *),
+	struct completion *completion)
 {
 	struct bhv_vfs_sync_work *work;
 
@@ -403,6 +406,7 @@ xfs_syncd_queue_work(
 	work->w_syncer = syncer;
 	work->w_data = data;
 	work->w_mount = mp;
+	work->w_completion = completion;
 	spin_lock(&mp->m_sync_lock);
 	list_add_tail(&work->w_list, &mp->m_sync_list);
 	spin_unlock(&mp->m_sync_lock);
@@ -415,6 +419,7 @@ xfs_syncd_queue_work(
  * has failed with ENOSPC and we are in the process of scratching our
  * heads, looking about for more room...
  */
+#if 0
 STATIC void
 xfs_flush_inode_work(
 	struct xfs_mount *mp,
@@ -424,16 +429,20 @@ xfs_flush_inode_work(
 	filemap_flush(inode->i_mapping);
 	iput(inode);
 }
+#endif
 
 void
 xfs_flush_inode(
 	xfs_inode_t	*ip)
 {
+#if 0
 	struct inode	*inode = VFS_I(ip);
+	DECLARE_COMPLETION_ONSTACK(completion);
 
 	igrab(inode);
-	xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_inode_work);
-	delay(msecs_to_jiffies(500));
+	xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_inode_work, &completion);
+	wait_for_completion(&completion);
+#endif
 }
 
 /*
@@ -446,7 +455,7 @@ xfs_flush_device_work(
 	void		*arg)
 {
 	struct inode	*inode = arg;
-	sync_blockdev(mp->m_super->s_bdev);
+	xfs_sync_inodes(mp, SYNC_DELWRI | SYNC_IOWAIT);
 	iput(inode);
 }
 
@@ -455,10 +464,11 @@ xfs_flush_device(
 	xfs_inode_t	*ip)
 {
 	struct inode	*inode = VFS_I(ip);
+	DECLARE_COMPLETION_ONSTACK(completion);
 
 	igrab(inode);
-	xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_device_work);
-	delay(msecs_to_jiffies(500));
+	xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_device_work, &completion);
+	wait_for_completion(&completion);
 	xfs_log_force(ip->i_mount, (xfs_lsn_t)0, XFS_LOG_FORCE|XFS_LOG_SYNC);
 }
 
@@ -526,6 +536,8 @@ xfssyncd(
 		list_for_each_entry_safe(work, n, &tmp, w_list) {
 			(*work->w_syncer)(mp, work->w_data);
 			list_del(&work->w_list);
+			if (work->w_completion)
+				complete(work->w_completion);
 			if (work == &mp->m_sync_work)
 				continue;
 			kmem_free(work);
@@ -541,6 +553,7 @@ xfs_syncd_init(
 {
 	mp->m_sync_work.w_syncer = xfs_sync_worker;
 	mp->m_sync_work.w_mount = mp;
+	mp->m_sync_work.w_completion = NULL;
 	mp->m_sync_task = kthread_run(xfssyncd, mp, "xfssyncd");
 	if (IS_ERR(mp->m_sync_task))
 		return -PTR_ERR(mp->m_sync_task);
Index: linux-2.6.29-rc3-devel/fs/xfs/linux-2.6/xfs_sync.h
===================================================================
--- linux-2.6.29-rc3-devel.orig/fs/xfs/linux-2.6/xfs_sync.h	2009-01-29 15:52:28.000000000 +0100
+++ linux-2.6.29-rc3-devel/fs/xfs/linux-2.6/xfs_sync.h	2009-01-29 15:52:56.000000000 +0100
@@ -25,6 +25,7 @@ typedef struct bhv_vfs_sync_work {
 	struct xfs_mount	*w_mount;
 	void			*w_data;	/* syncer routine argument */
 	void			(*w_syncer)(struct xfs_mount *, void *);
+	struct completion	*w_completion;
 } bhv_vfs_sync_work_t;
 
 #define SYNC_ATTR		0x0001	/* sync attributes */
--