linux-ext4 - Re: ENOSPC returned during writepages

Message-Id: <1219274535.7895.55.camel@mingming-laptop>
Date:	Wed, 20 Aug 2008 16:22:15 -0700
From:	Mingming Cao <cmm@...ibm.com>
To:	Theodore Tso <tytso@....edu>
Cc:	"Aneesh Kumar K.V" <aneesh.kumar@...ux.vnet.ibm.com>,
	ext4 development <linux-ext4@...r.kernel.org>
Subject: Re: ENOSPC returned during writepages


On Wed, 2008-08-20 at 13:56 -0700, Mingming Cao wrote:
> On Wed, 2008-08-20 at 07:53 -0400, Theodore Tso wrote:
> > On Wed, Aug 20, 2008 at 04:16:44PM +0530, Aneesh Kumar K.V wrote:
> > > > mpage_da_map_blocks block allocation failed for inode 323784 at logical
> > > > offset 313 with max blocks 11 with error -28
> > > > This should not happen.!! Data will be lost
> > 
> > We don't actually lose the data if free blocks are subsequently made
> > available, correct?
> > 
> > > I tried this patch. There are still multiple ways we can get a wrong free
> > > block count. The patch reduced the number of errors, so we are doing
> > > better with the patch. But I guess we can't use the percpu_counter based
> > > free block accounting with delalloc. Without delalloc it is OK even if
> > > we find a wrong free block count: the actual block allocation will fail in
> > > that case and we handle it perfectly fine. With delalloc we cannot
> > > afford to fail the block allocation. Should we look at a free block
> > > accounting rewrite using a simple ext4_fsblk_t and a spin lock?
> > 
> > It would be a shame if we did given that the whole point of the percpu
> > counter was to avoid a scalability bottleneck.  Perhaps we could take
> > a filesystem-level spinlock only when the number of free blocks as
> > reported by the percpu_counter falls below some critical level?
> > 
> 
> Agree, and perhaps we should fall back to non-delalloc mode if the fs
> free blocks fall below some critical level?

How about this?

ext4: fall back to non delalloc mode if filesystem is almost full
From: Mingming Cao <cmm@...ibm.com>

When the filesystem is close to full (free blocks below the watermark
NR_CPUS * 4) and there are not enough free blocks to reserve for delayed
allocation, fall back to non-delayed allocation mode instead of returning
ENOSPC to the user.

Signed-off-by: Mingming Cao <cmm@...ibm.com>
---
 fs/ext4/ext4.h  |    2 -
 fs/ext4/inode.c |   61 ++++++++++++++++++++++++++++++++++++++++++++------------
 fs/ext4/namei.c |    4 +--
 3 files changed, 51 insertions(+), 16 deletions(-)

Index: linux-2.6.27-rc3/fs/ext4/inode.c
===================================================================
--- linux-2.6.27-rc3.orig/fs/ext4/inode.c	2008-08-20 15:20:10.000000000 -0700
+++ linux-2.6.27-rc3/fs/ext4/inode.c	2008-08-20 16:13:48.000000000 -0700
@@ -2391,6 +2391,25 @@
 	return ret;
 }
 
+/*
+ * If the filesystem is almost full and delalloc could not reserve
+ * enough free blocks to prevent a later ENOSPC, fall back to
+ * non-delalloc mode.
+ */
+static int ext4_write_begin_nondelalloc(struct file *file,
+				struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata)
+{
+	struct inode *inode = mapping->host;
+
+	/* turn off delalloc for this inode */
+	ext4_set_aops(inode, 0);
+
+	return mapping->a_ops->write_begin(file, mapping, pos, len,
+					   flags, pagep, fsdata);
+}
+
 static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
 				loff_t pos, unsigned len, unsigned flags,
 				struct page **pagep, void **fsdata)
@@ -2435,8 +2454,14 @@
 		page_cache_release(page);
 	}
 
-	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
-		goto retry;
+	if (ret == -ENOSPC) {
+		if (ext4_should_retry_alloc(inode->i_sb, &retries))
+			goto retry;
+		else
+			ret = ext4_write_begin_nondelalloc(file, mapping, pos,
+							   len, flags, pagep,
+							   fsdata);
+	}
 out:
 	return ret;
 }
@@ -3008,16 +3033,26 @@
 	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
-void ext4_set_aops(struct inode *inode)
+#define	EXT4_MIN_FREE_BLOCKS	(NR_CPUS*4)
+
+void ext4_set_aops(struct inode *inode, int delalloc)
 {
-	if (ext4_should_order_data(inode) &&
-		test_opt(inode->i_sb, DELALLOC))
-		inode->i_mapping->a_ops = &ext4_da_aops;
-	else if (ext4_should_order_data(inode))
+	if (test_opt(inode->i_sb, DELALLOC)) {
+		if (ext4_has_free_blocks(EXT4_SB(inode->i_sb),
+			 EXT4_MIN_FREE_BLOCKS) < EXT4_MIN_FREE_BLOCKS)
+			delalloc = 0;
+
+		if (delalloc) {
+			inode->i_mapping->a_ops = &ext4_da_aops;
+			return;
+		} else
+			printk(KERN_INFO "filesystem is close to full, "
+				"delayed allocation is turned off for "
+				"inode %lu\n", inode->i_ino);
+	}
+
+	if (ext4_should_order_data(inode))
 		inode->i_mapping->a_ops = &ext4_ordered_aops;
-	else if (ext4_should_writeback_data(inode) &&
-		 test_opt(inode->i_sb, DELALLOC))
-		inode->i_mapping->a_ops = &ext4_da_aops;
 	else if (ext4_should_writeback_data(inode))
 		inode->i_mapping->a_ops = &ext4_writeback_aops;
 	else
@@ -4011,7 +4046,7 @@
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext4_file_inode_operations;
 		inode->i_fop = &ext4_file_operations;
-		ext4_set_aops(inode);
+		ext4_set_aops(inode, 1);
 	} else if (S_ISDIR(inode->i_mode)) {
 		inode->i_op = &ext4_dir_inode_operations;
 		inode->i_fop = &ext4_dir_operations;
@@ -4020,7 +4055,7 @@
 			inode->i_op = &ext4_fast_symlink_inode_operations;
 		else {
 			inode->i_op = &ext4_symlink_inode_operations;
-			ext4_set_aops(inode);
+			ext4_set_aops(inode, 1);
 		}
 	} else {
 		inode->i_op = &ext4_special_inode_operations;
@@ -4783,7 +4818,7 @@
 		EXT4_I(inode)->i_flags |= EXT4_JOURNAL_DATA_FL;
 	else
 		EXT4_I(inode)->i_flags &= ~EXT4_JOURNAL_DATA_FL;
-	ext4_set_aops(inode);
+	ext4_set_aops(inode, 1);
 
 	jbd2_journal_unlock_updates(journal);
 
Index: linux-2.6.27-rc3/fs/ext4/ext4.h
===================================================================
--- linux-2.6.27-rc3.orig/fs/ext4/ext4.h	2008-08-20 15:41:36.000000000 -0700
+++ linux-2.6.27-rc3/fs/ext4/ext4.h	2008-08-20 15:41:56.000000000 -0700
@@ -1070,7 +1070,7 @@
 extern void ext4_truncate (struct inode *);
 extern void ext4_set_inode_flags(struct inode *);
 extern void ext4_get_inode_flags(struct ext4_inode_info *);
-extern void ext4_set_aops(struct inode *inode);
+extern void ext4_set_aops(struct inode *inode, int delalloc);
 extern int ext4_writepage_trans_blocks(struct inode *);
 extern int ext4_meta_trans_blocks(struct inode *, int nrblocks, int idxblocks);
 extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
Index: linux-2.6.27-rc3/fs/ext4/namei.c
===================================================================
--- linux-2.6.27-rc3.orig/fs/ext4/namei.c	2008-08-20 15:42:13.000000000 -0700
+++ linux-2.6.27-rc3/fs/ext4/namei.c	2008-08-20 15:42:41.000000000 -0700
@@ -1738,7 +1738,7 @@
 	if (!IS_ERR(inode)) {
 		inode->i_op = &ext4_file_inode_operations;
 		inode->i_fop = &ext4_file_operations;
-		ext4_set_aops(inode);
+		ext4_set_aops(inode, 1);
 		err = ext4_add_nondir(handle, dentry, inode);
 	}
 	ext4_journal_stop(handle);
@@ -2210,7 +2210,7 @@
 
 	if (l > sizeof (EXT4_I(inode)->i_data)) {
 		inode->i_op = &ext4_symlink_inode_operations;
-		ext4_set_aops(inode);
+		ext4_set_aops(inode, 1);
 		/*
 		 * page_symlink() calls into ext4_prepare/commit_write.
 		 * We have a transaction open.  All is sweetness.  It also sets
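
For readers following the control flow of the patch above, here is a
minimal user-space model of the fallback: try to reserve space in delayed
allocation mode, and on ENOSPC switch the inode to non-delayed allocation
and allocate immediately. All sketch_* names are hypothetical. The fixed
metadata estimate is an assumption made so that a reservation can fail
where a direct allocation still succeeds, and the retry step
(ext4_should_retry_alloc) is left out.

```c
#include <assert.h>
#include <errno.h>

enum sketch_aops { SKETCH_DA_AOPS, SKETCH_NONDA_AOPS };

/* Hypothetical worst-case metadata blocks added to each reservation. */
#define SKETCH_MD_ESTIMATE 4

struct sketch_inode {
	long fs_free_blocks;	/* free blocks left in the toy filesystem */
	long reserved;		/* blocks reserved but not yet allocated */
	long allocated;		/* blocks allocated immediately */
	enum sketch_aops aops;	/* which write path the inode uses */
};

/* Delayed allocation path: only reserve, including the metadata estimate. */
static int sketch_da_write_begin(struct sketch_inode *inode, long nblocks)
{
	long want = nblocks + SKETCH_MD_ESTIMATE;

	if (inode->fs_free_blocks < want)
		return -ENOSPC;
	inode->fs_free_blocks -= want;
	inode->reserved += want;
	return 0;
}

/* Non-delayed path: allocate the data blocks right away. */
static int sketch_nonda_write_begin(struct sketch_inode *inode, long nblocks)
{
	if (inode->fs_free_blocks < nblocks)
		return -ENOSPC;
	inode->fs_free_blocks -= nblocks;
	inode->allocated += nblocks;
	return 0;
}

static int sketch_write_begin(struct sketch_inode *inode, long nblocks)
{
	int ret;

	if (inode->aops == SKETCH_DA_AOPS) {
		ret = sketch_da_write_begin(inode, nblocks);
		if (ret != -ENOSPC)
			return ret;
		/* Reservation failed: turn off delalloc for this inode. */
		inode->aops = SKETCH_NONDA_AOPS;
	}
	return sketch_nonda_write_begin(inode, nblocks);
}
```

In this model the switch is one-way: once an inode has fallen back, later
writes go straight to the non-delayed path, and ENOSPC is returned only
when the direct allocation itself cannot be satisfied.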

