linux-kernel - [RFC PATCH] ext4: destroy inline data immediately when converting to extent

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240208165808.5494-1-lhenriques@suse.de>
Date: Thu,  8 Feb 2024 16:58:07 +0000
From: Luis Henriques <lhenriques@...e.de>
To: "Theodore Y. Ts'o" <tytso@....edu>,
	Andreas Dilger <adilger.kernel@...ger.ca>
Cc: linux-ext4@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	Luis Henriques <lhenriques@...e.de>,
	Daniel Dawson <danielcdawson@...il.com>
Subject: [RFC PATCH] ext4: destroy inline data immediately when converting to extent

When writing to an inode that has inline data and the amount of data written
exceeds the maximum inline data length, that data is un-inlined, i.e. it is
converted into an extent.  However, when delayed allocation is enabled the
destruction of the data is postponed until the data writeback.  This causes
consistency problems.  Here's a very simple test case, run on a filesystem
with delayed allocation and inline data features enabled:

 $ dd if=/dev/zero of=test-file bs=64 count=3 status=none
 $ lsattr test-file
 ------------------N--- test-file

The 'lsattr' command shows that the file has data stored inline.  However,
that is not correct because writing 192 bytes (3 * 64) has forced the data
to be un-inlined.  Doing a 'sync' before running the 'lsattr' fixes it.

Note that this bug doesn't happen if the filesytems is mount using the
'nodelalloc' option.

(There's a similar test case using read() instead in the bugzilla linked
bellow.)

This patch fixes the issue in the delayed allocation path by destroying the
inline data immediately after converting it to an extent instead of delaying
that operation until the writeback.  This is done by invoking function
ext4_destroy_inline_data_nolock(), which is going to clean-up all the
missing data structures, including clearing ĨNLINE_DATA and setting the
EXTENTS inode flags.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=200681
Cc: Daniel Dawson <danielcdawson@...il.com>
Signed-off-by: Luis Henriques <lhenriques@...e.de>
---
Hi!

I'm sending this patch as an RFC because, although I've done a good amount
of testing, I'm still not convinced it is correct.  I.e. there may exist a
good reason for postponing the call to ext4_destroy_inline_data_nolock() and
that I'm failing to see it.  Please let me know what you think.

 fs/ext4/inline.c | 20 ++++++++++----------
 fs/ext4/inode.c  | 18 +-----------------
 2 files changed, 11 insertions(+), 27 deletions(-)

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index d5bd1e3a5d36..e19a176cfc93 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -830,11 +830,12 @@ int ext4_write_inline_data_end(struct inode *inode, loff_t pos, unsigned len,
  *    update and dirty so that ext4_da_writepages can handle it. We don't
  *    need to start the journal since the file's metadata isn't changed now.
  */
-static int ext4_da_convert_inline_data_to_extent(struct address_space *mapping,
+static int ext4_da_convert_inline_data_to_extent(handle_t *handle,
+						 struct address_space *mapping,
 						 struct inode *inode,
 						 void **fsdata)
 {
-	int ret = 0, inline_size;
+	int ret = 0, inline_size, no_expand;
 	struct folio *folio;
 
 	folio = __filemap_get_folio(mapping, 0, FGP_WRITEBEGIN,
@@ -842,7 +843,7 @@ static int ext4_da_convert_inline_data_to_extent(struct address_space *mapping,
 	if (IS_ERR(folio))
 		return PTR_ERR(folio);
 
-	down_read(&EXT4_I(inode)->xattr_sem);
+	ext4_write_lock_xattr(inode, &no_expand);
 	if (!ext4_has_inline_data(inode)) {
 		ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA);
 		goto out;
@@ -859,20 +860,18 @@ static int ext4_da_convert_inline_data_to_extent(struct address_space *mapping,
 	ret = __block_write_begin(&folio->page, 0, inline_size,
 				  ext4_da_get_block_prep);
 	if (ret) {
-		up_read(&EXT4_I(inode)->xattr_sem);
+		ext4_write_unlock_xattr(inode, &no_expand);
 		folio_unlock(folio);
 		folio_put(folio);
 		ext4_truncate_failed_write(inode);
 		return ret;
 	}
 
-	folio_mark_dirty(folio);
-	folio_mark_uptodate(folio);
-	ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA);
+	ret = ext4_destroy_inline_data_nolock(handle, inode);
 	*fsdata = (void *)CONVERT_INLINE_DATA;
 
 out:
-	up_read(&EXT4_I(inode)->xattr_sem);
+	ext4_write_unlock_xattr(inode, &no_expand);
 	if (folio) {
 		folio_unlock(folio);
 		folio_put(folio);
@@ -916,10 +915,11 @@ int ext4_da_write_inline_data_begin(struct address_space *mapping,
 		goto out_journal;
 
 	if (ret == -ENOSPC) {
-		ext4_journal_stop(handle);
-		ret = ext4_da_convert_inline_data_to_extent(mapping,
+		ret = ext4_da_convert_inline_data_to_extent(handle,
+							    mapping,
 							    inode,
 							    fsdata);
+		ext4_journal_stop(handle);
 		if (ret == -ENOSPC &&
 		    ext4_should_retry_alloc(inode->i_sb, &retries))
 			goto retry_journal;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2ccf3b5e3a7c..43fa930fafa0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2548,23 +2548,7 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
 		goto out_writepages;
 	}
 
-	/*
-	 * If we have inline data and arrive here, it means that
-	 * we will soon create the block for the 1st page, so
-	 * we'd better clear the inline data here.
-	 */
-	if (ext4_has_inline_data(inode)) {
-		/* Just inode will be modified... */
-		handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
-		if (IS_ERR(handle)) {
-			ret = PTR_ERR(handle);
-			goto out_writepages;
-		}
-		BUG_ON(ext4_test_inode_state(inode,
-				EXT4_STATE_MAY_INLINE_DATA));
-		ext4_destroy_inline_data(handle, inode);
-		ext4_journal_stop(handle);
-	}
+	WARN_ON_ONCE(ext4_has_inline_data(inode));
 
 	/*
 	 * data=journal mode does not do delalloc so we just need to writeout /