linux-ext4 - Re: [PATCH 03/27] ext4: don't write back data before punch hole in nojournal mode

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20241118231521.GA9417@frogsfrogsfrogs>
Date: Mon, 18 Nov 2024 15:15:21 -0800
From: "Darrick J. Wong" <djwong@...nel.org>
To: Zhang Yi <yi.zhang@...weicloud.com>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org, tytso@....edu,
	adilger.kernel@...ger.ca, jack@...e.cz, ritesh.list@...il.com,
	hch@...radead.org, david@...morbit.com, zokeefe@...gle.com,
	yi.zhang@...wei.com, chengzhihao1@...wei.com, yukuai3@...wei.com,
	yangerkun@...wei.com
Subject: Re: [PATCH 03/27] ext4: don't write back data before punch hole in
 nojournal mode

On Tue, Oct 22, 2024 at 07:10:34PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@...wei.com>
> 
> There is no need to write back all data before punching a hole in
> data=ordered|writeback mode since it will be dropped soon after removing
> space, so just remove the filemap_write_and_wait_range() in these modes.
> However, in data=journal mode, we need to write dirty pages out before
> discarding page cache in case of crash before committing the freeing
> data transaction, which could expose old, stale data.

Can't the same thing happen with non-journaled data writes?

Say you write 1GB of "A"s to a file and fsync.  Then you write "B"s to
the same 1GB of file and immediately start punching it.  If the system
reboots before the mapping updates all get written to disk, won't you
risk seeing some of those "A" because we no longer flush the "B"s?

Also, since the program didn't explicitly fsync the Bs, why bother
flushing the dirty data at all?  Are data=journal writes supposed to be
synchronous flushing writes nowadays?

--D

> Signed-off-by: Zhang Yi <yi.zhang@...wei.com>
> ---
>  fs/ext4/inode.c | 26 +++++++++++++++-----------
>  1 file changed, 15 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f8796f7b0f94..94b923afcd9c 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3965,17 +3965,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>  
>  	trace_ext4_punch_hole(inode, offset, length, 0);
>  
> -	/*
> -	 * Write out all dirty pages to avoid race conditions
> -	 * Then release them.
> -	 */
> -	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> -		ret = filemap_write_and_wait_range(mapping, offset,
> -						   offset + length - 1);
> -		if (ret)
> -			return ret;
> -	}
> -
>  	inode_lock(inode);
>  
>  	/* No need to punch hole beyond i_size */
> @@ -4037,6 +4026,21 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>  		ret = ext4_update_disksize_before_punch(inode, offset, length);
>  		if (ret)
>  			goto out_dio;
> +
> +		/*
> +		 * For journalled data we need to write (and checkpoint) pages
> +		 * before discarding page cache to avoid inconsitent data on
> +		 * disk in case of crash before punching trans is committed.
> +		 */
> +		if (ext4_should_journal_data(inode)) {
> +			ret = filemap_write_and_wait_range(mapping,
> +					first_block_offset, last_block_offset);
> +			if (ret)
> +				goto out_dio;
> +		}
> +
> +		ext4_truncate_folios_range(inode, first_block_offset,
> +					   last_block_offset + 1);
>  		truncate_pagecache_range(inode, first_block_offset,
>  					 last_block_offset);
>  	}
> -- 
> 2.46.1
> 
>