linux-kernel - Re: [PATCH v4 03/10] ext4: don't write back data before punch hole in nojournal mode

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Z2KgY1FzZRIKAW3U@li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com>
Date: Wed, 18 Dec 2024 15:43:55 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: Zhang Yi <yi.zhang@...weicloud.com>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
        jack@...e.cz, yi.zhang@...wei.com, chengzhihao1@...wei.com,
        yukuai3@...wei.com, yangerkun@...wei.com
Subject: Re: [PATCH v4 03/10] ext4: don't write back data before punch hole
 in nojournal mode

On Wed, Dec 18, 2024 at 03:10:36PM +0800, Zhang Yi wrote:
> On 2024/12/17 22:31, Ojaswin Mujoo wrote:
> > On Mon, Dec 16, 2024 at 09:39:08AM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@...wei.com>
> >>
> >> There is no need to write back all data before punching a hole in
> >> non-journaled mode since it will be dropped soon after removing space.
> >> Therefore, the call to filemap_write_and_wait_range() can be eliminated.
> > 
> > Hi, sorry I'm a bit late to this however following the discussion here
> > [1], I believe the initial concern was that we don't in PATCH v1 01/10 
> > was that after truncating the pagecache, the ext4_alloc_file_blocks()
> > call might fail with errors like EIO, ENOMEM etc leading to inconsistent
> > data. 
> > 
> > Is my understanding correct that  we realised that these are very rare
> > cases and are not worth the performance penalty of writeback? In which
> > case, is it really okay to just let the scope for corruption exist even
> > though its rare. There might be some other error cases we might be
> > missing which might be more easier to hit. For eg I think we can also
> > fail ext4_alloc_file_blocks() with ENOSPC in case there is a written to
> > unwritten extent conversion causing an extent split leading to  extent
> > tree node allocation. (Maybe can be avoided by using PRE_IO with
> > EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT in the first ext4_alloc_file_blocks() call)
> > 
> > So does it make sense to retain the writeback behavior or am I just
> > being paranoid :) 
> > 
> 
> Hi, Ojaswin!
> 
> Yeah, from my point of view, ENOSPC could happen, and it may be more
> likely to happen if we intentionally create conditions for it. However,
> all the efforts we can make at this point are merely best efforts and
> reduce the probability. We cannot 100% guarantee it will not happen,
> even if we write back the whole range before manipulating extents and
> blocks. This is because we do not accurately reserve space for extents
> split. Additionally, In ext4_punch_hole(), we have used 'nofail' flag

Right, rechecking the ext4_map_blocks code, seems like we can also result
in a failure after unwrit extents have successfully been allocated so 
either ways we can't be sure that we'll retain old data on failure even
with writeback. 

> while freeing blocks to reduce the possibility of ENOSPC. So I suppose
> it's fine by now, but we may need to implement additional measures if
> we truly want to resolve the issue completely.

Sure I agree that in that case we should ideally have something more
robust to handle these edge cases. For now, this change looks good.

Feel free to add:

Reviewed-by: Ojaswin Mujoo <ojaswin@...ux.ibm.com>

> 
> Thanks,
> Yi.
> 
> > 
> >> Besides, similar to ext4_zero_range(), we must address the case of
> >> partially punched folios when block size < page size. It is essential to
> >> remove writable userspace mappings to ensure that the folio can be
> >> faulted again during subsequent mmap write access.
> >>
> >> In journaled mode, we need to write dirty pages out before discarding
> >> page cache in case of crash before committing the freeing data
> >> transaction, which could expose old, stale data, even if synchronization
> >> has been performed.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@...wei.com>
> >> ---
> >>  fs/ext4/inode.c | 18 +++++-------------
> >>  1 file changed, 5 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index bf735d06b621..a5ba2b71d508 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -4018,17 +4018,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> >>  
> >>  	trace_ext4_punch_hole(inode, offset, length, 0);
> >>  
> >> -	/*
> >> -	 * Write out all dirty pages to avoid race conditions
> >> -	 * Then release them.
> >> -	 */
> >> -	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> >> -		ret = filemap_write_and_wait_range(mapping, offset,
> >> -						   offset + length - 1);
> >> -		if (ret)
> >> -			return ret;
> >> -	}
> >> -
> >>  	inode_lock(inode);
> >>  
> >>  	/* No need to punch hole beyond i_size */
> >> @@ -4090,8 +4079,11 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> >>  		ret = ext4_update_disksize_before_punch(inode, offset, length);
> >>  		if (ret)
> >>  			goto out_dio;
> >> -		truncate_pagecache_range(inode, first_block_offset,
> >> -					 last_block_offset);
> >> +
> >> +		ret = ext4_truncate_page_cache_block_range(inode,
> >> +				first_block_offset, last_block_offset + 1);
> >> +		if (ret)
> >> +			goto out_dio;
> >>  	}
> >>  
> >>  	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> >> -- 
> >> 2.46.1
> >>
>