linux-kernel - Re: [PATCH 2/3] ext4: truncate complete range in pagecache before calling ext4_zero_partial

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ZRb4BX8bhhyWEari@li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com>
Date:   Fri, 29 Sep 2023 21:45:29 +0530
From:   Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To:     linux-ext4@...r.kernel.org, "Theodore Ts'o" <tytso@....edu>
Cc:     Ritesh Harjani <ritesh.list@...il.com>,
        linux-kernel@...r.kernel.org, Jan Kara <jack@...e.cz>
Subject: Re: [PATCH 2/3] ext4: truncate complete range in pagecache before
 calling ext4_zero_partial_blocks()

On Fri, Sep 29, 2023 at 07:40:44PM +0530, Ojaswin Mujoo wrote:
> In ext4_zero_range() and ext4_punch_hole(), the range passed could be unaligned
> however we only zero out the pagecache range that is block aligned. These
> functions are relying on ext4_zero_partial_blocks() ->
> __ext4_block_zero_page_range() to take care of zeroing the unaligned edges in
> the pageacache. However, the right thing to do is to properly zero out the whole
> range in these functions before and not rely on a different function to do it
> for us. Hence, modify ext4_zero_range() and ext4_punch_hole() to zero the
> complete range.
> 
> This will also allow us to now exit early for unwritten buffer heads in
> __ext4_block_zero_page_range(), in upcoming patch.
> 
> Signed-off-by: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
> ---
>  fs/ext4/extents.c | 17 +++++++++++------
>  fs/ext4/inode.c   |  3 +--
>  2 files changed, 12 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index c79b4c25afc4..2dc681cab6a5 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4582,9 +4582,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>  
>  	/* Zero range excluding the unaligned edges */
>  	if (max_blocks > 0) {
> -		flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN |
> -			  EXT4_EX_NOCACHE);
> -
>  		/*
>  		 * Prevent page faults from reinstantiating pages we have
>  		 * released from page cache.
> @@ -4609,17 +4606,25 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>  		 * disk in case of crash before zeroing trans is committed.
>  		 */
>  		if (ext4_should_journal_data(inode)) {
> -			ret = filemap_write_and_wait_range(mapping, start, end - 1);
> +			ret = filemap_write_and_wait_range(mapping, start,
> +							   end - 1);

I think this accidentally creeped in, will fix it in next rev.

Anyways, I had some questions that might be unrelated to this patch,
I'll add them inline:

>  			if (ret) {
>  				filemap_invalidate_unlock(mapping);
>  				goto out_mutex;
>  			}
>  		}
> +	}

So the above if (max_blocks) {...} block runs when the range spans
multiple blocks but I think the filemap_write_and_wait_range() and
ext4_update_disksize_before_punch() should be called when we are actually
spanning multiple pages, since the disksize not updating issue and the 
truncate racing with checkpoint only happen when the complete page is
truncated. Is this understanding correct? 

> +
> +	/*
> +	 * Now truncate the pagecache and zero out non page aligned edges of the
> +	 * range (if any)
> +	 */
> +	truncate_pagecache_range(inode, offset, offset + len - 1);
>  
> -		/* Now release the pages and zero block aligned part of pages */
> -		truncate_pagecache_range(inode, start, end - 1);
> +	if (max_blocks > 0) {
>  		inode->i_mtime = inode->i_ctime = current_time(inode);
>  
> +		flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN | EXT4_EX_NOCACHE);
>  		ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
>  					     flags);
>  		filemap_invalidate_unlock(mapping);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 6c490f05e2ba..de8ea8430d30 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3974,9 +3974,8 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>  		ret = ext4_update_disksize_before_punch(inode, offset, length);

In this function ext4_punch_hole() I see that we call
filemap_write_and_wait_range() and then take the inode_lock() later.
Doesn't this leave a window for the pages to get dirty again? 

For example, in ext4_zero_range(), we checkpoint using
filemap_write_and_wait_range() in case of data=journal under
inode_lock() but that's not the case here. Just wondering if this 
or any other code path might still race here? 

Regards,
ojaswin

>  		if (ret)
>  			goto out_dio;
> -		truncate_pagecache_range(inode, first_block_offset,
> -					 last_block_offset);
>  	}
> +	truncate_pagecache_range(inode, offset, offset + length - 1);
>  
>  	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>  		credits = ext4_writepage_trans_blocks(inode);
> -- 
> 2.39.3
>