Message-ID: <ZUJydhIfHauMZ78p@li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com>
Date: Wed, 1 Nov 2023 21:15:26 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: Jan Kara <jack@...e.cz>
Cc: linux-ext4@...r.kernel.org, "Theodore Ts'o" <tytso@....edu>,
Ritesh Harjani <ritesh.list@...il.com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/3] ext4: truncate complete range in pagecache before
calling ext4_zero_partial_blocks()
On Thu, Oct 19, 2023 at 06:55:46PM +0200, Jan Kara wrote:
> On Fri 29-09-23 21:45:29, Ojaswin Mujoo wrote:
> > On Fri, Sep 29, 2023 at 07:40:44PM +0530, Ojaswin Mujoo wrote:
> > > In ext4_zero_range() and ext4_punch_hole(), the range passed in could be
> > > unaligned, but we only zero out the pagecache range that is block aligned. These
> > > functions are relying on ext4_zero_partial_blocks() ->
> > > __ext4_block_zero_page_range() to take care of zeroing the unaligned edges in
> > > the pagecache. However, the right thing to do is to properly zero out the whole
> > > range in these functions beforehand and not rely on a different function to do it
> > > for us. Hence, modify ext4_zero_range() and ext4_punch_hole() to zero the
> > > complete range.
> > >
> > > This will also allow us to exit early for unwritten buffer heads in
> > > __ext4_block_zero_page_range() in an upcoming patch.
> > >
> > > Signed-off-by: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
> > > ---
> > > fs/ext4/extents.c | 17 +++++++++++------
> > > fs/ext4/inode.c | 3 +--
> > > 2 files changed, 12 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > > index c79b4c25afc4..2dc681cab6a5 100644
> > > --- a/fs/ext4/extents.c
> > > +++ b/fs/ext4/extents.c
> > > @@ -4582,9 +4582,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
> > >
> > > /* Zero range excluding the unaligned edges */
> > > if (max_blocks > 0) {
> > > - flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN |
> > > - EXT4_EX_NOCACHE);
> > > -
> > > /*
> > > * Prevent page faults from reinstantiating pages we have
> > > * released from page cache.
> > > @@ -4609,17 +4606,25 @@ static long ext4_zero_range(struct file *file, loff_t offset,
> > > * disk in case of crash before zeroing trans is committed.
> > > */
> > > if (ext4_should_journal_data(inode)) {
> > > - ret = filemap_write_and_wait_range(mapping, start, end - 1);
> > > + ret = filemap_write_and_wait_range(mapping, start,
> > > + end - 1);
> >
> > I think this accidentally crept in, will fix it in the next rev.
>
> Yeah, just pull it in patch 1.
>
> > > if (ret) {
> > > filemap_invalidate_unlock(mapping);
> > > goto out_mutex;
> > > }
> > > }
> > > + }
> >
> > So the above if (max_blocks) {...} block runs when the range spans
> > multiple blocks, but I think filemap_write_and_wait_range() and
> > ext4_update_disksize_before_punch() should only be called when we are
> > actually spanning multiple pages, since the issue of the disksize not
> > being updated and the truncate racing with the checkpoint only happen
> > when a complete page is truncated. Is this understanding correct?
>
> Why do you think the issues apply only to multiple pages? I mean even if a
> single block is dirty in memory, it may be pushing i_disksize or carrying
> journalled data we need to commit.
Hey Jan,
You are right, I think I was misunderstanding this code a bit, thinking
that these things would only be needed if the complete folio is absent.
Upon rechecking paths like the writeback path, I can now see that if the
blocks till i_size are already allocated (e.g. through ext4_zero_range()),
we won't actually end up calling mpage_map_and_submit_extent(), which is
where the disksize gets updated.
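Just to spell out my mental model of that path (this is only a rough
paraphrase of my reading, not the actual code; the names are the ones in
fs/ext4/inode.c):

    ext4_writepages()
        -> mpage_prepare_extent_to_map()   /* collect dirty pages */
        -> mpage_map_and_submit_extent()   /* entered only when there are
                                              unwritten/delalloc blocks left
                                              to map; this is where i_disksize
                                              gets pushed forward via
                                              ext4_update_i_disksize() */

so when the blocks till i_size are already allocated, that second step is
skipped and writeback never moves i_disksize, as you pointed out.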
>
> > > + /*
> > > + * Now truncate the pagecache and zero out non page aligned edges of the
> > > + * range (if any)
> > > + */
> > > + truncate_pagecache_range(inode, offset, offset + len - 1);
> > >
> > > - /* Now release the pages and zero block aligned part of pages */
> > > - truncate_pagecache_range(inode, start, end - 1);
> > > + if (max_blocks > 0) {
> > > inode->i_mtime = inode->i_ctime = current_time(inode);
> > >
> > > + flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN | EXT4_EX_NOCACHE);
> > > ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
> > > flags);
> > > filemap_invalidate_unlock(mapping);
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index 6c490f05e2ba..de8ea8430d30 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -3974,9 +3974,8 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> > > ret = ext4_update_disksize_before_punch(inode, offset, length);
> >
> > In this function ext4_punch_hole() I see that we call
> > filemap_write_and_wait_range() and then take the inode_lock() later.
> > Doesn't this leave a window for the pages to get dirty again?
>
> There's definitely a race window but I think the call to
> filemap_write_and_wait_range() is a leftover from the past when hole
> punching could race in a nasty way. These days we have invalidate_lock so I
> *think* we can just remove that filemap_write_and_wait_range() call. At
> least I don't see a good reason for it now because the pages are going away
> anyway. But it needs testing :).
>
> > For example, in ext4_zero_range(), we checkpoint using
> > filemap_write_and_wait_range() in case of data=journal under
> > inode_lock() but that's not the case here. Just wondering if this
> > or any other code path might still race here?
>
> Well, that's a bit different story as the comment there explains. And yes,
> invalidate_lock protects us from possible races there.
Ahh okay, got it.
>
> > > if (ret)
> > > goto out_dio;
> > > - truncate_pagecache_range(inode, first_block_offset,
> > > - last_block_offset);
> > > }
> > > + truncate_pagecache_range(inode, offset, offset + length - 1);
>
> But I have realized that changes done in this patch actually don't help
> with changing ext4_zero_partial_blocks() because as soon as we drop
> invalidate_lock, a page fault can come in and modify contents of partial
> pages we need zeroed.
>
> So thinking about this again with a fresh mind, these block vs pagecache
> consistency issues probably aren't worth it and the current code flow is good
> enough. Sorry for misleading you. We might just add a comment to
> __ext4_block_zero_page_range() to explain that buffer_unwritten() buffers
> can get there but they should be already zeroed-out and uptodate and we do
> need to process them because of page cache zeroing. What do you think?
Oh right, I was not thinking about the mmap path, sorry about that. In
that case I think your point makes sense, let's just leave this be for now.
I'll send a v2 with the first patch of the series and also add a comment
as you suggested.
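Something along these lines perhaps (just a rough draft based on your
wording above, I'll polish the exact text in v2):

	/*
	 * buffer_unwritten() buffers can reach this point: the blocks
	 * backing them are allocated as unwritten, so they are already
	 * effectively zeroed out on disk and the page is uptodate. We
	 * still need to process them here so that the partial block in
	 * the page cache gets zeroed as well.
	 */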
Thanks for the review and taking the time to answer my questions!
Regards,
ojaswin
>
> Honza
> --
> Jan Kara <jack@...e.com>
> SUSE Labs, CR