Message-ID: <Z2KgY1FzZRIKAW3U@li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com>
Date: Wed, 18 Dec 2024 15:43:55 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: Zhang Yi <yi.zhang@...weicloud.com>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
jack@...e.cz, yi.zhang@...wei.com, chengzhihao1@...wei.com,
yukuai3@...wei.com, yangerkun@...wei.com
Subject: Re: [PATCH v4 03/10] ext4: don't write back data before punch hole
in nojournal mode
On Wed, Dec 18, 2024 at 03:10:36PM +0800, Zhang Yi wrote:
> On 2024/12/17 22:31, Ojaswin Mujoo wrote:
> > On Mon, Dec 16, 2024 at 09:39:08AM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@...wei.com>
> >>
> > >> There is no need to write back all data before punching a hole in
> > >> non-journaled mode, since the data will be dropped anyway once the
> > >> space is removed. Therefore, the call to filemap_write_and_wait_range()
> > >> can be eliminated.
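To make the operation concrete: from userspace, a punch-hole request is fallocate() with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE. Afterwards the punched range reads back as zeroes and i_size is unchanged, which is why writing that range back first is wasted I/O in nojournal mode. The sketch below is illustrative only: it uses a memfd (shmem) so it runs without an ext4 mount, assumes a 4096-byte block, and punch_block() is a hypothetical helper name. On ext4 the same fallocate() mode is handled by ext4_punch_hole().

```c
#define _GNU_SOURCE             /* fallocate(), FALLOC_FL_*, memfd_create() */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define DEMO_BLK  4096          /* assumed block size for this demo */
#define DEMO_NBLK 3             /* file length in blocks */

/*
 * Hypothetical helper: write DEMO_NBLK blocks of 'A', punch out block `idx`,
 * and return how many zero bytes the punched block now holds (DEMO_BLK on
 * success). Returns -1 if a syscall fails, the file size changes, or data
 * outside the hole is disturbed.
 */
long punch_block(size_t idx)
{
	static char buf[DEMO_NBLK * DEMO_BLK];
	struct stat st;
	long zeros = 0;
	int fd;

	assert(idx < DEMO_NBLK);

	/* memfd (shmem) so the sketch does not need an ext4 mount. */
	fd = memfd_create("punch-demo", MFD_CLOEXEC);
	if (fd < 0)
		return -1;

	memset(buf, 'A', sizeof(buf));
	if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
		goto fail;

	/* PUNCH_HOLE must be combined with KEEP_SIZE: i_size stays the same. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      (off_t)(idx * DEMO_BLK), DEMO_BLK) < 0)
		goto fail;

	if (fstat(fd, &st) < 0 || st.st_size != (off_t)sizeof(buf))
		goto fail;

	/* The hole reads back as zeroes; every other byte is still 'A'. */
	if (pread(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
		goto fail;
	for (size_t i = 0; i < sizeof(buf); i++) {
		char want = (i / DEMO_BLK == idx) ? 0 : 'A';

		if (buf[i] != want)
			goto fail;
		if (want == 0)
			zeros++;
	}
	close(fd);
	return zeros;
fail:
	close(fd);
	return -1;
}
```

Calling punch_block(1) should return 4096: the middle block now reads as zeroes, the file is still 12288 bytes long, and the neighbouring blocks are intact.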
> >
> > Hi, sorry I'm a bit late to this. However, following the discussion
> > here [1], I believe the initial concern in PATCH v1 01/10 was that,
> > after truncating the page cache, the ext4_alloc_file_blocks() call
> > might fail with errors like EIO or ENOMEM, leading to inconsistent
> > data.
> >
> > Is my understanding correct that we realised these are very rare
> > cases and not worth the performance penalty of writeback? In that
> > case, is it really okay to just let the scope for corruption exist,
> > even though it's rare? There might be other error cases we are
> > missing that are easier to hit. For example, I think
> > ext4_alloc_file_blocks() can also fail with ENOSPC when a
> > written-to-unwritten extent conversion splits an extent and needs a
> > new extent tree node. (Maybe this can be avoided by using PRE_IO with
> > EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT in the first ext4_alloc_file_blocks()
> > call.)
> >
> > So does it make sense to retain the writeback behavior, or am I just
> > being paranoid? :)
> >
>
> Hi, Ojaswin!
>
> Yeah, from my point of view, ENOSPC could happen, and it may be more
> likely to happen if we intentionally create conditions for it. However,
> all the efforts we can make at this point are merely best-effort and
> only reduce the probability. We cannot 100% guarantee it will not
> happen, even if we write back the whole range before manipulating
> extents and blocks, because we do not accurately reserve space for
> extent splits.

Right, rechecking the ext4_map_blocks() code, it seems we can also fail
after the unwritten extents have been successfully allocated, so either
way we can't be sure that we'll retain the old data on failure, even
with writeback.

> Additionally, in ext4_punch_hole(), we use the 'nofail' flag while
> freeing blocks to reduce the possibility of ENOSPC. So I suppose it's
> fine for now, but we may need to implement additional measures if we
> truly want to resolve the issue completely.

Sure, I agree that we should ideally have something more robust to
handle these edge cases. For now, this change looks good.
Feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
>
> Thanks,
> Yi.
>
> >
> >> Besides, similar to ext4_zero_range(), we must address the case of
> >> partially punched folios when block size < page size. It is essential to
> >> remove writable userspace mappings to ensure that the folio can be
> >> faulted again during subsequent mmap write access.
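The partial-folio case above comes down to alignment arithmetic. A minimal model, assuming (as in ext4_punch_hole()) that first_block_offset rounds the punch start up to a block boundary and last_block_offset is the punch end rounded down to a block boundary, minus one. The `+ 1` passed to ext4_truncate_page_cache_block_range() in the hunk below appears to turn that inclusive last byte into an exclusive end, so the sketch stores the exclusive form as block_end. It also assumes a single fixed folio size; struct punch_range and compute_punch_range() are hypothetical names, not ext4 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the block-aligned range dropped from the page cache. */
struct punch_range {
	bool     has_full_blocks;     /* at least one whole block is punched    */
	uint64_t first_block_offset;  /* punch start rounded up to a block      */
	uint64_t block_end;           /* exclusive end == last_block_offset + 1 */
	bool     head_partial_folio;  /* range starts in the middle of a folio  */
	bool     tail_partial_folio;  /* range ends in the middle of a folio    */
};

static uint64_t round_up_u64(uint64_t x, uint64_t a)
{
	return (x + a - 1) / a * a;
}

static uint64_t round_down_u64(uint64_t x, uint64_t a)
{
	return x / a * a;
}

struct punch_range compute_punch_range(uint64_t offset, uint64_t length,
				       uint64_t blksz, uint64_t folio_sz)
{
	struct punch_range r = { 0 };

	assert(blksz > 0 && folio_sz > 0);

	r.first_block_offset = round_up_u64(offset, blksz);
	/* Round the exclusive end down; the inclusive form is block_end - 1. */
	r.block_end = round_down_u64(offset + length, blksz);
	r.has_full_blocks = r.block_end > r.first_block_offset;
	if (!r.has_full_blocks)
		return r;       /* only partial blocks, nothing dropped wholesale */

	/* With block size < folio size, either edge can land mid-folio. */
	r.head_partial_folio = r.first_block_offset % folio_sz != 0;
	r.tail_partial_folio = r.block_end % folio_sz != 0;
	return r;
}
```

For example, with 1 KiB blocks and 4 KiB folios, punching offset 1000 and length 5000 gives first_block_offset 1024 and block_end 5120 (last_block_offset 5119). Both edge folios are only partly covered, which is the situation where writable mappings must be removed so a later mmap write faults again. The sub-block bytes at either edge (1000..1023 and 5120..5999) fall outside this range; in ext4 they are zeroed rather than freed.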
> >>
> > >> In journaled mode, we still need to write out dirty pages before
> > >> discarding the page cache; otherwise, a crash before the transaction
> > >> that frees the data blocks is committed could expose old, stale data,
> > >> even if synchronization has been performed.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@...wei.com>
> >> ---
> >> fs/ext4/inode.c | 18 +++++-------------
> >> 1 file changed, 5 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index bf735d06b621..a5ba2b71d508 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -4018,17 +4018,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> >>
> > >>  	trace_ext4_punch_hole(inode, offset, length, 0);
> > >>
> > >> -	/*
> > >> -	 * Write out all dirty pages to avoid race conditions
> > >> -	 * Then release them.
> > >> -	 */
> > >> -	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> > >> -		ret = filemap_write_and_wait_range(mapping, offset,
> > >> -						   offset + length - 1);
> > >> -		if (ret)
> > >> -			return ret;
> > >> -	}
> > >> -
> > >>  	inode_lock(inode);
> > >>
> > >>  	/* No need to punch hole beyond i_size */
> > >> @@ -4090,8 +4079,11 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> > >>  		ret = ext4_update_disksize_before_punch(inode, offset, length);
> > >>  		if (ret)
> > >>  			goto out_dio;
> > >> -		truncate_pagecache_range(inode, first_block_offset,
> > >> -					 last_block_offset);
> > >> +
> > >> +		ret = ext4_truncate_page_cache_block_range(inode,
> > >> +				first_block_offset, last_block_offset + 1);
> > >> +		if (ret)
> > >> +			goto out_dio;
> > >>  	}
> > >>
> > >>  	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> >> --
> >> 2.46.1
> >>
>