[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <Z2PJHssCBqVDDU7w@li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com>
Date: Thu, 19 Dec 2024 12:49:58 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: Zhang Yi <yi.zhang@...weicloud.com>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
jack@...e.cz, yi.zhang@...wei.com, chengzhihao1@...wei.com,
yukuai3@...wei.com, yangerkun@...wei.com
Subject: Re: [PATCH v4 01/10] ext4: remove writable userspace mappings before
truncating page cache
On Wed, Dec 18, 2024 at 09:02:18PM +0800, Zhang Yi wrote:
> On 2024/12/18 17:56, Ojaswin Mujoo wrote:
> > On Mon, Dec 16, 2024 at 09:39:06AM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@...wei.com>
> >>
> >> When zeroing a range of folios on the filesystem which block size is
> >> less than the page size, the file's mapped blocks within one page will
> >> be marked as unwritten, we should remove writable userspace mappings to
> >> ensure that ext4_page_mkwrite() can be called during subsequent write
> >> access to these partial folios. Otherwise, data written by subsequent
> >> mmap writes may not be saved to disk.
> >>
> >> $mkfs.ext4 -b 1024 /dev/vdb
> >> $mount /dev/vdb /mnt
> >> $xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \
> >> -c "mwrite -S 0x5a 2048 2048" -c "fzero 2048 2048" \
> >> -c "mwrite -S 0x59 2048 2048" -c "close" /mnt/foo
> >>
> >> $od -Ax -t x1z /mnt/foo
> >> 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
> >> *
> >> 000800 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59
> >> *
> >> 001000
> >>
> >> $umount /mnt && mount /dev/vdb /mnt
> >> $od -Ax -t x1z /mnt/foo
> >> 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
> >> *
> >> 000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> *
> >> 001000
> >>
> >> Fix this by introducing ext4_truncate_page_cache_block_range() to remove
> >> writable userspace mappings when truncating a partial folio range.
> >> Additionally, move the journal data mode-specific handlers and
> >> truncate_pagecache_range() into this function, allowing it to serve as a
> >> common helper that correctly manages the page cache in preparation for
> >> block range manipulations.
> >
> > Hi Zhang,
> >
> > Thanks for the fix, just to confirm my understanding, the issue arises
> > because of the following flow:
> >
> > 1. page_mkwrite() makes folio dirty when we write to the mmap'd region
> >
> > 2. ext4_zero_range (2kb to 4kb)
> > truncate_pagecache_range
> > truncate_inode_pages_range
> > truncate_inode_partial_folio
> > folio_zero_range (2kb to 4kb)
> > folio_invalidate
> > ext4_invalidate_folio
> > block_invalidate_folio -> clear the bh dirty bit
> >
> > 3. mwrite (2kb to 4kb): Again we write in pagecache but the bh is not
> > dirty hence after a remount the data is not seen on disk
> >
> > Also, we won't see this issue if we are zeroing a page aligned range
> > since we end up unmapping the pages from the proccess address space in
> > that case. Correct?
>
> Thank you for review! Yes, it's correct.
>
> >
> > I have also tested the patch in PowerPC with 64k pagesize and 4k blocks
> > size and can confirm that it fixes the data loss issue. That being said,
> > I have a few minor comments on the patch below:
> >
>
> Thank you for the test.
>
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@...wei.com>
> >> ---
> >> fs/ext4/ext4.h | 2 ++
> >> fs/ext4/extents.c | 19 ++++-----------
> >> fs/ext4/inode.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++
> >> 3 files changed, 69 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >> index 74f2071189b2..8843929b46ce 100644
> >> --- a/fs/ext4/ext4.h
> >> +++ b/fs/ext4/ext4.h
> >> @@ -3016,6 +3016,8 @@ extern int ext4_inode_attach_jinode(struct inode *inode);
> >> extern int ext4_can_truncate(struct inode *inode);
> >> extern int ext4_truncate(struct inode *);
> >> extern int ext4_break_layouts(struct inode *);
> >> +extern int ext4_truncate_page_cache_block_range(struct inode *inode,
> >> + loff_t start, loff_t end);
> >> extern int ext4_punch_hole(struct file *file, loff_t offset, loff_t length);
> >> extern void ext4_set_inode_flags(struct inode *, bool init);
> >> extern int ext4_alloc_da_blocks(struct inode *inode);
> >> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> >> index a07a98a4b97a..8dc6b4271b15 100644
> >> --- a/fs/ext4/extents.c
> >> +++ b/fs/ext4/extents.c
> >> @@ -4667,22 +4667,13 @@ static long ext4_zero_range(struct file *file, loff_t offset,
> >> goto out_mutex;
> >> }
> >>
> >> - /*
> >> - * For journalled data we need to write (and checkpoint) pages
> >> - * before discarding page cache to avoid inconsitent data on
> >> - * disk in case of crash before zeroing trans is committed.
> >> - */
> >> - if (ext4_should_journal_data(inode)) {
> >> - ret = filemap_write_and_wait_range(mapping, start,
> >> - end - 1);
> >> - if (ret) {
> >> - filemap_invalidate_unlock(mapping);
> >> - goto out_mutex;
> >> - }
> >> + /* Now release the pages and zero block aligned part of pages */
> >> + ret = ext4_truncate_page_cache_block_range(inode, start, end);
> >> + if (ret) {
> >> + filemap_invalidate_unlock(mapping);
> >> + goto out_mutex;
> >> }
> >>
> >> - /* Now release the pages and zero block aligned part of pages */
> >> - truncate_pagecache_range(inode, start, end - 1);
> >> inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
> >>
> >> ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index 89aade6f45f6..c68a8b841148 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -31,6 +31,7 @@
> >> #include <linux/writeback.h>
> >> #include <linux/pagevec.h>
> >> #include <linux/mpage.h>
> >> +#include <linux/rmap.h>
> >> #include <linux/namei.h>
> >> #include <linux/uio.h>
> >> #include <linux/bio.h>
> >> @@ -3902,6 +3903,67 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
> >> return ret;
> >> }
> >>
> >> +static inline void ext4_truncate_folio(struct inode *inode,
> >> + loff_t start, loff_t end)
> >> +{
> >> + unsigned long blocksize = i_blocksize(inode);
> >> + struct folio *folio;
> >> +
> >> + /* Nothing to be done if no complete block needs to be truncated. */
> >> + if (round_up(start, blocksize) >= round_down(end, blocksize))
> >> + return;
> >> +
> >> + folio = filemap_lock_folio(inode->i_mapping, start >> PAGE_SHIFT);
> >> + if (IS_ERR(folio))
> >> + return;
> >> +
> >> + if (folio_mkclean(folio))
> >> + folio_mark_dirty(folio);
> >> + folio_unlock(folio);
> >> + folio_put(folio);
> >> +}
> >> +
> >> +int ext4_truncate_page_cache_block_range(struct inode *inode,
> >> + loff_t start, loff_t end)
> >> +{
> >> + unsigned long blocksize = i_blocksize(inode);
> >> + int ret;
> >> +
> >> + /*
> >> + * For journalled data we need to write (and checkpoint) pages
> >> + * before discarding page cache to avoid inconsitent data on disk
> >> + * in case of crash before freeing or unwritten converting trans
> >> + * is committed.
> >> + */
> >> + if (ext4_should_journal_data(inode)) {
> >> + ret = filemap_write_and_wait_range(inode->i_mapping, start,
> >> + end - 1);
> >> + if (ret)
> >> + return ret;
> >> + goto truncate_pagecache;
> >> + }
> >> +
> >> + /*
> >> + * If the block size is less than the page size, the file's mapped
> >> + * blocks within one page could be freed or converted to unwritten.
> >> + * So it's necessary to remove writable userspace mappings, and then
> >> + * ext4_page_mkwrite() can be called during subsequent write access
> >> + * to these partial folios.
> >> + */
> >> + if (blocksize < PAGE_SIZE && start < inode->i_size) {
> >
> > Maybe we should only call ext4_truncate_folio() if the range is not page
> > aligned, rather than calling it everytime for bs < ps?
>
> I agree with you, so how about below?
>
> if (!IS_ALIGNED(start | end, PAGE_SIZE) &&
> blocksize < PAGE_SIZE && start < inode->i_size && )
This looks good Zhang, with this change and the variable rename, feel free to add
Reviewed-by: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
Regards,
ojaswin
>
> >
> >> + loff_t start_boundary = round_up(start, PAGE_SIZE);
> >
> > I think page_boundary seems like a more suitable name for the variable.
>
> Yeah, it looks fine to me.
>
> Thanks,
> Yi.
>
Powered by blists - more mailing lists