linux-ext4 - Re: [PATCH v2] ext4: Fix potential data lost in recovering journal raced with synchronizing fs bdev

Message-ID: <2b2718a4-7d8b-e0bc-c045-59fe7562392d@huawei.com>
Date:   Sat, 9 Sep 2023 11:41:11 +0800
From:   Zhang Yi <yi.zhang@...wei.com>
To:     Zhihao Cheng <chengzhihao1@...wei.com>, <tytso@....edu>,
        <jack@...e.com>
CC:     <linux-ext4@...r.kernel.org>
Subject: Re: [PATCH v2] ext4: Fix potential data lost in recovering journal
 raced with synchronizing fs bdev

Hello!

On 2023/9/8 20:43, Zhihao Cheng wrote:
> JBD2 uses sync_blockdev() to make sure journal data has reached the fs
> device. However, another process can consume the EIO recorded in the
> bdev's mapping, so journal recovery appears to succeed even though EIO
> occurred while data was being written back to the fs device.
> 
> We found this problem in our product, which uses iscsi + multipath as the
> block device for ext4. An unstable network may trigger kpartx to rescan
> partitions in the device mapper layer. The detailed sequence is as follows:
> 
>   mount          kpartx          irq
> jbd2_journal_recover
>  do_one_pass
>   memcpy(nbh->b_data, obh->b_data) // copy data to fs dev from journal
>   mark_buffer_dirty // mark bh dirty
>          vfs_read
> 	  generic_file_read_iter // dio
> 	   filemap_write_and_wait_range
> 	    __filemap_fdatawrite_range
> 	     do_writepages
> 	      block_write_full_folio
> 	       submit_bh_wbc
> 	            >>  EIO occurs in disk  <<
> 	                     end_buffer_async_write
> 			      mark_buffer_write_io_error
> 			       mapping_set_error
> 			        set_bit(AS_EIO, &mapping->flags) // set!
> 	    filemap_check_errors
> 	     test_and_clear_bit(AS_EIO, &mapping->flags) // clear!
>  err2 = sync_blockdev
>   filemap_write_and_wait
>    filemap_check_errors
>     test_and_clear_bit(AS_EIO, &mapping->flags) // false
>  err2 = 0
> 
> The filesystem is mounted successfully even though data from the journal
> failed to be written to disk, and ext4 could become corrupted.
> 
> Fix it by comparing 'sbi->s_bdev_wb_err' before and after loading the
> journal.
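
To make the difference between the two reporting styles concrete, here is a
minimal userspace sketch. It is only a simplified model: every name in it is
hypothetical, and the real errseq_t carries more state than a bare counter.
It replays the interleaving from the trace above, once with a shared
test-and-clear flag and once with a per-checker cursor.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */

/* Flag style: one sticky bit shared by every checker. */
struct flag_err { bool eio; };

static void flag_set(struct flag_err *f) { f->eio = true; }

/* Returns true if an error was pending, and clears it for everyone. */
static bool flag_test_and_clear(struct flag_err *f)
{
	bool was_set = f->eio;

	f->eio = false;
	return was_set;
}

/* Cursor style: errors bump a counter, each checker keeps its own sample. */
struct seq_err { unsigned int seq; };

static void seq_set(struct seq_err *s) { s->seq++; }

/* Returns true if an error was recorded since *cursor was last sampled. */
static bool seq_check_and_advance(const struct seq_err *s, unsigned int *cursor)
{
	bool changed = (*cursor != s->seq);

	*cursor = s->seq;
	return changed;
}

/* Write-back fails, another reader checks first, then mount checks. */
static bool mount_sees_error_with_flag(void)
{
	struct flag_err f = { false };

	flag_set(&f);                   /* write-back hits EIO */
	(void)flag_test_and_clear(&f);  /* other reader consumes the flag */
	return flag_test_and_clear(&f); /* mount's check finds nothing */
}

static bool mount_sees_error_with_cursor(void)
{
	struct seq_err s = { 0 };
	unsigned int mount_cursor = 0, reader_cursor = 0;

	(void)seq_check_and_advance(&s, &mount_cursor);  /* sample before recovery */
	(void)seq_check_and_advance(&s, &reader_cursor);
	seq_set(&s);                                     /* write-back hits EIO */
	(void)seq_check_and_advance(&s, &reader_cursor); /* other reader checks first */
	return seq_check_and_advance(&s, &mount_cursor); /* mount still sees it */
}
```

In this model the flag variant loses the error to whichever checker runs
first, while the cursor variant reports it to each checker independently.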
> 
> A reproducer is available at [Link].
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=217888
> Cc: stable@...r.kernel.org
> Signed-off-by: Zhihao Cheng <chengzhihao1@...wei.com>
> Signed-off-by: Zhang Yi <yi.zhang@...wei.com>
> ---
>  v1->v2: Checks wb_err from block device only in ext4.
>  fs/ext4/super.c | 22 +++++++++++++++-------
>  1 file changed, 15 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 38217422f938..4dcaad2403be 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4907,6 +4907,14 @@ static int ext4_load_and_init_journal(struct super_block *sb,
>  	if (err)
>  		return err;
>  
> +	err = errseq_check_and_advance(&sb->s_bdev->bd_inode->i_mapping->wb_err,
> +				       &sbi->s_bdev_wb_err);
> +	if (err) {
> +		ext4_msg(sb, KERN_ERR, "Background error %d when loading journal",
> +			 err);
> +		goto out;
> +	}
> +

This does not solve the problem: the journal tail is still updated in
journal_reset() even if we detect the writeback error here and refuse to
mount the ext4 filesystem, so a later mount would no longer replay the
transactions whose writeback failed. So I think the I/O error has to be
checked inside the jbd2 module itself, as v1 does.
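
The ordering concern can be sketched with a toy model. This is not jbd2 code:
the struct, the helpers, and the assumption that resetting simply moves the
tail up to the head are all hypothetical simplifications. It compares
checking the error after the reset with checking it before.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy journal: transactions in [tail, head) still need to be replayed. */
struct toy_journal {
	unsigned int tail;
	unsigned int head;
	bool wb_error;	/* set if writing replayed blocks to the fs device failed */
};

/* Pretend to replay the journal; the caller decides whether write-back fails. */
static void toy_recover(struct toy_journal *j, bool writeback_fails)
{
	j->wb_error = writeback_fails;
}

/* Assumed effect of resetting: the journal is treated as empty afterwards. */
static void toy_reset(struct toy_journal *j)
{
	j->tail = j->head;
}

static unsigned int toy_pending(const struct toy_journal *j)
{
	return j->head - j->tail;
}

/* Error is checked only after the tail has moved. */
static unsigned int load_check_after_reset(struct toy_journal *j,
					   bool writeback_fails, int *err)
{
	toy_recover(j, writeback_fails);
	toy_reset(j);
	*err = j->wb_error ? -EIO : 0;
	return toy_pending(j);	/* nothing left for the next mount to replay */
}

/* Error is checked before the tail is touched. */
static unsigned int load_check_before_reset(struct toy_journal *j,
					    bool writeback_fails, int *err)
{
	toy_recover(j, writeback_fails);
	if (j->wb_error) {
		*err = -EIO;
		return toy_pending(j);	/* transactions remain replayable */
	}
	toy_reset(j);
	*err = 0;
	return toy_pending(j);
}
```

Both variants report the error, but only the second leaves the transactions
in place for a later recovery attempt.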

Thanks,
Yi.

>  	if (ext4_has_feature_64bit(sb) &&
>  	    !jbd2_journal_set_features(EXT4_SB(sb)->s_journal, 0, 0,
>  				       JBD2_FEATURE_INCOMPAT_64BIT)) {
> @@ -5365,6 +5373,13 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>  			goto failed_mount3a;
>  	}
>  
> +	/*
> +	 * Save the original bdev mapping's wb_err value which could be
> +	 * used to detect the metadata async write error.
> +	 */
> +	spin_lock_init(&sbi->s_bdev_wb_lock);
> +	errseq_check_and_advance(&sb->s_bdev->bd_inode->i_mapping->wb_err,
> +				 &sbi->s_bdev_wb_err);
>  	err = -EINVAL;
>  	/*
>  	 * The first inode we look at is the journal inode.  Don't try
> @@ -5571,13 +5586,6 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>  	}
>  #endif  /* CONFIG_QUOTA */
>  
> -	/*
> -	 * Save the original bdev mapping's wb_err value which could be
> -	 * used to detect the metadata async write error.
> -	 */
> -	spin_lock_init(&sbi->s_bdev_wb_lock);
> -	errseq_check_and_advance(&sb->s_bdev->bd_inode->i_mapping->wb_err,
> -				 &sbi->s_bdev_wb_err);
>  	EXT4_SB(sb)->s_mount_state |= EXT4_ORPHAN_FS;
>  	ext4_orphan_cleanup(sb, es);
>  	EXT4_SB(sb)->s_mount_state &= ~EXT4_ORPHAN_FS;
>