Message-ID: <Z8BP4luSfyvz-V4T@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
Date: Thu, 27 Feb 2025 17:13:30 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: Baokun Li <libaokun1@...wei.com>
Cc: linux-ext4@...r.kernel.org, "Theodore Ts'o" <tytso@....edu>,
Jan Kara <jack@...e.cz>, linux-kernel@...r.kernel.org,
Mahesh Kumar <maheshkumar657g@...il.com>,
Ritesh Harjani <ritesh.list@...il.com>,
Yang Erkun <yangerkun@...wei.com>
Subject: Re: [PATCH 1/2] ext4: only defer sb update on error if SB_ACTIVE
On Wed, Feb 26, 2025 at 09:55:52AM +0800, Baokun Li wrote:
> On 2025/2/25 20:06, Ojaswin Mujoo wrote:
> > On Tue, Feb 25, 2025 at 09:53:10AM +0800, Baokun Li wrote:
> > > On 2025/2/22 16:40, Ojaswin Mujoo wrote:
> > > > Presently, we always BUG_ON if trying to start a transaction on a journal
> > > > marked with JBD2_UNMOUNT, since this should never happen. However, while
> > > > running stress tests it was observed that in some error handling paths it
> > > > is possible for update_super_work to start a transaction after the journal
> > > > is destroyed, eg:
> > > >
> > > > (umount)
> > > > ext4_kill_sb
> > > > kill_block_super
> > > > generic_shutdown_super
> > > > sync_filesystem /* commits all txns */
> > > > evict_inodes
> > > > /* might start a new txn */
> > > > ext4_put_super
> > > > flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> > > > jbd2_journal_destroy
> > > > journal_kill_thread
> > > > journal->j_flags |= JBD2_UNMOUNT;
> > > > jbd2_journal_commit_transaction
> > > > jbd2_journal_get_descriptor_buffer
> > > > jbd2_journal_bmap
> > > > ext4_journal_bmap
> > > > ext4_map_blocks
> > > > ...
> > > > ext4_inode_error
> > > Just curious, since jbd2_journal_bmap() only queries the map and does not
> > > create it, how does it fail here? Is there more information in dmesg?
> > > Is s_journal_inum normal after file system corruption?
> > Hey Baokun,
> Hello Ojaswin,
>
> Thanks for your detailed explanation!
> >
> > So I dug a bit more into the vmcore. The error information in sbi looks
> > like this:
> >
> > s_add_error_count = 1,
> > s_first_error_code = 117,
> > s_first_error_line = 475,
> > s_first_error_ino = 0,
> > s_first_error_block = 0,
> > s_first_error_func = 0xc0080000055300d0 <__func__.6> "ext4_read_block_bitmap_nowait",
> > s_first_error_time = 1737023235,
> >
> > s_last_error_code = 117,
> > s_last_error_line = 609,
> > s_last_error_ino = 8,
> > s_last_error_block = 783,
> > s_last_error_func = 0xc008000005531b10 <__func__.41> "ext4_map_blocks",
> > s_last_error_time = 1737023236,
> >
> > The first error is here:
> >
> > if ((bitmap_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)) ||
> > 474 (bitmap_blk >= ext4_blocks_count(sbi->s_es))) {
> > * 475 ext4_error(sb, "Invalid block bitmap block %llu in "
> > 476 "block_group %u", bitmap_blk, block_group);
> > 477 ext4_mark_group_bitmap_corrupted(sb, block_group,
> > 478 EXT4_GROUP_INFO_BBITMAP_CORRUPT);
> > 479 return ERR_PTR(-EFSCORRUPTED);
> > 480 }
> >
> > and the last error is here:
> >
> > 608 if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
> > * 609 ret = check_block_validity(inode, map);
> > 610 if (ret != 0)
> > 611 return ret;
> > 612 }
> >
> >
> > And indeed we have the traces of the first error in dmesg:
> >
> > [75284.713463] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0
> > [75284.713470] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0
> > [75284.713476] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0
> >
> > However, the last error seems strange. It seems like check_block_validity
> > should ideally never fail for the journal inode. Unfortunately, the
> > sbi->s_es page was not recorded in the crash dump for some reason, so I
> > don't know the exact value at the time of the check, but looking at
> > journal->j_inode->i_ino, the inode num is 8, which seems fine to me. So
> > yeah, I'm a bit unsure what caused the corruption. I'll look a bit more
> > into the proc01 LTP test to see if we can recreate the failure and get
> > more info.
> Right, check_block_validity() skips the check for the journal inode. If
> that check nevertheless fails for the journal inode, it means
> s_es->s_journal_inum and journal->j_inode->i_ino are different. The file
> system itself never modifies s_journal_inum, so it must have been modified
> by some other writer bypassing the file system (i.e. writing directly to
> the raw disk).
>
> If that's the case, we could avoid this issue by using EXT4_JOURNAL_INO
> directly or by caching s_journal_inum in ext4_sb_info (which offers better
> compatibility).
Hey Baokun,
So I'm wondering whether modifying the check in __check_block_validity() is
the right thing to do. The fact that something has changed the on-disk
journal inode number is itself concerning, and maybe stopping IO here is
the right thing to do, rather than comparing against a cached journal inode
number and silently ignoring the issue.
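
For reference, and to make the trade-off concrete, the exception we are
talking about looks roughly like this (a sketch from memory, not a verbatim
copy of fs/ext4/inode.c):

	static int __check_block_validity(struct inode *inode, const char *func,
					  unsigned int line,
					  struct ext4_map_blocks *map)
	{
		/*
		 * Skip the validity check for the journal inode. The "is this
		 * the journal inode?" test trusts the on-disk superblock, so a
		 * stray write to s_journal_inum makes the real journal inode
		 * fall through to the normal check below and fail there.
		 */
		if (ext4_has_feature_journal(inode->i_sb) &&
		    (inode->i_ino ==
		     le32_to_cpu(EXT4_SB(inode->i_sb)->s_es->s_journal_inum)))
			return 0;

		if (!ext4_inode_block_valid(inode, map->m_pblk, map->m_len)) {
			ext4_error_inode(inode, func, line, map->m_pblk,
					 "lblock %lu mapped to illegal pblock %llu "
					 "(length %d)", (unsigned long) map->m_lblk,
					 map->m_pblk, map->m_len);
			return -EFSCORRUPTED;
		}
		return 0;
	}

So the change being discussed is essentially swapping the
le32_to_cpu(...s_journal_inum) comparison for EXT4_JOURNAL_INO, or for a
copy of s_journal_inum cached in ext4_sb_info at mount time, so that a
corrupted on-disk field can no longer change the outcome of the check.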

Anyway, now that the journal inode number is corrupted on disk, will the
recovery tooling even be able to read the journal anymore?
Regards,
ojaswin
>
>
> Cheers,
> Baokun
> > > Thanks,
> > > Baokun
> > > > ext4_handle_error
> > > > schedule_work(&sbi->s_sb_upd_work)
> > > >
> > > > /* work queue kicks in */
> > > > update_super_work
> > > > jbd2_journal_start
> > > > start_this_handle
> > > > BUG_ON(journal->j_flags &
> > > > JBD2_UNMOUNT)
> > > >
> > > > Hence, make sure we only defer the update of the ext4 sb if the sb is
> > > > still active. Otherwise, just fall back to an un-journaled commit.
> > > >
> > > > The important thing to note here is that we must only defer the sb update
> > > > if we have not yet flushed the s_sb_upd_work queue in the umount path,
> > > > else this race can be hit (point 1 below). Since we don't have a direct
> > > > way to check for that, we use SB_ACTIVE instead. The SB_ACTIVE check is a
> > > > bit subtle, so adding some notes below for future reference:
> > > >
> > > > 1. Ideally we would want to have a check like (journal->j_flags &
> > > > JBD2_UNMOUNT) == 0; however, this is not correct since we could end up
> > > > scheduling work after the workqueue has been flushed:
> > > >
> > > > ext4_put_super
> > > > flush_work(&sbi->s_sb_upd_work)
> > > >
> > > > **kjournald2**
> > > > jbd2_journal_commit_transaction
> > > > ...
> > > > ext4_inode_error
> > > > /* JBD2_UNMOUNT not set */
> > > > schedule_work(s_sb_upd_work)
> > > >
> > > > jbd2_journal_destroy
> > > > journal->j_flags |= JBD2_UNMOUNT;
> > > >
> > > > **workqueue**
> > > > update_super_work
> > > > jbd2_journal_start
> > > > start_this_handle
> > > > BUG_ON(JBD2_UNMOUNT)
> > > >
> > > > Something like the above doesn't happen with the SB_ACTIVE check because
> > > > we are sure that the workqueue would be flushed at a later point if we
> > > > are in the umount path.
> > > >
> > > > 2. We don't need a similar check in ext4_grp_locked_error since it is
> > > > only called from mballoc and AFAICT it would always be valid to schedule
> > > > work here.
> > > >
> > > > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> > > > Reported-by: Mahesh Kumar <maheshkumar657g@...il.com>
> > > > Suggested-by: Ritesh Harjani <ritesh.list@...il.com>
> > > > Signed-off-by: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
> > > > ---
> > > > fs/ext4/super.c | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > > > index a963ffda692a..b7341e9acf62 100644
> > > > --- a/fs/ext4/super.c
> > > > +++ b/fs/ext4/super.c
> > > > @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
> > > > * constraints, it may not be safe to do it right here so we
> > > > * defer superblock flushing to a workqueue.
> > > > */
> > > > - if (continue_fs && journal)
> > > > + if (continue_fs && journal && (sb->s_flags & SB_ACTIVE))
> > > > schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
> > > > else
> > > > ext4_commit_super(sb);
> > >
>