[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <Z8b4Aze43cnDrsBF@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
Date: Tue, 4 Mar 2025 18:24:27 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: Jan Kara <jack@...e.cz>
Cc: linux-ext4@...r.kernel.org, "Theodore Ts'o" <tytso@....edu>,
linux-kernel@...r.kernel.org, Mahesh Kumar <maheshkumar657g@...il.com>,
Ritesh Harjani <ritesh.list@...il.com>
Subject: Re: [PATCH 1/2] ext4: only defer sb update on error if SB_ACTIVE
On Tue, Mar 04, 2025 at 11:25:02AM +0100, Jan Kara wrote:
> On Thu 27-02-25 16:50:22, Ojaswin Mujoo wrote:
> > On Mon, Feb 24, 2025 at 03:52:00PM +0100, Jan Kara wrote:
> > > On Sat 22-02-25 14:10:22, Ojaswin Mujoo wrote:
> > > > Presently we always BUG_ON if trying to start a transaction on a journal
> > > > marked with JBD2_UNMOUNT, since this should never happen. However while
> > > > running stress tests it was observed that in case of some error handling
> > > > paths, it is possible for update_super_work to start a transaction after
> > > > the journal is destroyed eg:
> > > >
> > > > (umount)
> > > > ext4_kill_sb
> > > > kill_block_super
> > > > generic_shutdown_super
> > > > sync_filesystem /* commits all txns */
> > > > evict_inodes
> > > > /* might start a new txn */
> > > > ext4_put_super
> > > > flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> > > > jbd2_journal_destroy
> > > > journal_kill_thread
> > > > journal->j_flags |= JBD2_UNMOUNT;
> > > > jbd2_journal_commit_transaction
> > > > jbd2_journal_get_descriptor_buffer
> > > > jbd2_journal_bmap
> > > > ext4_journal_bmap
> > > > ext4_map_blocks
> > > > ...
> > > > ext4_inode_error
> > > > ext4_handle_error
> > > > schedule_work(&sbi->s_sb_upd_work)
> > > >
> > > > /* work queue kicks in */
> > > > update_super_work
> > > > jbd2_journal_start
> > > > start_this_handle
> > > > BUG_ON(journal->j_flags &
> > > > JBD2_UNMOUNT)
> > > >
> > > > Hence, make sure we only defer the update of ext4 sb if the sb is still
> > > > active. Otherwise, just fallback to an un-journaled commit.
> > > >
> > > > The important thing to note here is that we must only defer sb update if
> > > > we have not yet flushed the s_sb_update_work queue in umount path else
> > > > this race can be hit (point 1 below). Since we don't have a direct way
> > > > to check for that we use SB_ACTIVE instead. The SB_ACTIVE check is a bit
> > > > subtle so adding some notes below for future reference:
> > > >
> > > > 1. Ideally we would want to have a something like (flags & JBD2_UNMOUNT
> > > > == 0) however this is not correct since we could end up scheduling work
> > > > after it has been flushed:
> > > >
> > > > ext4_put_super
> > > > flush_work(&sbi->s_sb_upd_work)
> > > >
> > > > **kjournald2**
> > > > jbd2_journal_commit_transaction
> > > > ...
> > > > ext4_inode_error
> > > > /* JBD2_UNMOUNT not set */
> > > > schedule_work(s_sb_upd_work)
> > > >
> > > > jbd2_journal_destroy
> > > > journal->j_flags |= JBD2_UNMOUNT;
> > > >
> > > > **workqueue**
> > > > update_super_work
> > > > jbd2_journal_start
> > > > start_this_handle
> > > > BUG_ON(JBD2_UNMOUNT)
> > > >
> > > > Something like the above doesn't happen with SB_ACTIVE check because we
> > > > are sure that the workqueue would be flushed at a later point if we are
> > > > in the umount path.
> > > >
> > > > 2. We don't need a similar check in ext4_grp_locked_error since it is
> > > > only called from mballoc and AFAICT it would be always valid to schedule
> > > > work here.
> > > >
> > > > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> > > > Reported-by: Mahesh Kumar <maheshkumar657g@...il.com>
> > > > Suggested-by: Ritesh Harjani <ritesh.list@...il.com>
> > > > Signed-off-by: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
> > >
> > > Good catch! But I think the solution will have to be slightly different.
> > > Basing the check on SB_ACTIVE has the problem that you can have racing
> > > updates of the sb in the still running transaction and in your direct
> > > update leading to inconsistencies after a crash (that was the reason why
> > > we've created the s_sb_upd_work in the first place).
> > >
> > > I would solve this by implementing something like
> > > ext4_update_sb_destroy_journal() which will set a flag in sbi, flush the
> > > workqueue, and then destroy the journal. And ext4_handle_error() will check
> > > for the sbi flag.
> > >
> > > Honza
> >
> > Hey Jan,
> >
> > Thanks for the review. So earlier I did go through different code paths to see
> > if we will have a direct sb write clash with a journalled one it wouldn't but,
> > relooking at it, seems like we might have a scenario as follows:
> >
> > generic_super_shutdown
> > sync_filesytems
> > /* running txns committed. executing ext4_journal_commit_callback */
> > ext4_maybe_update_superblock
> > /* schedules work */
> > schedule_work(&sbi->s_sb_upd_work)
> > update_super_work
> > /* start a txn and add sb to it */
> > sb->s_flags &= ~SB_ACTIVE;
> > evict_inode
> > ext4_evict_inode
> > ext4_std_error
> > ext4_handle_error
> > /* direct commit of sb (Not good!) */
> >
> >
> > Now with the 'setting the flag in sbi' approach, I'm not sure if that will be
> > enough to handle this as well. For example, if we add a flag like
> > sbi->s_journal_destroying, then:
> >
> > ext4_put_super
> > sbi->s_journal_destroying = true
> > flush_workqueue()
> > /* sb is now journalled */
> > jbd2_journal_destory
> > jbd2_journal_commit_transaction
> > /* add tag for sb in descriptor and add buffer to wbufs[] */
> > /* Later from some other buffer in the txn: */
> > jbd2_journal_next_log_block
> > /* hits error in ext4_journal_bmap */
> > ext4_handle_error
> > sbi->s_journal_destroying == true
> > /* update and commit sb directly causing a checksum mismatch b/w entry in descriptor */
> > jbd2_journal_abort
> > /* after abort everything in wbufs[] is written to journal */
> >
> > In the above we will have a checksum mismatch but then maybe its not really
> > an issue. Maybe since we never commit the txn it is understood that the contents
> > can't be trusted and it should be fine to have a mismatch b/w the decriptor tag
> > and the actual super block contents? In which case the sbi flag approach should
> > be fine.
> >
> > Does my understanding sound correct?
>
> Yes. Since the transaction does not get committed, its contents will be
> (and must be) ignored. So although you are correct that the superblock
> content in the transaction need to match the content we write directly, it
> does not matter because whatever is in the uncommitted transaction must
> never be written to the final position on disk.
>
> Honza
Got it, thanks for the review. I'll address this in v2
Regards,
ojaswin
>
> --
> Jan Kara <jack@...e.com>
> SUSE Labs, CR
Powered by blists - more mailing lists