[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <mpm3x7uonxoc73lgva72vaiydc76cmr5niapm45ipk6ts5voab@e7zundhoui6i>
Date: Tue, 4 Mar 2025 11:25:02 +0100
From: Jan Kara <jack@...e.cz>
To: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
Cc: Jan Kara <jack@...e.cz>, linux-ext4@...r.kernel.org,
Theodore Ts'o <tytso@....edu>, linux-kernel@...r.kernel.org,
Mahesh Kumar <maheshkumar657g@...il.com>, Ritesh Harjani <ritesh.list@...il.com>
Subject: Re: [PATCH 1/2] ext4: only defer sb update on error if SB_ACTIVE
On Thu 27-02-25 16:50:22, Ojaswin Mujoo wrote:
> On Mon, Feb 24, 2025 at 03:52:00PM +0100, Jan Kara wrote:
> > On Sat 22-02-25 14:10:22, Ojaswin Mujoo wrote:
> > > Presently we always BUG_ON if trying to start a transaction on a journal
> > > marked with JBD2_UNMOUNT, since this should never happen. However while
> > > running stress tests it was observed that in case of some error handling
> > > paths, it is possible for update_super_work to start a transaction after
> > > the journal is destroyed eg:
> > >
> > > (umount)
> > > ext4_kill_sb
> > > kill_block_super
> > > generic_shutdown_super
> > > sync_filesystem /* commits all txns */
> > > evict_inodes
> > > /* might start a new txn */
> > > ext4_put_super
> > > flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> > > jbd2_journal_destroy
> > > journal_kill_thread
> > > journal->j_flags |= JBD2_UNMOUNT;
> > > jbd2_journal_commit_transaction
> > > jbd2_journal_get_descriptor_buffer
> > > jbd2_journal_bmap
> > > ext4_journal_bmap
> > > ext4_map_blocks
> > > ...
> > > ext4_inode_error
> > > ext4_handle_error
> > > schedule_work(&sbi->s_sb_upd_work)
> > >
> > > /* work queue kicks in */
> > > update_super_work
> > > jbd2_journal_start
> > > start_this_handle
> > > BUG_ON(journal->j_flags &
> > > JBD2_UNMOUNT)
> > >
> > > Hence, make sure we only defer the update of ext4 sb if the sb is still
> > > active. Otherwise, just fallback to an un-journaled commit.
> > >
> > > The important thing to note here is that we must only defer sb update if
> > > we have not yet flushed the s_sb_update_work queue in umount path else
> > > this race can be hit (point 1 below). Since we don't have a direct way
> > > to check for that we use SB_ACTIVE instead. The SB_ACTIVE check is a bit
> > > subtle so adding some notes below for future reference:
> > >
> > > 1. Ideally we would want to have a something like (flags & JBD2_UNMOUNT
> > > == 0) however this is not correct since we could end up scheduling work
> > > after it has been flushed:
> > >
> > > ext4_put_super
> > > flush_work(&sbi->s_sb_upd_work)
> > >
> > > **kjournald2**
> > > jbd2_journal_commit_transaction
> > > ...
> > > ext4_inode_error
> > > /* JBD2_UNMOUNT not set */
> > > schedule_work(s_sb_upd_work)
> > >
> > > jbd2_journal_destroy
> > > journal->j_flags |= JBD2_UNMOUNT;
> > >
> > > **workqueue**
> > > update_super_work
> > > jbd2_journal_start
> > > start_this_handle
> > > BUG_ON(JBD2_UNMOUNT)
> > >
> > > Something like the above doesn't happen with SB_ACTIVE check because we
> > > are sure that the workqueue would be flushed at a later point if we are
> > > in the umount path.
> > >
> > > 2. We don't need a similar check in ext4_grp_locked_error since it is
> > > only called from mballoc and AFAICT it would be always valid to schedule
> > > work here.
> > >
> > > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> > > Reported-by: Mahesh Kumar <maheshkumar657g@...il.com>
> > > Suggested-by: Ritesh Harjani <ritesh.list@...il.com>
> > > Signed-off-by: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
> >
> > Good catch! But I think the solution will have to be slightly different.
> > Basing the check on SB_ACTIVE has the problem that you can have racing
> > updates of the sb in the still running transaction and in your direct
> > update leading to inconsistencies after a crash (that was the reason why
> > we've created the s_sb_upd_work in the first place).
> >
> > I would solve this by implementing something like
> > ext4_update_sb_destroy_journal() which will set a flag in sbi, flush the
> > workqueue, and then destroy the journal. And ext4_handle_error() will check
> > for the sbi flag.
> >
> > Honza
>
> Hey Jan,
>
> Thanks for the review. So earlier I did go through different code paths to see
> if we will have a direct sb write clash with a journalled one it wouldn't but,
> relooking at it, seems like we might have a scenario as follows:
>
> generic_super_shutdown
> sync_filesytems
> /* running txns committed. executing ext4_journal_commit_callback */
> ext4_maybe_update_superblock
> /* schedules work */
> schedule_work(&sbi->s_sb_upd_work)
> update_super_work
> /* start a txn and add sb to it */
> sb->s_flags &= ~SB_ACTIVE;
> evict_inode
> ext4_evict_inode
> ext4_std_error
> ext4_handle_error
> /* direct commit of sb (Not good!) */
>
>
> Now with the 'setting the flag in sbi' approach, I'm not sure if that will be
> enough to handle this as well. For example, if we add a flag like
> sbi->s_journal_destroying, then:
>
> ext4_put_super
> sbi->s_journal_destroying = true
> flush_workqueue()
> /* sb is now journalled */
> jbd2_journal_destory
> jbd2_journal_commit_transaction
> /* add tag for sb in descriptor and add buffer to wbufs[] */
> /* Later from some other buffer in the txn: */
> jbd2_journal_next_log_block
> /* hits error in ext4_journal_bmap */
> ext4_handle_error
> sbi->s_journal_destroying == true
> /* update and commit sb directly causing a checksum mismatch b/w entry in descriptor */
> jbd2_journal_abort
> /* after abort everything in wbufs[] is written to journal */
>
> In the above we will have a checksum mismatch but then maybe its not really
> an issue. Maybe since we never commit the txn it is understood that the contents
> can't be trusted and it should be fine to have a mismatch b/w the decriptor tag
> and the actual super block contents? In which case the sbi flag approach should
> be fine.
>
> Does my understanding sound correct?
Yes. Since the transaction does not get committed, its contents will be
(and must be) ignored. So although you are correct that the superblock
content in the transaction need to match the content we write directly, it
does not matter because whatever is in the uncommitted transaction must
never be written to the final position on disk.
Honza
--
Jan Kara <jack@...e.com>
SUSE Labs, CR
Powered by blists - more mailing lists