Message-ID: <bct36ajzi6sardnmc6yz4ot4fbpr654b4k2xz54mrtyje7wofq@qpwzbtctwqnf>
Date: Wed, 12 Mar 2025 11:51:03 +0100
From: Jan Kara <jack@...e.cz>
To: Ritesh Harjani <ritesh.list@...il.com>
Cc: Ojaswin Mujoo <ojaswin@...ux.ibm.com>, linux-ext4@...r.kernel.org,
Theodore Ts'o <tytso@....edu>, Jan Kara <jack@...e.cz>, Baokun Li <libaokun1@...wei.com>,
linux-kernel@...r.kernel.org, Mahesh Kumar <maheshkumar657g@...il.com>
Subject: Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if
journal is destroying
On Mon 10-03-25 10:13:36, Ritesh Harjani wrote:
> Ojaswin Mujoo <ojaswin@...ux.ibm.com> writes:
> > On Sun, Mar 09, 2025 at 12:11:22AM +0530, Ritesh Harjani wrote:
> >> Ojaswin Mujoo <ojaswin@...ux.ibm.com> writes:
> >> > On Sat, Mar 08, 2025 at 06:56:23PM +0530, Ritesh Harjani wrote:
> >> >> Ojaswin Mujoo <ojaswin@...ux.ibm.com> writes:
> >> >> > On Sat, Mar 08, 2025 at 03:25:04PM +0530, Ritesh Harjani (IBM) wrote:
> >> >> >> Ojaswin Mujoo <ojaswin@...ux.ibm.com> writes:
> >> >> >> > Presently we always BUG_ON if trying to start a transaction on a journal marked
> >> >> >> > with JBD2_UNMOUNT, since this should never happen. However, while running LTP
> >> >> >> > stress tests, it was observed that in some error handling paths it is possible
> >> >> >> > for update_super_work to start a transaction after the journal is destroyed,
> >> >> >> > e.g.:
> >> >> >> >
> >> >> >> > (umount)
> >> >> >> > ext4_kill_sb
> >> >> >> >   kill_block_super
> >> >> >> >     generic_shutdown_super
> >> >> >> >       sync_filesystem /* commits all txns */
> >> >> >> >       evict_inodes
> >> >> >> >         /* might start a new txn */
> >> >> >> >       ext4_put_super
> >> >> >> >         flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> >> >> >> >         jbd2_journal_destroy
> >> >> >> >           journal_kill_thread
> >> >> >> >             journal->j_flags |= JBD2_UNMOUNT;
> >> >> >> >           jbd2_journal_commit_transaction
> >> >> >> >             jbd2_journal_get_descriptor_buffer
> >> >> >> >               jbd2_journal_bmap
> >> >> >> >                 ext4_journal_bmap
> >> >> >> >                   ext4_map_blocks
> >> >> >> >                     ...
> >> >> >> >                     ext4_inode_error
> >> >> >> >                       ext4_handle_error
> >> >> >> >                         schedule_work(&sbi->s_sb_upd_work)
> >> >> >> >
> >> >> >> > /* work queue kicks in */
> >> >> >> > update_super_work
> >> >> >> >   jbd2_journal_start
> >> >> >> >     start_this_handle
> >> >> >> >       BUG_ON(journal->j_flags &
> >> >> >> >              JBD2_UNMOUNT)
> >> >> >> >
> >> >> >> > Hence, introduce a new sbi flag s_journal_destroying to indicate the journal is
> >> >> >> > being destroyed, and only do a journaled (and deferred) update of the sb if this
> >> >> >> > flag is not set. Otherwise, just fall back to an un-journaled commit.
> >> >> >> >
> >> >> >> > We set sbi->s_journal_destroying = true only after all the FS updates are done
> >> >> >> > during ext4_put_super() (except a running transaction that will get committed
> >> >> >> > during jbd2_journal_destroy()). After this point, it is safe to commit the sb
> >> >> >> > outside the journal as it won't race with a journaled update (refer
> >> >> >> > 2d01ddc86606).
> >> >> >> >
> >> >> >> > Also, we don't need a similar check in ext4_grp_locked_error since it is only
> >> >> >> > called from mballoc and AFAICT it would always be valid to schedule work here.
> >> >> >> >
> >> >> >> > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> >> >> >> > Reported-by: Mahesh Kumar <maheshkumar657g@...il.com>
> >> >> >> > Suggested-by: Jan Kara <jack@...e.cz>
> >> >> >> > Signed-off-by: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
> >> >> >> > ---
> >> >> >> >  fs/ext4/ext4.h      | 2 ++
> >> >> >> >  fs/ext4/ext4_jbd2.h | 8 ++++++++
> >> >> >> >  fs/ext4/super.c     | 4 +++-
> >> >> >> >  3 files changed, 13 insertions(+), 1 deletion(-)
> >> >> >> >
> >> >> >> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >> >> >> > index 2b7d781bfcad..d48e93bd5690 100644
> >> >> >> > --- a/fs/ext4/ext4.h
> >> >> >> > +++ b/fs/ext4/ext4.h
> >> >> >> > @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
> >> >> >> >  	 */
> >> >> >> >  	struct work_struct s_sb_upd_work;
> >> >> >> >
> >> >> >> > +	bool s_journal_destroying;
> >> >> >> > +
> >> >> >> >  	/* Atomic write unit values in bytes */
> >> >> >> >  	unsigned int s_awu_min;
> >> >> >> >  	unsigned int s_awu_max;
> >> >> >> > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> >> >> >> > index 9b3c9df02a39..6bd3ca84410d 100644
> >> >> >> > --- a/fs/ext4/ext4_jbd2.h
> >> >> >> > +++ b/fs/ext4/ext4_jbd2.h
> >> >> >> > @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
> >> >> >> >  {
> >> >> >> >  	int err = 0;
> >> >> >> >
> >> >> >> > +	/*
> >> >> >> > +	 * At this point all pending FS updates should be done except a possible
> >> >> >> > +	 * running transaction (which will commit in jbd2_journal_destroy). It
> >> >> >> > +	 * is now safe for any new errors to directly commit superblock rather
> >> >> >> > +	 * than going via journal.
> >> >> >> > +	 */
> >> >> >> > +	sbi->s_journal_destroying = true;
> >> >> >>
> >> >> >> This is not correct, right? I think what we decided was to set this flag
> >> >> >> before we flush the workqueue, so that we don't schedule any new work
> >> >> >> after this flag has been set. At least that is what I understood.
> >> >> >>
> >> >> >> [1]: https://lore.kernel.org/all/87eczc6rlt.fsf@gmail.com/
> >> >> >>
> >> >> >> -ritesh
> >> >> >
> >> >> > Hey Ritesh,
> >> >> >
> >> >> > Yes, that is not correct; I missed that in my patch. However, we realised
> >> >> > that adding it before flush_work() also has issues [1]. More
> >> >> > specifically:
> >> >>
> >> >> Ohk. right.
> >> >>
> >> >> >
> >> >> > **kjournald2**
> >> >> > jbd2_journal_commit_transaction()
> >> >> >   ...
> >> >> >     ext4_handle_error()
> >> >> >       /* s_journal_destroying is not set */
> >> >> >       if (journal && !s_journal_destroying)
> >> >>
> >> >> Then maybe we should not schedule another work to update the superblock
> >> >> via journalling, if the error itself occurred while we were trying to
> >> >> commit the journal txn?
> >> >>
> >> >>
> >> >> -ritesh
> >> >
> >> > Hmm, ideally yes that should not happen, but how can we achieve that?
> >> > For example, with the trace we saw:
> >> >
> >> > **kjournald2**
> >> > jbd2_journal_commit_transaction()
> >> >   jbd2_journal_get_descriptor_buffer
> >> >     jbd2_journal_bmap
> >> >       ext4_journal_bmap
> >> >         ext4_map_blocks
> >> >           ...
> >> >           ext4_inode_error
> >> >             ext4_handle_error
> >> >               schedule_work(&sbi->s_sb_upd_work)
> >> >
> >> > How do we tell ext4_handle_error that it is in the context of a
> >> > committing txn?
So I was thinking about this. It is not a problem to determine whether we are
running in kjournald context - it is enough to check

	current == EXT4_SB(sb)->s_journal->j_task

But I'm not sure checking this in ext4_handle_error() and doing a direct sb
update instead of scheduling a journalled one is always correct. For example
kjournald also does writeback of ordered data, and if that hits an error we do
not necessarily abort the journal (well, currently we do as far as I can tell,
but it seems a bit fragile to rely on this).
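
For concreteness, that would boil down to something like this in
ext4_handle_error() (just a sketch of that alternative, not what I'm proposing
below):

	/*
	 * Sketch only: detect that we are running in the jbd2 commit thread
	 * (kjournald) and in that case update the superblock directly
	 * instead of deferring the update to s_sb_upd_work.
	 */
	if (continue_fs && journal && current != journal->j_task)
		schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
	else
		ext4_commit_super(sb);
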
So I'd rather keep the solution for these umount issues specific to the
umount path. What if we did:
static void ext4_journal_destroy(struct super_block *sb)
{
	/*
	 * At this point only two things can be operating on the journal:
	 * the JBD2 thread performing transaction commit and s_sb_upd_work
	 * issuing sb update through the journal. Once we set
	 * EXT4_FLAGS_JOURNAL_DESTROY, new ext4_handle_error() calls will not
	 * queue s_sb_upd_work and ext4_force_commit() makes sure any
	 * ext4_handle_error() calls from the running transaction commit are
	 * finished. Hence no new s_sb_upd_work can be queued after we
	 * flush it here.
	 */
	set_bit(EXT4_FLAGS_JOURNAL_DESTROY, &EXT4_SB(sb)->s_ext4_flags);
	ext4_force_commit(sb);
	flush_work(&EXT4_SB(sb)->s_sb_upd_work);
	jbd2_journal_destroy(EXT4_SB(sb)->s_journal);
}
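
For completeness: EXT4_FLAGS_JOURNAL_DESTROY above is a new bit. I'm assuming
it gets added to the existing s_ext4_flags bit enum in ext4.h (listing the
current bits from memory), roughly:

	/* Flags for ext4_sb_info.s_ext4_flags (bit numbers for set_bit/test_bit) */
	enum {
		EXT4_FLAGS_RESIZING,		/* fs is being resized */
		EXT4_FLAGS_SHUTDOWN,		/* fs has been shut down */
		EXT4_FLAGS_BDEV_IS_DAX,		/* underlying bdev supports DAX */
		EXT4_FLAGS_JOURNAL_DESTROY,	/* journal is being destroyed */
	};
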
And then add the check to ext4_handle_error():
 		/*
 		 * In case the fs should keep running, we need to writeout
 		 * superblock through the journal. Due to lock ordering
 		 * constraints, it may not be safe to do it right here so we
-		 * defer superblock flushing to a workqueue.
+		 * defer superblock flushing to a workqueue. We just need
+		 * to be careful when the journal is already shutting down.
+		 * If we get here in that case, just update the sb directly
+		 * as the last transaction won't commit anyway.
 		 */
-		if (continue_fs && journal)
+		if (continue_fs && journal &&
+		    !test_bit(EXT4_FLAGS_JOURNAL_DESTROY,
+			      &EXT4_SB(sb)->s_ext4_flags))
 			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
 		else
 			ext4_commit_super(sb);
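
If the open-coded test_bit() gets too unwieldy in that condition, it could
also be wrapped in a tiny helper (hypothetical name, just a sketch):

	static inline bool ext4_journal_destroy_in_progress(struct super_block *sb)
	{
		return test_bit(EXT4_FLAGS_JOURNAL_DESTROY,
				&EXT4_SB(sb)->s_ext4_flags);
	}
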
What do people think about this? Am I missing some possible race?
Honza
--
Jan Kara <jack@...e.com>
SUSE Labs, CR