linux-ext4 - Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e9e92601-53bc-42a2-b428-e61bff6153c5@huaweicloud.com>
Date: Fri, 7 Mar 2025 16:43:24 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
Cc: Jan Kara <jack@...e.cz>, Baokun Li <libaokun1@...wei.com>,
 linux-kernel@...r.kernel.org, Mahesh Kumar <maheshkumar657g@...il.com>,
 linux-ext4@...r.kernel.org, Theodore Ts'o <tytso@....edu>
Subject: Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if
 journal is destroying



On 2025/3/7 16:13, Ojaswin Mujoo wrote:
> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
>>>> Presently we always BUG_ON if trying to start a transaction on a journal marked
>>>> with JBD2_UNMOUNT, since this should never happen. However, while ltp running
>>>> stress tests, it was observed that in case of some error handling paths, it is
>>>> possible for update_super_work to start a transaction after the journal is
>>>> destroyed eg:
>>>>
>>>> (umount)
>>>> ext4_kill_sb
>>>>   kill_block_super
>>>>     generic_shutdown_super
>>>>       sync_filesystem /* commits all txns */
>>>>       evict_inodes
>>>>         /* might start a new txn */
>>>>       ext4_put_super
>>>> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>>>>         jbd2_journal_destroy
>>>>           journal_kill_thread
>>>>             journal->j_flags |= JBD2_UNMOUNT;
>>>>           jbd2_journal_commit_transaction
>>>>             jbd2_journal_get_descriptor_buffer
>>>>               jbd2_journal_bmap
>>>>                 ext4_journal_bmap
>>>>                   ext4_map_blocks
>>>>                     ...
>>>>                     ext4_inode_error
>>>>                       ext4_handle_error
>>>>                         schedule_work(&sbi->s_sb_upd_work)
>>>>
>>>>                                                /* work queue kicks in */
>>>>                                                update_super_work
>>>>                                                  jbd2_journal_start
>>>>                                                    start_this_handle
>>>>                                                      BUG_ON(journal->j_flags &
>>>>                                                             JBD2_UNMOUNT)
>>>>
>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
>>>> destroying only do a journaled (and deferred) update of sb if this flag is not
>>>> set. Otherwise, just fallback to an un-journaled commit.
>>>>
>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done
>>>> during ext4_put_super() (except a running transaction that will get commited
>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
>>>> outside the journal as it won't race with a journaled update (refer
>>>> 2d01ddc86606).
>>>>
>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only
>>>> called from mballoc and AFAICT it would be always valid to schedule work here.
>>>>
>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
>>>> Reported-by: Mahesh Kumar <maheshkumar657g@...il.com>
>>>> Suggested-by: Jan Kara <jack@...e.cz>
>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
>>>> ---
>>>>  fs/ext4/ext4.h      | 2 ++
>>>>  fs/ext4/ext4_jbd2.h | 8 ++++++++
>>>>  fs/ext4/super.c     | 4 +++-
>>>>  3 files changed, 13 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>> index 2b7d781bfcad..d48e93bd5690 100644
>>>> --- a/fs/ext4/ext4.h
>>>> +++ b/fs/ext4/ext4.h
>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
>>>>  	 */
>>>>  	struct work_struct s_sb_upd_work;
>>>>  
>>>> +	bool s_journal_destorying;
>>>> +
>>>>  	/* Atomic write unit values in bytes */
>>>>  	unsigned int s_awu_min;
>>>>  	unsigned int s_awu_max;
>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>>>> index 9b3c9df02a39..6bd3ca84410d 100644
>>>> --- a/fs/ext4/ext4_jbd2.h
>>>> +++ b/fs/ext4/ext4_jbd2.h
>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
>>>>  {
>>>>  	int err = 0;
>>>>  
>>>> +	/*
>>>> +	 * At this point all pending FS updates should be done except a possible
>>>> +	 * running transaction (which will commit in jbd2_journal_destroy). It
>>>> +	 * is now safe for any new errors to directly commit superblock rather
>>>> +	 * than going via journal.
>>>> +	 */
>>>> +	sbi->s_journal_destorying = true;
>>>> +
>>>
>>> Hi, Ojaswin!
>>>
>>> I'm afraid you still need to flush the superblock update work here,
>>> otherwise I guess the race condition you mentioned in v1 could still
>>> occur.
>>>
>>>  ext4_put_super()
>>>   flush_work(&sbi->s_sb_upd_work)
>>>
>>>                     **kjournald2**
>>>                     jbd2_journal_commit_transaction()
>>>                     ...
>>>                     ext4_inode_error()
>>>                       /* JBD2_UNMOUNT not set */
>>>                       schedule_work(s_sb_upd_work)
>>>
>>>                                   **workqueue**
>>>                                    update_super_work
>>>                                    /* s_journal_destorying is not set */
>>>                             	   if (journal && !s_journal_destorying)
>>>
>>>   ext4_journal_destroy()
>>>    /* set s_journal_destorying */
>>>    sbi->s_journal_destorying = true;
>>>    jbd2_journal_destroy()
>>>     journal->j_flags |= JBD2_UNMOUNT;
>>>
>>>                                        jbd2_journal_start()
>>>                                         start_this_handle()
>>>                                           BUG_ON(JBD2_UNMOUNT)
>>>
>>> Thanks,
>>> Yi.
>> Hi Yi,
>>
>> Yes you are right, somehow missed this edge case :(
>>
>> Alright then, we have to move out sbi->s_journal_destroying outside the
>> helper. Just wondering if I should still let it be in
>> ext4_journal_destroy and just add an extra s_journal_destroying = false
>> before schedule_work(s_sb_upd_work), because it makes sense.
>>
>> Okay let me give it some thought but thanks for pointing this out!
>>
>> Regards,
>> ojaswin
> 
> Okay so thinking about it a bit more, I see you also suggested to flush
> the work after marking sbi->s_journal_destroying. But will that solve
> it?
> 
>   ext4_put_super()
>    flush_work(&sbi->s_sb_upd_work)
>  
>                      **kjournald2**
>                      jbd2_journal_commit_transaction()
>                      ...
>                      ext4_inode_error()
>                        /* JBD2_UNMOUNT not set */
>                        schedule_work(s_sb_upd_work)
>  
>                                     **workqueue**
>                                     update_super_work
>                                     /* s_journal_destorying is not set */
>                              	      if (journal && !s_journal_destorying)
>  
>    ext4_journal_destroy()
>     /* set s_journal_destorying */
>     sbi->s_journal_destorying = true;
>     flush_work(&sbi->s_sb_upd_work)
>                                       schedule_work()
                                        ^^^^^^^^^^^^^^^
                                        where does this come from?

After this flush_work, we can guarantee that the running s_sb_upd_work
finishes before we set JBD2_UNMOUNT. Additionally, the journal will
not commit transaction or call schedule_work() again because it has
been aborted due to the previous error. Am I missing something?

Thanks,
Yi.

>     jbd2_journal_destroy()
>      journal->j_flags |= JBD2_UNMOUNT;
>  
>                                         jbd2_journal_start()
>                                          start_this_handle()
>                                            BUG_ON(JBD2_UNMOUNT)
> 
> 
> Seems like these edge cases keep sprouting up :)
> 
> As for the fix, how about we do something like this:
> 
>   ext4_put_super()
> 
>    flush_work(&sbi->s_sb_upd_work)
>    destroy_workqueue(sbi->rsv_conversion_wq);
> 
>    ext4_journal_destroy()
>     /* set s_journal_destorying */
>     sbi->s_journal_destorying = true;
> 
>    /* trigger a commit and wait for it to complete */
> 
>     flush_work(&sbi->s_sb_upd_work)
> 
>     jbd2_journal_destroy()
>      journal->j_flags |= JBD2_UNMOUNT;
>  
>                                         jbd2_journal_start()
>                                          start_this_handle()
>                                            BUG_ON(JBD2_UNMOUNT)
> 
> Still giving this codepath some thought but seems like this might just
> be enough to fix the race. Thoughts on this?
> 
> Regards,
> ojaswin
> 
>>>
>>>>  	err = jbd2_journal_destroy(journal);
>>>>  	sbi->s_journal = NULL;
>>>>  
>>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>>>> index 8ad664d47806..31552cf0519a 100644
>>>> --- a/fs/ext4/super.c
>>>> +++ b/fs/ext4/super.c
>>>> @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
>>>>  		 * constraints, it may not be safe to do it right here so we
>>>>  		 * defer superblock flushing to a workqueue.
>>>>  		 */
>>>> -		if (continue_fs && journal)
>>>> +		if (continue_fs && journal && !EXT4_SB(sb)->s_journal_destorying)
>>>>  			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
>>>>  		else
>>>>  			ext4_commit_super(sb);
>>>> @@ -5311,6 +5311,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>>>>  	spin_lock_init(&sbi->s_error_lock);
>>>>  	INIT_WORK(&sbi->s_sb_upd_work, update_super_work);
>>>>  
>>>> +	sbi->s_journal_destorying = false;
>>>> +
>>>>  	err = ext4_group_desc_init(sb, es, logical_sb_block, &first_not_zeroed);
>>>>  	if (err)
>>>>  		goto failed_mount3;
>>>
>