Message-ID: <87jz8zhmcv.fsf@gmail.com>
Date: Sat, 08 Mar 2025 15:31:52 +0530
From: Ritesh Harjani (IBM) <ritesh.list@...il.com>
To: Zhang Yi <yi.zhang@...weicloud.com>, Ojaswin Mujoo <ojaswin@...ux.ibm.com>
Cc: Jan Kara <jack@...e.cz>, Baokun Li <libaokun1@...wei.com>, linux-kernel@...r.kernel.org, Mahesh Kumar <maheshkumar657g@...il.com>, linux-ext4@...r.kernel.org, Theodore Ts'o <tytso@....edu>
Subject: Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying

Zhang Yi <yi.zhang@...weicloud.com> writes:

> On 2025/3/8 1:26, Ojaswin Mujoo wrote:
>> On Fri, Mar 07, 2025 at 08:36:08PM +0800, Zhang Yi wrote:
>>> On 2025/3/7 18:27, Ojaswin Mujoo wrote:
>>>> On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote:
>>>>> On 2025/3/7 16:13, Ojaswin Mujoo wrote:
>>>>>> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
>>>>>>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
>>>>>>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
>>>>>>>>> Presently we always BUG_ON when trying to start a transaction on a journal
>>>>>>>>> marked with JBD2_UNMOUNT, since this should never happen. However, while
>>>>>>>>> running LTP stress tests, it was observed that in some error handling paths
>>>>>>>>> it is possible for update_super_work to start a transaction after the
>>>>>>>>> journal is destroyed, e.g.:
>>>>>>>>>
>>>>>>>>> (umount)
>>>>>>>>> ext4_kill_sb
>>>>>>>>>   kill_block_super
>>>>>>>>>     generic_shutdown_super
>>>>>>>>>       sync_filesystem /* commits all txns */
>>>>>>>>>       evict_inodes
>>>>>>>>>         /* might start a new txn */
>>>>>>>>>       ext4_put_super
>>>>>>>>> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>>>>>>>>>         jbd2_journal_destroy
>>>>>>>>>           journal_kill_thread
>>>>>>>>>             journal->j_flags |= JBD2_UNMOUNT;
>>>>>>>>>           jbd2_journal_commit_transaction
>>>>>>>>>             jbd2_journal_get_descriptor_buffer
>>>>>>>>>               jbd2_journal_bmap
>>>>>>>>>                 ext4_journal_bmap
>>>>>>>>>                   ext4_map_blocks
>>>>>>>>>                     ...
>>>>>>>>>                     ext4_inode_error
>>>>>>>>>                       ext4_handle_error
>>>>>>>>>                         schedule_work(&sbi->s_sb_upd_work)
>>>>>>>>>
>>>>>>>>>                                                /* work queue kicks in */
>>>>>>>>>                                                update_super_work
>>>>>>>>>                                                  jbd2_journal_start
>>>>>>>>>                                                    start_this_handle
>>>>>>>>>                                                      BUG_ON(journal->j_flags &
>>>>>>>>>                                                             JBD2_UNMOUNT)
>>>>>>>>>
>>>>>>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate the journal
>>>>>>>>> is being destroyed, and only do a journaled (and deferred) update of the sb
>>>>>>>>> if this flag is not set. Otherwise, just fall back to an un-journaled commit.
>>>>>>>>>
>>>>>>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done
>>>>>>>>> during ext4_put_super() (except a running transaction that will get committed
>>>>>>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
>>>>>>>>> outside the journal as it won't race with a journaled update (refer
>>>>>>>>> 2d01ddc86606).
>>>>>>>>>
>>>>>>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only
>>>>>>>>> called from mballoc and AFAICT it would always be valid to schedule work here.
>>>>>>>>>
>>>>>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
>>>>>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@...il.com>
>>>>>>>>> Suggested-by: Jan Kara <jack@...e.cz>
>>>>>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
>>>>>>>>> ---
>>>>>>>>>  fs/ext4/ext4.h      | 2 ++
>>>>>>>>>  fs/ext4/ext4_jbd2.h | 8 ++++++++
>>>>>>>>>  fs/ext4/super.c     | 4 +++-
>>>>>>>>>  3 files changed, 13 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>>>>>>> index 2b7d781bfcad..d48e93bd5690 100644
>>>>>>>>> --- a/fs/ext4/ext4.h
>>>>>>>>> +++ b/fs/ext4/ext4.h
>>>>>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
>>>>>>>>>  	 */
>>>>>>>>>  	struct work_struct s_sb_upd_work;
>>>>>>>>>  
>>>>>>>>> +	bool s_journal_destorying;
>>>>>>>>> +
>>>>>>>>>  	/* Atomic write unit values in bytes */
>>>>>>>>>  	unsigned int s_awu_min;
>>>>>>>>>  	unsigned int s_awu_max;
>>>>>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>>>>>>>>> index 9b3c9df02a39..6bd3ca84410d 100644
>>>>>>>>> --- a/fs/ext4/ext4_jbd2.h
>>>>>>>>> +++ b/fs/ext4/ext4_jbd2.h
>>>>>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
>>>>>>>>>  {
>>>>>>>>>  	int err = 0;
>>>>>>>>>  
>>>>>>>>> +	/*
>>>>>>>>> +	 * At this point all pending FS updates should be done except a possible
>>>>>>>>> +	 * running transaction (which will commit in jbd2_journal_destroy). It
>>>>>>>>> +	 * is now safe for any new errors to directly commit superblock rather
>>>>>>>>> +	 * than going via journal.
>>>>>>>>> +	 */
>>>>>>>>> +	sbi->s_journal_destorying = true;
>>>>>>>>> +
>>>>>>>>
>>>>>>>> Hi, Ojaswin!
>>>>>>>>
>>>>>>>> I'm afraid you still need to flush the superblock update work here,
>>>>>>>> otherwise I guess the race condition you mentioned in v1 could still
>>>>>>>> occur.
>>>>>>>>
>>>>>>>>  ext4_put_super()
>>>>>>>>   flush_work(&sbi->s_sb_upd_work)
>>>>>>>>
>>>>>>>>                     **kjournald2**
>>>>>>>>                     jbd2_journal_commit_transaction()
>>>>>>>>                     ...
>>>>>>>>                     ext4_inode_error()
>>>>>>>>                       /* JBD2_UNMOUNT not set */
>>>>>>>>                       schedule_work(s_sb_upd_work)
>>>>>>>>
>>>>>>>>                                   **workqueue**
>>>>>>>>                                    update_super_work
>>>>>>>>                                    /* s_journal_destorying is not set */
>>>>>>>>                             	   if (journal && !s_journal_destorying)
>>>>>>>>
>>>>>>>>   ext4_journal_destroy()
>>>>>>>>    /* set s_journal_destorying */
>>>>>>>>    sbi->s_journal_destorying = true;
>>>>>>>>    jbd2_journal_destroy()
>>>>>>>>     journal->j_flags |= JBD2_UNMOUNT;
>>>>>>>>
>>>>>>>>                                        jbd2_journal_start()
>>>>>>>>                                         start_this_handle()
>>>>>>>>                                           BUG_ON(JBD2_UNMOUNT)
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Yi.
>>>>>>> Hi Yi,
>>>>>>>
>>>>>>> Yes, you are right, I somehow missed this edge case :(
>>>>>>>
>>>>>>> Alright then, we have to move sbi->s_journal_destroying out of the
>>>>>>> helper. Just wondering if I should still keep it in
>>>>>>> ext4_journal_destroy and just add an extra s_journal_destroying = false
>>>>>>> before schedule_work(s_sb_upd_work), since that placement makes sense.
>>>>>>>
>>>>>>> Okay let me give it some thought but thanks for pointing this out!
>>>>>>>
>>>>>>> Regards,
>>>>>>> ojaswin
>>>>>>
>>>>>> Okay so thinking about it a bit more, I see you also suggested to flush
>>>>>> the work after marking sbi->s_journal_destroying. But will that solve
>>>>>> it?
>>>>>>
>>>>>>   ext4_put_super()
>>>>>>    flush_work(&sbi->s_sb_upd_work)
>>>>>>  
>>>>>>                      **kjournald2**
>>>>>>                      jbd2_journal_commit_transaction()
>>>>>>                      ...
>>>>>>                      ext4_inode_error()
>>>>>>                        /* JBD2_UNMOUNT not set */
>>>>>>                        schedule_work(s_sb_upd_work)
>>>>>>  
>>>>>>                                     **workqueue**
>>>>>>                                     update_super_work
>>>>>>                                     /* s_journal_destorying is not set */
>>>>>>                              	      if (journal && !s_journal_destorying)
>>>>>>  
>>>>>>    ext4_journal_destroy()
>>>>>>     /* set s_journal_destorying */
>>>>>>     sbi->s_journal_destorying = true;
>>>>>>     flush_work(&sbi->s_sb_upd_work)
>>>>>>                                       schedule_work()
>>>>>                                         ^^^^^^^^^^^^^^^
>>>>>                                         where does this come from?
>>>>>
>>>>> After this flush_work, we can guarantee that the running s_sb_upd_work
>>>>> finishes before we set JBD2_UNMOUNT. Additionally, the journal will
>>>>> not commit transaction or call schedule_work() again because it has
>>>>> been aborted due to the previous error. Am I missing something?
>>>>>
>>>>> Thanks,
>>>>> Yi.
>>>>
>>>> Hmm, so I am thinking of a corner case in ext4_handle_error() where
>>>>
>>>>  if (journal && !is_journal_destroying)
>>>>
>>>> is evaluated but schedule_work() is not called yet, which is possible
>>>> because the compare followed by the jump is not atomic. If schedule_work()
>>>> is only called after we have done the flush, then we end up with this:
>>>>
>>>>                               	      if (journal && !s_journal_destorying)
>>>>     ext4_journal_destroy()
>>>>      /* set s_journal_destorying */
>>>>      sbi->s_journal_destorying = true;
>>>>      flush_work(&sbi->s_sb_upd_work)
>>>>                                        schedule_work()
>>>>
>>>> Which is possible IMO, although the window is tiny.
>>>
>>> Yeah, right!
>>> Sorry for misreading the location where you added the "!s_journal_destorying"
>>> check; the graph I provided was for update_super_work(), which was wrong.
>> 
>> Oh right, I also misread your trace but yes as discussed, even 
>> 
>>     sbi->s_journal_destorying = true;
>> 		flush_work()
>>     jbd2_journal_destroy()
>> 
>> doesn't work.
>> 
>>> The right one should be:
>>>
>>>  ext4_put_super()
>>>   flush_work(&sbi->s_sb_upd_work)
>>>
>>>                     **kjournald2**
>>>                     jbd2_journal_commit_transaction()
>>>                     ...
>>>                     ext4_inode_error()
>>>                       /* s_journal_destorying is not set */
>>>                       if (journal && !s_journal_destorying)
>>>                         (schedule_work(s_sb_upd_work))  //can be here
>>>
>>>   ext4_journal_destroy()
>>>    /* set s_journal_destorying */
>>>    sbi->s_journal_destorying = true;
>>>    jbd2_journal_destroy()
>>>     journal->j_flags |= JBD2_UNMOUNT;
>>>
>>>                         (schedule_work(s_sb_upd_work))  //also can be here
>>>
>>>                                   **workqueue**
>>>                                    update_super_work()
>>>                                    journal = sbi->s_journal //get journal
>>>     kfree(journal)
>>>                                      jbd2_journal_start(journal) //journal UAF
>>>                                        start_this_handle()
>>>                                          BUG_ON(JBD2_UNMOUNT) //bugon here
>>>
>>>
>>> So there are two problems here, the first one is the 'journal' UAF,
>>> the second one is triggering JBD2_UNMOUNT flag BUGON.
>> 
>> Indeed, there's a possible UAF here as well.
>> 
>>>
>>>>>>
>>>>>> As for the fix, how about we do something like this:
>>>>>>
>>>>>>   ext4_put_super()
>>>>>>
>>>>>>    flush_work(&sbi->s_sb_upd_work)
>>>>>>    destroy_workqueue(sbi->rsv_conversion_wq);
>>>>>>
>>>>>>    ext4_journal_destroy()
>>>>>>     /* set s_journal_destorying */
>>>>>>     sbi->s_journal_destorying = true;
>>>>>>
>>>>>>    /* trigger a commit and wait for it to complete */
>>>>>>
>>>>>>     flush_work(&sbi->s_sb_upd_work)
>>>>>>
>>>>>>     jbd2_journal_destroy()
>>>>>>      journal->j_flags |= JBD2_UNMOUNT;
>>>>>>  
>>>>>>                                         jbd2_journal_start()
>>>>>>                                          start_this_handle()
>>>>>>                                            BUG_ON(JBD2_UNMOUNT)
>>>>>>
>>>>>> Still giving this codepath some thought but seems like this might just
>>>>>> be enough to fix the race. Thoughts on this?
>>>>>>
>>>
>>> I think this solution should work, the forced commit and flush_work()
>>> should ensure that the last transaction is committed and that the
>>> potential work is done.
>>>
>>> Besides, the s_journal_destorying flag is now set and checked
>>> concurrently, so we need WRITE_ONCE() and READ_ONCE() for it. Also,
>>> what about adding a new flag to sbi->s_mount_state instead of adding
>>> the new s_journal_destorying field?
>> 
>> Right, that makes sense. I will incorporate these changes in the next
>> revision.
>> 
>
> Thinking about this again, it seems that we no longer need the destroying
> flag, because we force a commit and wait for the **last** transaction to
> complete, and the flush work should also ensure that the last sb_update
> work completes. Regardless of whether a new handle is started in the
> last update_super_work(), it will not commit since the journal should
> have aborted. What are your thoughts?
>

I think the confusion may be coming from the fact that the v2 patch doesn't
set s_journal_destroying at the place we discussed in this thread [1]:

[1]: https://lore.kernel.org/linux-ext4/jnxpphuradrsf73cxfmohfu7wwwckihtulw6ovsitddgt5pqkg@2uoejkr66qnl/


>  ext4_put_super()

   + sbi->s_journal_destroying = true;

We should set s_journal_destroying to true before calling
flush_work().

>   flush_work(&sbi->s_sb_upd_work)

After the above flush_work() completes, we will always check whether
s_journal_destroying is set. If it is, we should never schedule the
sb update work.

>   destroy_workqueue(sbi->rsv_conversion_wq)
>
>   ext4_journal_destroy()
>    /* trigger a commit (it will commit the last transaction) */
>
>                     **kjournald2**
>                     jbd2_journal_commit_transaction()
>                     ...
>                      ext4_inode_error()
>                       schedule_work(s_sb_upd_work))

Then this schedule_work() will never happen, since the error path checks
whether the s_journal_destroying flag is set.

>
>                                      **workqueue**
>                                       update_super_work()
>                                         jbd2_journal_start(journal)
>                                           start_this_handle()
>                                           //This new trans will
>                                           //not be committed.
>
>                      jbd2_journal_abort()
>
>    /* wait for it to complete */
>
>    flush_work(&sbi->s_sb_upd_work)

No need to call flush_work() again here, since no new work will be
scheduled at this point, right?

Am I missing something?

-ritesh


>    jbd2_journal_destroy()
>     journal->j_flags |= JBD2_UNMOUNT;
>    jbd2_journal_commit_transaction() //it will commit nothing
>
> Thanks,
> Yi.
