Message-ID: <84ebed94-d4ab-4b8c-962c-f10ebaca5a75@huawei.com>
Date: Thu, 27 Feb 2025 20:51:11 +0800
From: Baokun Li <libaokun1@...wei.com>
To: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
CC: <linux-ext4@...r.kernel.org>, Theodore Ts'o <tytso@....edu>,
	Jan Kara <jack@...e.cz>, <linux-kernel@...r.kernel.org>,
	Mahesh Kumar <maheshkumar657g@...il.com>,
	Ritesh Harjani <ritesh.list@...il.com>, Yang Erkun <yangerkun@...wei.com>
Subject: Re: [PATCH 1/2] ext4: only defer sb update on error if SB_ACTIVE

On 2025/2/27 19:43, Ojaswin Mujoo wrote:
> On Wed, Feb 26, 2025 at 09:55:52AM +0800, Baokun Li wrote:
>> On 2025/2/25 20:06, Ojaswin Mujoo wrote:
>>> On Tue, Feb 25, 2025 at 09:53:10AM +0800, Baokun Li wrote:
>>>> On 2025/2/22 16:40, Ojaswin Mujoo wrote:
>>>>> Presently we always BUG_ON if trying to start a transaction on a journal
>>>>> marked with JBD2_UNMOUNT, since this should never happen. However while
>>>>> running stress tests it was observed that in case of some error handling
>>>>> paths, it is possible for update_super_work to start a transaction after
>>>>> the journal is destroyed eg:
>>>>>
>>>>> (umount)
>>>>> ext4_kill_sb
>>>>>   kill_block_super
>>>>>     generic_shutdown_super
>>>>>       sync_filesystem /* commits all txns */
>>>>>       evict_inodes
>>>>>         /* might start a new txn */
>>>>>       ext4_put_super
>>>>>         flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>>>>>         jbd2_journal_destroy
>>>>>           journal_kill_thread
>>>>>             journal->j_flags |= JBD2_UNMOUNT;
>>>>>           jbd2_journal_commit_transaction
>>>>>             jbd2_journal_get_descriptor_buffer
>>>>>               jbd2_journal_bmap
>>>>>                 ext4_journal_bmap
>>>>>                   ext4_map_blocks
>>>>>                     ...
>>>>>                     ext4_inode_error
>>>> Just curious, since jbd2_journal_bmap() only queries the map and does not
>>>> create it, how does it fail here? Is there more information in dmesg?
>>>> Is s_journal_inum normal after file system corruption?
>>> Hey Baokun,
>> Hello Ojaswin,
>>
>> Thanks for your detailed explanation!
>>> So I dug a bit more into the vmcore. The error information in sbi looks
>>> like this:
>>>
>>> s_add_error_count = 1,
>>> s_first_error_code = 117,
>>> s_first_error_line = 475,
>>> s_first_error_ino = 0,
>>> s_first_error_block = 0,
>>> s_first_error_func = 0xc0080000055300d0 <__func__.6> "ext4_read_block_bitmap_nowait",
>>> s_first_error_time = 1737023235,
>>>
>>> s_last_error_code = 117,
>>> s_last_error_line = 609,
>>> s_last_error_ino = 8,
>>> s_last_error_block = 783,
>>> s_last_error_func = 0xc008000005531b10 <__func__.41> "ext4_map_blocks",
>>> s_last_error_time = 1737023236,
>>>
>>> The first error is here:
>>>
>>>   473         if ((bitmap_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)) ||
>>>   474             (bitmap_blk >= ext4_blocks_count(sbi->s_es))) {
>>> * 475                 ext4_error(sb, "Invalid block bitmap block %llu in "
>>>   476                            "block_group %u", bitmap_blk, block_group);
>>>   477                 ext4_mark_group_bitmap_corrupted(sb, block_group,
>>>   478                                 EXT4_GROUP_INFO_BBITMAP_CORRUPT);
>>>   479                 return ERR_PTR(-EFSCORRUPTED);
>>>   480         }
>>>
>>> and the last error is here:
>>>
>>>   608         if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
>>> * 609                 ret = check_block_validity(inode, map);
>>>   610                 if (ret != 0)
>>>   611                         return ret;
>>>   612         }
>>>
>>>
>>> And indeed we have the traces of the first error in dmesg:
>>>
>>> [75284.713463] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0
>>> [75284.713470] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0
>>> [75284.713476] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0
>>>
>>> However, the last error seems strange. It seems like check_block_validity
>>> should ideally never fail for a journal inode. Unfortunately, the sbi->s_es
>>> page is not recorded in the crash dump for some reason, so I don't know the
>>> exact value at the time of the check, but looking at journal->j_inode->i_ino,
>>> the inode number is 8, which seems fine to me. So yeah, I'm a bit unsure what
>>> caused the corruption. I'll look a bit more into the proc01 LTP test to see
>>> if we can recreate the failure and get more info.
>> Right, check_block_validity() skips the check for the journal inode. So if
>> it fails for the journal inode, that means s_es->s_journal_inum and
>> journal->j_inode->i_ino no longer match. The file system itself never
>> modifies s_journal_inum, so it must have been changed by some other writer
>> bypassing the file system (i.e. writing directly to the block device).
>>
>> If that's the case, we could avoid this issue by comparing against
>> EXT4_JOURNAL_INO directly, or by saving s_journal_inum into ext4_sb_info
>> at mount time (which offers better compatibility).
> Hey Baokun,
>
> So I'm wondering if modifying the check in __check_block_validity is the
> right thing to do. The fact that something has changed the on-disk journal
> inode number is itself concerning, and maybe stopping IO here is the right
> thing to do, rather than using a cached journal inode number and silently
> ignoring the issue.
Because the cached journal inode is fine (it was validated at mount time),
and the data we're committing is good too, I think it's okay to keep
committing. The actual problem is that someone modified the superblock
directly, bypassing the file system, and the file system can't protect
against that.
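
For reference, the kind of change I had in mind is roughly the following.
This is only an untested sketch against __check_block_validity() in
fs/ext4/inode.c, which as far as I can see currently compares against
le32_to_cpu(sbi->s_es->s_journal_inum); the ext4_sb_info variant would be
the same except it would compare against a field cached from s_journal_inum
at mount time:

static int __check_block_validity(struct inode *inode, const char *func,
				  unsigned int line,
				  struct ext4_map_blocks *map)
{
	/*
	 * Skip the validity check for the journal inode, but compare
	 * against the reserved journal inode number rather than the
	 * live on-disk s_es->s_journal_inum, so that a superblock
	 * field scribbled over from outside the file system cannot
	 * turn journal commits into false EFSCORRUPTED errors.
	 */
	if (ext4_has_feature_journal(inode->i_sb) &&
	    inode->i_ino == EXT4_JOURNAL_INO)
		return 0;

	if (!ext4_inode_block_valid(inode, map->m_pblk, map->m_len)) {
		ext4_error_inode(inode, func, line, map->m_pblk,
				 "lblock %lu mapped to illegal pblock %llu "
				 "(length %d)", (unsigned long) map->m_lblk,
				 map->m_pblk, map->m_len);
		return -EFSCORRUPTED;
	}
	return 0;
}

Either way, the idea is simply to stop trusting the on-disk superblock
field for this particular check.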
>
> Anyways now that the journal inode is corrupted on disk, will the
> recovery tooling even be able to read the journal anymore?
Actually, the journal inode on disk could still be fine. This only tells us
that s_es->s_journal_inum is abnormal, which means part of the on-disk
superblock was changed. If only the superblock was modified, we can restore
it from a backup superblock or grab the latest superblock copy from the
journal.
Cheers,
Baokun