Message-ID: <7b9c7a42-de7b-4408-91a6-1c35e14cc380@dupond.be>
Date: Wed, 2 Jul 2025 15:43:25 +0200
From: Jean-Louis Dupond <jean-louis@...ond.be>
To: linux-ext4@...r.kernel.org
Subject: Re: ext4 metadata corruption - snapshot related?

We updated a machine to a newer 6.15.2-1.el8.elrepo.x86_64 kernel, and
what appears to be the same bug recurred after some time.

The error was the following:
Jul 02 11:03:35 xxxxx kernel: EXT4-fs error (device sdd1): ext4_lookup:1791: inode #44962812: comm imap: deleted inode referenced: 44997932
Jul 02 11:03:35 xxxxx kernel: EXT4-fs error (device sdd1): ext4_lookup:1791: inode #44962812: comm imap: deleted inode referenced: 44997932
Jul 02 11:03:35 xxxxx kernel: EXT4-fs error (device sdd1): ext4_lookup:1791: inode #44962812: comm imap: deleted inode referenced: 44997932
Jul 02 11:04:03 xxxxx kernel: EXT4-fs error (device sdd1): ext4_lookup:1791: inode #44962812: comm imap: deleted inode referenced: 44997932

Any ideas on how this could be debugged further?
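
In case anyone wants to dig in, this is roughly what we plan to capture
the next time it triggers (device and inode numbers are taken from the
log above; a sketch only, debugfs output details vary by e2fsprogs
version):

debugfs -R 'stat <44962812>' /dev/sdd1   # the directory holding the stale entry
debugfs -R 'stat <44997932>' /dev/sdd1   # the "deleted" inode it references
e2fsck -fn /dev/sdd1                     # read-only dry run, from a rescue env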

Thanks
Jean-Louis

On 12/06/2025 16:43, Jean-Louis Dupond wrote:
> Hi,
>
> We have around 200 VMs running on qemu (on an AlmaLinux 9 based
> hypervisor).
> All of those VMs were recently migrated from physical machines.
>
> But when we enable backups on those VMs (which triggers snapshots),
> we notice some weird/random ext4 corruption within the VM itself.
> The VM itself runs CloudLinux 8 (4.18.0-553.40.1.lve.el8.x86_64 kernel).
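>
> For context, the backup flow boils down to an external disk-only
> snapshot plus a copy of the backing image; roughly like this (a
> simplified sketch, our tooling drives the equivalent libvirt/QMP
> calls, and $DOM/$DISK are placeholders):
>
> virsh snapshot-create-as $DOM backup --disk-only --atomic
> # ... copy the now read-only backing image to backup storage ...
> virsh blockcommit $DOM $DISK --active --pivot   # merge overlay, drop snapshot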
>
> These are some examples of the corruption we see:
> 1)
> kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1036: inode #19280823: comm lsphp: Directory block failed checksum
> kernel: EXT4-fs error (device sdc1): ext4_empty_dir:2801: inode #19280823: comm lsphp: Directory block failed checksum
> kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1036: inode #19280820: comm lsphp: Directory block failed checksum
> kernel: EXT4-fs error (device sdc1): ext4_empty_dir:2801: inode #19280820: comm lsphp: Directory block failed checksum
>
> 2)
> kernel: EXT4-fs error (device sdc1): ext4_lookup:1645: inode #49419787: comm lsphp: deleted inode referenced: 49422454
> kernel: EXT4-fs error (device sdc1): ext4_lookup:1645: inode #49419787: comm lsphp: deleted inode referenced: 49422454
> kernel: EXT4-fs error (device sdc1): ext4_lookup:1645: inode #49419787: comm lsphp: deleted inode referenced: 49422454
>
> 3)
> kernel: EXT4-fs error (device sdb1): ext4_validate_block_bitmap:384: comm kworker/u240:3: bg 308: bad block bitmap checksum
> kernel: EXT4-fs (sdb1): Delayed block allocation failed for inode 2513946 at logical offset 2 with max blocks 1 with error 74
> kernel: EXT4-fs (sdb1): This should not happen!! Data will be lost
> kernel: EXT4-fs (sdb1): Inode 2513946 (00000000265d63ca): i_reserved_data_blocks (1) not cleared!
> kernel: EXT4-fs (sdb1): error count since last fsck: 1
> kernel: EXT4-fs (sdb1): initial error at time 1747923211: ext4_validate_block_bitmap:384
> kernel: EXT4-fs (sdb1): last error at time 1747923211: ext4_validate_block_bitmap:384
> kernel: EXT4-fs (sdb1): error count since last fsck: 1
> kernel: EXT4-fs (sdb1): initial error at time 1747923211: ext4_validate_block_bitmap:384
> kernel: EXT4-fs (sdb1): last error at time 1747923211: ext4_validate_block_bitmap:384
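>
> (Side note: error 74 is EBADMSG, which ext4 returns on checksum
> failures, and the epoch timestamps in these lines decode with date;
> quick sketch, errno(1) is from moreutils:)
>
> errno 74              # -> EBADMSG 74 Bad message
> date -d @1747923211   # when the first/last error was recorded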
>
> 4)
> kernel: EXT4-fs (sdc1): error count since last fsck: 4
> kernel: EXT4-fs (sdc1): initial error at time 1746616017: ext4_validate_block_bitmap:384
> kernel: EXT4-fs (sdc1): last error at time 1746621676: ext4_mb_generate_buddy:808
>
>
> Now as a test we upgraded to a newer (backported) kernel, more
> specifically: 5.14.0-284.1101
> And after doing some backups again, we had another error:
>
> kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1073: inode #34752060: comm tar: Directory block failed checksum
> kernel: EXT4-fs warning (device sdc1): ext4_dirblock_csum_verify:405: inode #34752232: comm tar: No space for directory leaf checksum. Please run e2fsck -D.
> kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1073: inode #34752232: comm tar: Directory block failed checksum
> kernel: EXT4-fs warning (device sdc1): ext4_dirblock_csum_verify:405: inode #34752064: comm tar: No space for directory leaf checksum. Please run e2fsck -D.
> kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1073: inode #34752064: comm tar: Directory block failed checksum
> kernel: EXT4-fs warning (device sdc1): ext4_dirblock_csum_verify:405: inode #34752167: comm tar: No space for directory leaf checksum. Please run e2fsck -D.
> kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1073: inode #34752167: comm tar: Directory block failed checksum
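>
> We read the "Please run e2fsck -D" hint as an offline operation;
> something like this from a rescue environment (sketch, device name as
> in the log, filesystem unmounted):
>
> e2fsck -f /dev/sdc1    # full forced check first
> e2fsck -fD /dev/sdc1   # then rebuild/optimize the directory trees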
>
>
> So now we are wondering what could cause this corruption.
> - We have more VMs on the same kind of setup without seeing any
> corruption. The only difference is that those VMs run Debian, have
> smaller disks, and do not use quota.
> - If we disable backups/snapshots, no corruption is observed.
> - Even if we disable the qemu-guest-agent (so no fsfreeze is
> executed), the corruption still occurs (a manual freeze test is
> sketched below).
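>
> The manual test referenced above would look roughly like this inside
> the guest (hypothetical, we have not run it yet; the mountpoint is a
> placeholder):
>
> fsfreeze -f /mnt/data && sleep 5 && fsfreeze -u /mnt/data   # mimic qemu-ga freeze/thaw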
>
> For now at least, we only see the corruption on filesystems where
> quota is enabled (both usrjquota and usrquota).
> The filesystems are between 600GB and 2TB.
> And today I noticed that, because the filesystems are resized during
> setup, the journal size is only 64M (could this potentially be an
> issue?).
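>
> For comparison, this is how we read those values back (dumpe2fs and
> tune2fs; the device name is just an example):
>
> dumpe2fs -h /dev/sdc1 | grep -i journal            # journal size/backup info
> tune2fs -l /dev/sdc1 | egrep -i 'feature|quota'    # enabled features incl. quota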
>
> The big question in the whole story is: could this be an in-guest
> (ext4?) bug, or do we really need to look at the layer below
> (i.e. qemu/hypervisor)?
> If anybody has other ideas, feel free to share them, along with any
> additional things that could help troubleshoot the issue.
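>
> One host-side check we could add after each backup cycle, assuming
> qcow2 images (paths are placeholders), would be:
>
> qemu-img check /var/lib/libvirt/images/vm-disk.qcow2   # qcow2 metadata consistency
> guestfish --ro -a /var/lib/libvirt/images/vm-disk.qcow2 \
>   run : fsck ext4 /dev/sda1                            # offline fsck of the guest fs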
>
> Thanks
> Jean-Louis
