lists.openwall.net - Open Source and information security mailing list archives
Message-ID: <7b9c7a42-de7b-4408-91a6-1c35e14cc380@dupond.be>
Date: Wed, 2 Jul 2025 15:43:25 +0200
From: Jean-Louis Dupond <jean-louis@...ond.be>
To: linux-ext4@...r.kernel.org
Subject: Re: ext4 metadata corruption - snapshot related?

We updated a machine to a newer 6.15.2-1.el8.elrepo.x86_64 kernel, and what looks like the same bug reoccurred after some time.

The error was the following:

Jul 02 11:03:35 xxxxx kernel: EXT4-fs error (device sdd1): ext4_lookup:1791: inode #44962812: comm imap: deleted inode referenced: 44997932
Jul 02 11:03:35 xxxxx kernel: EXT4-fs error (device sdd1): ext4_lookup:1791: inode #44962812: comm imap: deleted inode referenced: 44997932
Jul 02 11:03:35 xxxxx kernel: EXT4-fs error (device sdd1): ext4_lookup:1791: inode #44962812: comm imap: deleted inode referenced: 44997932
Jul 02 11:04:03 xxxxx kernel: EXT4-fs error (device sdd1): ext4_lookup:1791: inode #44962812: comm imap: deleted inode referenced: 44997932

Any ideas on how this could be debugged further?

Thanks
Jean-Louis

On 12/06/2025 16:43, Jean-Louis Dupond wrote:
> Hi,
>
> We have around 200 VMs running on qemu (on an AlmaLinux 9-based
> hypervisor).
> All of those VMs were recently migrated from physical machines.
>
> But when we enable backups on those VMs (which triggers snapshots),
> we notice some weird/random ext4 corruption within the VM itself.
> The VM itself runs CloudLinux 8 (4.18.0-553.40.1.lve.el8.x86_64 kernel).
>
> These are some examples of the corruption we see:
>
> 1)
> kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1036: inode #19280823: comm lsphp: Directory block failed checksum
> kernel: EXT4-fs error (device sdc1): ext4_empty_dir:2801: inode #19280823: comm lsphp: Directory block failed checksum
> kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1036: inode #19280820: comm lsphp: Directory block failed checksum
> kernel: EXT4-fs error (device sdc1): ext4_empty_dir:2801: inode #19280820: comm lsphp: Directory block failed checksum
>
> 2)
> kernel: EXT4-fs error (device sdc1): ext4_lookup:1645: inode #49419787: comm lsphp: deleted inode referenced: 49422454
> kernel: EXT4-fs error (device sdc1): ext4_lookup:1645: inode #49419787: comm lsphp: deleted inode referenced: 49422454
> kernel: EXT4-fs error (device sdc1): ext4_lookup:1645: inode #49419787: comm lsphp: deleted inode referenced: 49422454
>
> 3)
> kernel: EXT4-fs error (device sdb1): ext4_validate_block_bitmap:384: comm kworker/u240:3: bg 308: bad block bitmap checksum
> kernel: EXT4-fs (sdb1): Delayed block allocation failed for inode 2513946 at logical offset 2 with max blocks 1 with error 74
> kernel: EXT4-fs (sdb1): This should not happen!! Data will be lost
> kernel: EXT4-fs (sdb1): Inode 2513946 (00000000265d63ca): i_reserved_data_blocks (1) not cleared!
> kernel: EXT4-fs (sdb1): error count since last fsck: 1
> kernel: EXT4-fs (sdb1): initial error at time 1747923211: ext4_validate_block_bitmap:384
> kernel: EXT4-fs (sdb1): last error at time 1747923211: ext4_validate_block_bitmap:384
> kernel: EXT4-fs (sdb1): error count since last fsck: 1
> kernel: EXT4-fs (sdb1): initial error at time 1747923211: ext4_validate_block_bitmap:384
> kernel: EXT4-fs (sdb1): last error at time 1747923211: ext4_validate_block_bitmap:384
>
> 4)
> kernel: EXT4-fs (sdc1): error count since last fsck: 4
> kernel: EXT4-fs (sdc1): initial error at time 1746616017: ext4_validate_block_bitmap:384
> kernel: EXT4-fs (sdc1): last error at time 1746621676: ext4_mb_generate_buddy:808
>
> As a test we then upgraded to a newer (backported) kernel, more
> specifically 5.14.0-284.1101.
> And after doing some backups again, we had another error:
>
> kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1073: inode #34752060: comm tar: Directory block failed checksum
> kernel: EXT4-fs warning (device sdc1): ext4_dirblock_csum_verify:405: inode #34752232: comm tar: No space for directory leaf checksum. Please run e2fsck -D.
> kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1073: inode #34752232: comm tar: Directory block failed checksum
> kernel: EXT4-fs warning (device sdc1): ext4_dirblock_csum_verify:405: inode #34752064: comm tar: No space for directory leaf checksum. Please run e2fsck -D.
> kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1073: inode #34752064: comm tar: Directory block failed checksum
> kernel: EXT4-fs warning (device sdc1): ext4_dirblock_csum_verify:405: inode #34752167: comm tar: No space for directory leaf checksum. Please run e2fsck -D.
> kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1073: inode #34752167: comm tar: Directory block failed checksum
>
> So now we are wondering what could cause this corruption.
> - We have more VMs on the same kind of setup without seeing any
>   corruption. The only difference there is that those VMs run Debian,
>   have smaller disks, and do not use quota.
> - If we disable backups/snapshots, no corruption is observed.
> - Even if we disable the qemu-guest-agent (so no fsfreeze is
>   executed), the corruption still occurs.
>
> For now at least, we only see the corruption on filesystems where
> quota is enabled (both usrjquota and usrquota).
> The filesystems are between 600GB and 2TB.
> And today I noticed (as the filesystems are resized during setup) that
> the journal size is only 64M - could this potentially be an issue?
>
> The big question in the whole story is: could it be an in-guest
> (ext4?) bug/issue, or do we really need to look into the layer below
> (i.e. qemu/hypervisor)?
> If somebody has other ideas, feel free to share - as well as any
> additional things that could help to troubleshoot the issue.
>
> Thanks
> Jean-Louis
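Two of the open questions above - whether a 64M journal is unusually small, and what state the flagged inodes are actually in - can be inspected read-only with standard e2fsprogs tools. A minimal sketch follows; it uses a scratch loopback image so it runs without root, and the image name, sizes, and inode numbers other than 8 and 2 are placeholders. On the affected guest the same debugfs/e2fsck commands would be pointed at the real device (e.g. /dev/sdd1, unmounted) with the inode numbers from the logs:

```shell
# Scratch ext4 image standing in for the real device (no root needed).
truncate -s 512M disk.img
mkfs.ext4 -q -F disk.img

# Journal size: on ext4 the journal lives in reserved inode 8, so that
# inode's size is the on-disk journal size (the mail asks whether 64M
# is too small for a 600GB-2TB filesystem).
debugfs -R "stat <8>" disk.img 2>/dev/null | grep -o 'Size: [0-9]*'

# Inspect an inode the way the "deleted inode referenced" errors would
# be chased. Inode 2 (the root directory) is a placeholder; on the real
# filesystem use 44962812 / 44997932 from the logs and look at the
# links count and deletion time in the output.
debugfs -R "stat <2>" disk.img 2>/dev/null | head -n 4

# Read-only consistency check (-n answers "no" to every repair); run
# this against the real device only while it is unmounted.
e2fsck -fn disk.img
```

debugfs opens the filesystem read-only unless -w is given, so it is also safe to run against a qemu snapshot of the disk; comparing its output on a fresh snapshot versus the live device might help narrow down whether the corruption appears at the qemu/hypervisor layer or inside the guest.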