linux-ext4 - Re: ext4 metadata corruption

Message-ID: <20250702143755.GB3471@mit.edu>
Date: Wed, 2 Jul 2025 10:37:55 -0400
From: "Theodore Ts'o" <tytso@....edu>
To: Jean-Louis Dupond <jean-louis@...ond.be>
Cc: linux-ext4@...r.kernel.org
Subject: Re: ext4 metadata corruption - snapshot related?

On Wed, Jul 02, 2025 at 03:43:25PM +0200, Jean-Louis Dupond wrote:
> We updated a machine to a newer 6.15.2-1.el8.elrepo.x86_64 kernel, and the
> same? bug reoccurred after some time:
> 
> The error was the following:
> Jul 02 11:03:35 xxxxx kernel: EXT4-fs error (device sdd1): ext4_lookup:1791:
> inode #44962812: comm imap: deleted inode referenced: 44997932
> Jul 02 11:03:35 xxxxx kernel: EXT4-fs error (device sdd1): ext4_lookup:1791:
> inode #44962812: comm imap: deleted inode referenced: 44997932
> Jul 02 11:03:35 xxxxx kernel: EXT4-fs error (device sdd1): ext4_lookup:1791:
> inode #44962812: comm imap: deleted inode referenced: 44997932
> Jul 02 11:04:03 xxxxx kernel: EXT4-fs error (device sdd1): ext4_lookup:1791:
> inode #44962812: comm imap: deleted inode referenced: 44997932
> 
> Any idea's on how this could be debugged further?

If it's correlated to snapshots, then I'd certainly be looking at
potential bugs in the hypervisor.  We've also had a case where people
were looking for bugs in the guest kernel, but the problem ended up
being root-caused to a bug in the host kernel.

If moving from a 4.18 Cloudlinux 8 kernel to a 6.15.2 RHEL8 kernel
shows the same problem, then it does suggest that the problem isn't
with the guest kernel, but rather in the part of the setup which
didn't change (e.g., the host kernel and hypervisor).

Without a whole lot more details about what your workload might be,
what the host OS software might be, etc., it's really hard to make any
further suggestions.  Are you running this on some kind of cloud
infrastructure (e.g., Microsoft Azure, Amazon AWS, Google Cloud, etc.)?
Something else?  Have you tried running your workload on some kind of
alternate infrastructure to see if the problem goes away when you use
a different cloud provider?

						- Ted