linux-ext4 - Re: ext4 metadata corruption


Message-ID: <23e0b748-57b4-452c-9a39-04f941aef996@dupond.be>
Date: Wed, 2 Jul 2025 17:32:43 +0200
From: Jean-Louis Dupond <jean-louis@...ond.be>
To: Theodore Ts'o <tytso@....edu>
Cc: linux-ext4@...r.kernel.org
Subject: Re: ext4 metadata corruption - snapshot related?


On 2/07/2025 16:37, Theodore Ts'o wrote:
> On Wed, Jul 02, 2025 at 03:43:25PM +0200, Jean-Louis Dupond wrote:
>> We updated a machine to a newer 6.15.2-1.el8.elrepo.x86_64 kernel, and the
>> same? bug reoccurred after some time:
>>
>> The error was the following:
>> Jul 02 11:03:35 xxxxx kernel: EXT4-fs error (device sdd1): ext4_lookup:1791:
>> inode #44962812: comm imap: deleted inode referenced: 44997932
>> Jul 02 11:03:35 xxxxx kernel: EXT4-fs error (device sdd1): ext4_lookup:1791:
>> inode #44962812: comm imap: deleted inode referenced: 44997932
>> Jul 02 11:03:35 xxxxx kernel: EXT4-fs error (device sdd1): ext4_lookup:1791:
>> inode #44962812: comm imap: deleted inode referenced: 44997932
>> Jul 02 11:04:03 xxxxx kernel: EXT4-fs error (device sdd1): ext4_lookup:1791:
>> inode #44962812: comm imap: deleted inode referenced: 44997932
>>
>> Any ideas on how this could be debugged further?
> If it's correlated to snapshots, then I'd certainly be looking at
> potential bugs on the hypervisor.  We've also had a bug
> where people were trying to look at bugs on the guest kernel, but the
> bug ended up being root caused to a bug on the host kernel.
That is certainly something we are investigating.
But for some reason we only see the bug on our CloudLinux machines, not
on our Debian machines (of which we have 20 times as many on the same
platform).
The fact that it only happens while snapshots are running could also be
related to the short freeze during snapshotting, which might trigger a
race somewhere when I/O is flushed after the freeze.
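
One way to test the snapshot/freeze theory would be to check how many of the
ext4 errors fall shortly after a snapshot. Below is a minimal Python sketch of
that check, under a few assumptions: the log lines are unwrapped syslog-style
lines as in the excerpt above, the snapshot start times come from the
hypervisor side (the value used here is made up for illustration), and the
10-minute window is an arbitrary choice. Note that ext4 reports these errors
when the bad metadata is read, which can be well after the moment it was
damaged, so a missing correlation would not rule the theory out.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year):
    """Parse the leading 'Jul 02 11:03:35' timestamp; syslog omits the year."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def errors_near_snapshots(log_lines, snapshot_times,
                          window=timedelta(minutes=10), year=2025):
    """Return (snapshot, error_time, line) for each EXT4-fs error that was
    logged within `window` after the start of some snapshot."""
    hits = []
    for line in log_lines:
        if "EXT4-fs error" not in line:
            continue
        ts = parse_syslog_time(line, year)
        for snap in snapshot_times:
            if timedelta(0) <= ts - snap <= window:
                hits.append((snap, ts, line))
                break  # count each error line at most once
    return hits


if __name__ == "__main__":
    logs = [
        "Jul 02 11:03:35 xxxxx kernel: EXT4-fs error (device sdd1): "
        "ext4_lookup:1791: inode #44962812: comm imap: "
        "deleted inode referenced: 44997932",
    ]
    # Hypothetical snapshot start time, for illustration only.
    snapshots = [datetime(2025, 7, 2, 11, 0, 0)]
    for snap, ts, line in errors_near_snapshots(logs, snapshots):
        print(f"error at {ts} is {ts - snap} after snapshot at {snap}")
```
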
>
> If moving from 4.18 Cloudlinux 8 kernel to a 6.15.2 RHEL8 kernel shows
> the same problem, then it does suggest that the problem isn't with the
> guest kernel, but rather in the part of the setup which didn't change
> (e.g., the host kernel and hypervisor).
Well, it still shows 'corruption', but the message sometimes differs.
For example, previously we also had error messages such as:
htree_dirblock_to_tree:1036: inode #19280823: comm lsphp: Directory
block failed checksum
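
Since the messages differ, it may help to group them by reporting function,
inode and process to see whether the same directories keep coming up. The
following is a rough Python sketch, assuming one unwrapped error per line in
the format shown in this thread; the regular expression is derived only from
those two sample messages and may not match other ext4 error formats.

```python
import re
from collections import Counter

# Pattern based on the sample messages in this thread (an assumption,
# not an exhaustive description of ext4 error formats).
EXT4_ERROR = re.compile(
    r"EXT4-fs error \(device (?P<dev>[^)]+)\): "
    r"(?P<func>\w+):(?P<line>\d+): "
    r"inode #(?P<inode>\d+): comm (?P<comm>[^:]+): (?P<msg>.*)"
)


def summarize(lines):
    """Count errors per (device, function, inode, comm)."""
    counts = Counter()
    for text in lines:
        m = EXT4_ERROR.search(text)
        if m:
            key = (m["dev"], m["func"], int(m["inode"]), m["comm"])
            counts[key] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Jul 02 11:03:35 xxxxx kernel: EXT4-fs error (device sdd1): "
        "ext4_lookup:1791: inode #44962812: comm imap: "
        "deleted inode referenced: 44997932",
        "Jul 02 11:04:03 xxxxx kernel: EXT4-fs error (device sdd1): "
        "ext4_lookup:1791: inode #44962812: comm imap: "
        "deleted inode referenced: 44997932",
    ]
    for key, n in summarize(sample).most_common():
        print(n, key)
```
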

The strange thing here is that we only observe ext4 metadata corruption.
So far we haven't had (or seen) any corruption within files.
So if it were a hypervisor issue, for example, I would expect random
corruption and not only metadata corruption.
>
> Without a whole lot more details about what your workload might be,
> what the host OS software might be, etc., it's really hard to make any
> further suggestions.  Are you running this on some kind of cloud
> infrastructure (e.g., Microsoft Azure, Amazon AWS, Google Cloud, etc.)?
> Something else?  Have you tried running your workload on some kind of
> alternate infrastructure and see if the problem gets solved if you use
> a different Cloud provider?
The workload is a typical all-in-one web server with mail and database,
so MySQL/PHP/Apache/Postfix/...
This runs on CloudLinux 8 with the LVE kernel module (which could of
course also be a cause).

We are running those VMs on our in-house platform based on QEMU + libvirt.
The hypervisors are running AlmaLinux 9 with QEMU 9.1.0.

>
> 						- Ted
Thanks for having a look!
Jean-Louis