Message-ID: <dfd63baf-6eb5-435e-afb4-db7ea37b13a1@dupond.be>
Date: Thu, 15 Jan 2026 14:39:12 +0100
From: Jean-Louis Dupond <jean-louis@...ond.be>
To: Friedrich Weber <f.weber@...xmox.com>, linux-ext4@...r.kernel.org
Subject: Re: ext4 metadata corruption - snapshot related?

On 15/01/2026 14:27, Friedrich Weber wrote:
> Hi,
>
> On 12/06/2025 16:43, Jean-Louis Dupond wrote:
>> Hi,
>>
>> We have around 200 VMs running on QEMU (on an AlmaLinux 9-based hypervisor).
>> All of those VMs were recently migrated from physical machines.
>>
>> But when we enable backups on those VMs (which trigger snapshots), we notice some weird/random ext4 corruption within the VM itself.
>> The VM itself runs CloudLinux 8 (4.18.0-553.40.1.lve.el8.x86_64 kernel).
> I'm currently looking into an issue that sounds similar (some more details below)
> and wanted to ask: Did you have any luck in debugging this further?
Unfortunately we still haven't found the root cause of this.
We thought it could be
https://gitlab.com/qemu-project/qemu/-/commit/8eeaa706ba73251063cb80d87ae838d2d5b08e9a,
but that fix didn't solve it on our end.
> The affected user is running different QEMU+KVM VMs on Proxmox VE on different
> hosts, and occasionally (every few weeks), a VM will report some kind of ext4
> metadata corruption. Two examples from different VMs:
>
> kernel: EXT4-fs error (device dm-1): ext4_validate_block_bitmap:420: comm kworker/u24:3: bg 1923: bad block bitmap checksum
> kernel: EXT4-fs (dm-1): Delayed block allocation failed for inode 15601703 at logical offset 0 with max blocks 517 with error 74
> kernel: EXT4-fs (dm-1): This should not happen!! Data will be lost
Exactly the same as we see.
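In case it helps anyone reproducing this: a trivial watcher that flags
the first occurrence inside the guest (a minimal sketch, assuming
util-linux dmesg with --follow; run as root):

#!/usr/bin/env python3
# Watch the guest kernel log and stop at the first ext4 error.
# Minimal sketch: assumes util-linux dmesg with --follow; run as root.
import subprocess

proc = subprocess.Popen(["dmesg", "--follow"],
                        stdout=subprocess.PIPE, text=True)
for line in proc.stdout:
    if "EXT4-fs error" in line:
        print("corruption hit:", line.strip())
        proc.terminate()
        break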
>
> kernel: EXT4-fs error (device dm-1): ext4_validate_block_bitmap:420: comm logrotate: bg 30: bad block bitmap checksum
> kernel: EXT4-fs error (device dm-1) in ext4_mb_clear_bb:6170: Filesystem failed CRC
>
> Similar to your case, so far no actual data corruption has been noticed.
We have cases where we actually saw (MySQL) data corruption.
So we shifted our focus a bit towards the QEMU side, as the issue might
originate there.
>
> The hosts are on different kernel versions, e.g. downstream kernels based on
> 6.5, 6.11 or 6.14. QEMU versions also differ, some downstream builds are based
> on 9.2.0, some on 10.0.2.
>
> All disks of affected VMs are backed by SAN storages (each VM disk is an LVM LV
> in raw format on top of a LUN accessed via iSCSI+multipath). We initially
> suspected some issue with the SAN and ran some tests in that direction, but so
> far didn't notice anything off.
We had the issue on local storage as well, so it's most likely not 
caused by the SAN.
>
> So far, affected VMs have been Ubuntu and Debian VMs, with ext4 on top of LVM.
> Since the issue happens so sporadically, it's difficult to see a pattern
> separating affected/unaffected VMs. As you mention quota as a potential factor,
> I checked some recently affected VMs, none had quota enabled. Disk size doesn't
> seem to be a factor either, some affected disks are ~15GB, some are multi-TB
> disks.
>
> No storage-level snapshots are taken of the VMs, but daily backups are enabled
> (with fsfreeze via qemu-guest-agent).
Do the backups trigger a snapshot? Because it all seems to correlate 
with taking a snapshot.
We were able to reproduce it much more quickly by simply running a 
snapshot create/delete loop, as sketched below.
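For reference, a minimal sketch of such a loop, assuming a
libvirt-managed guest (the domain name, iteration count and timings are
placeholders, not our exact script):

#!/usr/bin/env python3
# Snapshot create/delete stress loop, roughly what we used to trigger
# the corruption faster. Minimal sketch: assumes a libvirt-managed
# guest; the domain name, iteration count and sleep are placeholders.
import subprocess
import time

DOMAIN = "testvm"  # placeholder guest name

for i in range(1000):
    snap = f"stress-{i}"
    subprocess.run(["virsh", "snapshot-create-as", DOMAIN, snap],
                   check=True)
    time.sleep(5)  # give the guest time to do I/O under the snapshot
    subprocess.run(["virsh", "snapshot-delete", DOMAIN, snap],
                   check=True)
    # check the guest's dmesg for "EXT4-fs error" between iterations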
>
> Perhaps there is an unfortunate interaction between host/guest kernel and QEMU
> that could trigger this and that only affects ext4 metadata for some reason?
It also happens without the guest agent, so no fsfreeze is executed.
In that case the guest should not even be aware a snapshot was taken.
If you run an fsfreeze/fsthaw loop on its own, no corruption is 
observed; a sketch of that control test follows below.
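Along those lines, a minimal sketch of the freeze/thaw-only control
test, assuming a libvirt-managed guest with qemu-guest-agent running
(the domain name and timings are placeholders):

#!/usr/bin/env python3
# Freeze/thaw-only control test: exercise the guest-agent fsfreeze path
# without taking any snapshot. Minimal sketch: assumes a libvirt-managed
# guest running qemu-guest-agent; name and timings are placeholders.
import subprocess
import time

DOMAIN = "testvm"  # placeholder guest name

for _ in range(1000):
    subprocess.run(["virsh", "domfsfreeze", DOMAIN], check=True)
    time.sleep(1)  # filesystems stay frozen briefly, as during a backup
    subprocess.run(["virsh", "domfsthaw", DOMAIN], check=True)
    time.sleep(1)
# With this loop alone we saw no ext4 errors in the guest.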

It might still be some QEMU bug after all ... but I believe we will 
only find out once it's fixed.
>
> Best wishes,
>
> Friedrich
>
Thanks
Jean-Louis
