Message-ID: <ce935905-cc96-4a12-8779-5380c535b9b4@proxmox.com>
Date: Thu, 15 Jan 2026 14:27:47 +0100
From: Friedrich Weber <f.weber@...xmox.com>
To: Jean-Louis Dupond <jean-louis@...ond.be>, linux-ext4@...r.kernel.org
Subject: Re: ext4 metadata corruption - snapshot related?

Hi,

On 12/06/2025 16:43, Jean-Louis Dupond wrote:
> Hi,
> 
> We have around 200 VM's running on qemu (on a AlmaLinux 9 based hypervisor).
> All those VM's are migrated from physical machines recently.
> 
> But when we enable backups on those VM's (which triggers snapshots), we notice some weird/random ext4 corruption within the VM itself.
> The VM itself runs CloudLinux 8 (4.18.0-553.40.1.lve.el8.x86_64 kernel).

I'm currently looking into an issue that sounds similar (some more details below)
and wanted to ask: Did you have any luck in debugging this further?

The affected user is running different QEMU+KVM VMs on Proxmox VE on different
hosts, and occasionally (every few weeks), a VM will report some kind of ext4
metadata corruption. Two examples from different VMs:

kernel: EXT4-fs error (device dm-1): ext4_validate_block_bitmap:420: comm kworker/u24:3: bg 1923: bad block bitmap checksum
kernel: EXT4-fs (dm-1): Delayed block allocation failed for inode 15601703 at logical offset 0 with max blocks 517 with error 74
kernel: EXT4-fs (dm-1): This should not happen!! Data will be lost

kernel: EXT4-fs error (device dm-1): ext4_validate_block_bitmap:420: comm logrotate: bg 30: bad block bitmap checksum
kernel: EXT4-fs error (device dm-1) in ext4_mb_clear_bb:6170: Filesystem failed CRC
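
Since the errors are so sporadic, it may help to collect them across VMs in a structured form. Below is a minimal sketch, assuming Python 3, that parses "EXT4-fs error" lines in the two formats shown above. The names parse_ext4_error and SAMPLE_LINES are hypothetical and only for illustration, and the regular expression is derived from these examples alone, so other ext4 message formats may not match.

```python
import re
from collections import Counter

# Matches the two "EXT4-fs error" shapes seen in the examples above:
#   EXT4-fs error (device X): func:line: comm NAME: message
#   EXT4-fs error (device X) in func:line: message
ERROR_RE = re.compile(
    r"EXT4-fs error \(device (?P<dev>[^)]+)\)"
    r"(?::| in) (?P<func>\w+):(?P<line>\d+): "
    r"(?:comm (?P<comm>.+?): )?(?P<msg>.*)"
)
BG_RE = re.compile(r"\bbg (\d+)\b")

# Log lines copied from the examples above.
SAMPLE_LINES = [
    "kernel: EXT4-fs error (device dm-1): ext4_validate_block_bitmap:420: "
    "comm kworker/u24:3: bg 1923: bad block bitmap checksum",
    "kernel: EXT4-fs (dm-1): This should not happen!! Data will be lost",
    "kernel: EXT4-fs error (device dm-1): ext4_validate_block_bitmap:420: "
    "comm logrotate: bg 30: bad block bitmap checksum",
    "kernel: EXT4-fs error (device dm-1) in ext4_mb_clear_bb:6170: "
    "Filesystem failed CRC",
]


def parse_ext4_error(text):
    """Return a dict describing an EXT4-fs error line, or None if no match."""
    m = ERROR_RE.search(text)
    if m is None:
        return None
    rec = m.groupdict()
    rec["line"] = int(rec["line"])
    # The block group number is only present in some messages.
    bg = BG_RE.search(rec["msg"])
    rec["bg"] = int(bg.group(1)) if bg else None
    return rec


if __name__ == "__main__":
    records = [r for r in map(parse_ext4_error, SAMPLE_LINES) if r]
    # Count which kernel function reported the error.
    for func, count in Counter(r["func"] for r in records).items():
        print(func, count)
```

Feeding the journal output of each affected VM through such a parser would make it easier to compare reporting functions and block groups between incidents. As a side note, "error 74" in the delayed allocation message is EBADMSG on Linux, which ext4 uses for checksum failures, so that message is most likely a consequence of the bad bitmap checksum rather than a separate problem.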

Similar to your case, so far no actual data corruption has been noticed.

The hosts are on different kernel versions, e.g. downstream kernels based on
6.5, 6.11 or 6.14. QEMU versions also differ: some downstream builds are based
on 9.2.0, some on 10.0.2.

All disks of affected VMs are backed by SAN storages (each VM disk is an LVM LV
in raw format on top of a LUN accessed via iSCSI+multipath). We initially
suspected some issue with the SAN and ran some tests in that direction, but so
far didn't notice anything off.

So far, affected VMs have been Ubuntu and Debian VMs, with ext4 on top of LVM.
Since the issue happens so sporadically, it's difficult to see a pattern
separating affected/unaffected VMs. As you mention quota as a potential factor,
I checked some recently affected VMs, and none had quota enabled. Disk size
doesn't seem to be a factor either: some affected disks are ~15GB, some are
multi-TB disks.

No storage-level snapshots are taken of the VMs, but daily backups are enabled
(with fsfreeze via qemu-guest-agent).

Perhaps there is an unfortunate interaction between host/guest kernel and QEMU
that could trigger this and that only affects ext4 metadata for some reason?

Best wishes,

Friedrich

