Message-ID: <20250922124128.GD481137@mit.edu>
Date: Mon, 22 Sep 2025 08:41:28 -0400
From: "Theodore Ts'o" <tytso@....edu>
To: Andrea Biardi <Andrea.Biardi@...visolutions.com>
Cc: linux-ext4 <linux-ext4@...r.kernel.org>
Subject: Re: ext4: failed to convert unwritten extents (6.12.31 regression)
On Mon, Sep 22, 2025 at 11:11:15AM +0000, Andrea Biardi wrote:
>
> The CI process of a product that I'm working on involves the creation of a temporary KVM VM which boots a cdrom image containing a custom kernel + busybox in order to flash a filesystem image to /dev/vda, then shuts it down and exports the VM (that's my "deliverable" for the next stage).
>
> [ 174.903010] I/O error, dev vda, sector 167922 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0
> [ 174.903023] I/O error, dev vda, sector 167938 op 0x1:(WRITE) flags 0x4000 phys_seg 254 prio class 0
> [ 174.903027] I/O error, dev vda, sector 169970 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0
> [ 174.903031] EXT4-fs warning (device vda1): ext4_end_bio:353: I/O error 10 writing to inode 16 starting block 84985)
The failure is coming from the block device, which in your case is
the virtio device.  The only causes for this are:
1) An underlying hardware failure
2) A bug in the block virtio device
3) A bug in the VMM (I assume qemu in your case).
The bug might be triggered by a change in the behavior of ext4, but
ultimately, there is nothing that a file system can do that could
result in an I/O error other than (1), (2), or (3), above.
The only thing I can suggest is to do a full bisection between 6.12.30
and 6.12.31.  Or take a look at the commits that landed between
6.12.30 and 6.12.31, focusing on changes in /drivers/block,
/drivers/virtio, and /block. I doubt that it's /block, given that no
one else is reporting it.
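If it helps, something along these lines (hypothetical commands,
assuming you have a checkout of the stable tree with the v6.12.30 and
v6.12.31 tags) would narrow down the candidate list, or drive the
bisection:

    # list the commits between the two stable releases that touched
    # the block layer and the virtio drivers
    git log --oneline v6.12.30..v6.12.31 -- block drivers/block drivers/virtio

    # or do the full bisection
    git bisect start v6.12.31 v6.12.30
    # build and boot each kernel git bisect hands you, run your CI
    # flashing step, then mark the result:
    git bisect good    # or: git bisect bad
    # repeat until git bisect names the first bad commit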
One other thing you might try is changing your qemu configuration to
use virtio-scsi or NVMe emulation.  Most commercial cloud products
(e.g., Amazon, Azure, Google Cloud) tend to use emulated SCSI and
NVMe instead of virtio-blk.  It's true that virtio-blk is more
efficient, but the virtual SCSI and NVMe devices are more similar to
Real Hardware(tm), which is why commercial cloud products tend to use
them; they tend to be easier for companies doing "lift and shift".
As a result, it's likely that issues with virtio-blk might not be
noticed, given that it gets less testing.
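If you're invoking qemu directly, something roughly like this should
switch the disk over (a sketch only; file names and ids are
placeholders, and your custom kernel will need the matching driver,
e.g. CONFIG_SCSI_VIRTIO or CONFIG_BLK_DEV_NVME, built in):

    # current virtio-blk setup (guest sees /dev/vda)
    qemu-system-x86_64 ... \
        -drive file=disk.img,if=none,id=d0,format=raw \
        -device virtio-blk-pci,drive=d0

    # virtio-scsi instead (guest sees /dev/sda)
    qemu-system-x86_64 ... \
        -drive file=disk.img,if=none,id=d0,format=raw \
        -device virtio-scsi-pci -device scsi-hd,drive=d0

    # emulated NVMe instead (guest sees /dev/nvme0n1)
    qemu-system-x86_64 ... \
        -drive file=disk.img,if=none,id=d0,format=raw \
        -device nvme,drive=d0,serial=test0

If your CI goes through libvirt instead, the equivalent change in the
domain XML will also mean the device node in the guest changes, so
the flashing script will need to be adjusted accordingly.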
I do regular regression testing of ext4 using Google Cloud[1], and it
uses either SCSI or NVMe devices (depending on whether the VM type
supports SCSI or NVMe --- the more expensive, higher-performance VM's
tend to use NVMe because it allows better performance for the
high-performance block devices).  I *can* run kvm-xfstests using
virtio-blk, but when gce-xfstests takes 2-3 hours of wall clock time
(running on a dozen VM's in parallel), versus 24 hours if I were to
run the identical tests using kvm-xfstests, there's a reason why I
rarely use kvm-xfstests/qemu-xfstests.  If I'm someplace without
network access, and all I have is qemu using macOS's Hypervisor
Framework (hvf) on my MacBook Air, sure, I'll use qemu-xfstests.  But
it's not something I'll do unless I don't have any other alternatives.
[1] https://thunk.org/gce-xfstests
Cheers,
- Ted