linux-ext4 - [Bug 102731] I have a cough.

Message-ID: <bug-102731-13602-iGtpT7UGll@https.bugzilla.kernel.org/>
Date:	Mon, 28 Sep 2015 17:06:41 +0000
From:	bugzilla-daemon@...zilla.kernel.org
To:	linux-ext4@...r.kernel.org
Subject: [Bug 102731] I have a cough.

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #13 from Theodore Tso <tytso@....edu> ---
So it's been 12 days, and previously when you were using the Debian 3.16
kernel, it was triggering once every four days, right?  Can I assume that your
silence indicates that you haven't seen a problem to date?

If so, then it really does seem that it might be an interaction between LVM/MD
and KVM.
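
As a rough sanity check on that inference, here is a minimal sketch of the
arithmetic.  It assumes, purely for illustration, that the corruption events
are independent and arrive at a steady average rate of one per four days (a
Poisson model); nothing in this bug establishes that, and the function name is
hypothetical.

```python
import math


def prob_no_events(days_observed: float, mean_days_between_events: float) -> float:
    """Poisson probability of seeing zero events over days_observed.

    Assumes independent events at a constant average rate, which is an
    illustrative assumption and not something measured on this system.
    """
    expected_events = days_observed / mean_days_between_events
    return math.exp(-expected_events)


# 12 quiet days against a historical rate of one failure every 4 days.
p = prob_no_events(12, 4)
print(f"{p:.3f}")  # prints 0.050
```

Under that assumption, a clean 12-day run would happen by chance only about
5% of the time if nothing had changed, so the silence is suggestive but not
conclusive.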

So if that's the case, then the next thing to ask is to try to figure out what
might be the triggering cause.   A couple of things come to mind:

1) Some failure to properly handle a flush cache command being sent to the MD
device.  This, combined with either a power failure or a crash of the guest OS
(depending on how KVM is configured), might explain a block update getting
lost.   The fact that the block bitmap is out of sync with the block group
descriptor is consistent with this failure.  However, if you were seeing
failures once every four days, that would imply that the guest OS and/or host
OS was crashing at about that frequency, and you haven't reported that.

2) Some kind of race between a 4k write and a RAID1 resync leading to a block
write getting lost.  Again, the reported data corruption is consistent with
this theory --- but this also requires the guest OS going down due to some kind
of kernel crash or KVM/qemu shutdown and/or host OS crash / power failure, as
in (1) above.  If you weren't seeing such crashes once every four days or so,
then this isn't a likely explanation.

3)  Some kind of corruption caused by the TRIM command being sent to the
RAID/MD device, possibly racing with a block bitmap update.  This could be
caused either by the file system being mounted with the -o discard mount
option, or by fstrim getting run out of cron, or by e2fsck explicitly being
asked to discard unused blocks (with the "-E discard" option).

4)  Some kind of bug which happens rarely either in qemu, the host kernel or
the guest kernel depending on how it communicates with the virtual disk
(e.g., virtio, scsi, ide).   Virtio is the most likely use case, and so
trying to change to use scsi emulation might be interesting.  (OTOH, if the
problem is specific to the MD layer, then this possibility is less likely.)
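
For the RAID1 resync theory in (2), one way to gather evidence is to record
whether an MD array was resyncing around the time corruption shows up.  The
sketch below scans text in the /proc/mdstat layout for an in-progress resync
or recovery line.  The sample text and the function name are made up for
illustration, and the parsing is deliberately loose.

```python
import re


def arrays_resyncing(mdstat_text):
    """Return names of MD arrays whose status shows a resync or recovery line."""
    busy = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            # A new "mdN : ..." line starts the entry for the next array.
            current = header.group(1)
            continue
        if current and re.search(r"\b(resync|recovery)\s*=", line):
            busy.append(current)
    return busy


# Illustrative sample only; on a real host read /proc/mdstat instead.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      976630464 blocks super 1.2 [2/2] [UU]
      [==>..................]  resync = 12.6% (123456/976630464) finish=127.5min speed=100000K/sec

md1 : active raid1 sdb2[1] sda2[0]
      1048576 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""

print(arrays_resyncing(SAMPLE_MDSTAT))  # prints ['md0']
```

Logging this output periodically on the host would show whether the
corruption reports line up with resync activity.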

So as far as #3 is concerned, can you check to see if you had fstrim enabled,
or are mounting the file system with -o discard?
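
A minimal sketch of the mount-option half of that check: it scans text in the
/proc/mounts format for ext4 file systems whose options include "discard".
The sample data and the function name are hypothetical.  It does not cover
fstrim being run from cron or e2fsck's "-E discard"; those have to be checked
separately.

```python
def ext4_mounts_with_discard(mounts_text):
    """Return (device, mountpoint) pairs for ext4 mounts using the discard option."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        # Exact match on the option name, so "nodiscard" is not counted.
        if fstype == "ext4" and "discard" in options.split(","):
            hits.append((device, mountpoint))
    return hits


# Illustrative sample only; device names are made up.
SAMPLE_MOUNTS = """\
/dev/mapper/vg0-root / ext4 rw,relatime,discard,data=ordered 0 0
/dev/mapper/vg0-home /home ext4 rw,relatime,data=ordered 0 0
tmpfs /run tmpfs rw,nosuid,noexec 0 0
"""

print(ext4_mounts_with_discard(SAMPLE_MOUNTS))
# prints [('/dev/mapper/vg0-root', '/')]
```

On the guest, the same function can be fed the live table with
ext4_mounts_with_discard(open("/proc/mounts").read()).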