Message-ID: <bug-102731-13602-iGtpT7UGll@https.bugzilla.kernel.org/>
Date: Mon, 28 Sep 2015 17:06:41 +0000
From: bugzilla-daemon@...zilla.kernel.org
To: linux-ext4@...r.kernel.org
Subject: [Bug 102731] I have a cough.

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #13 from Theodore Tso <tytso@....edu> ---

So it's been 12 days, and previously when you were using the Debian 3.16
kernel, it was triggering once every four days, right? Can I assume that
your silence indicates that you haven't seen a problem to date?

If so, then it really does seem that it might be an interaction between
LVM/MD and KVM.

So if that's the case, then the next thing to ask is to try to figure out
what might be the triggering cause. A couple of things come to mind:

1) Some failure to properly handle a flush cache command being sent to the
MD device. This, combined with either a power failure or a crash of the
guest OS (depending on how KVM is configured), might explain a block update
getting lost. The fact that the block bitmap is out of sync with the block
group descriptor is consistent with this failure. However, if you were
seeing failures once every four days, that would imply that the guest OS
and/or host OS would be crashing at about that frequency, and you haven't
reported that.

2) Some kind of race between a 4k write and a RAID1 resync leading to a
block write getting lost. Again, this reported data corruption is
consistent with this theory --- but this also requires the guest OS
crashing due to some kind of kernel crash or KVM/qemu shutdown and/or host
OS crash / power failure, as in (1) above. If you weren't seeing these
failures once every four days or so, then this isn't a likely explanation.

3) Some kind of corruption caused by the TRIM command being sent to the
RAID/MD device, possibly racing with a block bitmap update. This could be
caused either by the file system being mounted with the -o discard mount
option, or by fstrim getting run out of cron, or by e2fsck explicitly being
asked to discard unused blocks (with the "-E discard" option).

4) Some kind of bug which happens rarely either in qemu, the host kernel or
the guest kernel, depending on how it communicates with the virtual disk
(i.e., virtio, scsi, ide, etc.). Virtio is the most likely use case, and
so trying to change to use scsi emulation might be interesting. (OTOH, if
the problem is specific to the MD layer, then this possibility is less
likely.)

So as far as #3 is concerned, can you check to see if you had fstrim
enabled, or are mounting the file system with -o discard?
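
As an illustration of the "-o discard" half of the check asked about in #3, here is a minimal Python sketch. It assumes the standard /proc/mounts layout (device, mount point, file system type, comma-separated options); the function name discard_mounts is hypothetical and not part of any tool mentioned in this thread. It does not detect fstrim being run from cron or a timer, nor e2fsck run with "-E discard"; those would have to be checked separately.

```python
def discard_mounts(mounts_text, fstypes=("ext4",)):
    """Return (device, mountpoint) pairs mounted with the 'discard' option.

    mounts_text is text in /proc/mounts format:
        device mountpoint fstype options dump pass
    Only file system types listed in fstypes are considered.
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        # Options are comma-separated; match 'discard' as a whole option,
        # so that e.g. 'nodiscard' is not reported.
        if fstype in fstypes and "discard" in options.split(","):
            hits.append((device, mountpoint))
    return hits
```

On the system in question this could be called as discard_mounts(open("/proc/mounts").read()), inside the guest and, if the host file systems are of interest, on the host as well. An empty result only rules out online discard via the mount option.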