Message-ID: <bug-102731-13602-ThvJJxnjtl@https.bugzilla.kernel.org/>
Date: Wed, 30 Sep 2015 09:49:21 +0000
From: bugzilla-daemon@...zilla.kernel.org
To: linux-ext4@...r.kernel.org
Subject: [Bug 102731] I have a cough.
https://bugzilla.kernel.org/show_bug.cgi?id=102731
--- Comment #14 from John Hughes <john@...va.com> ---
On 28/09/15 19:06, bugzilla-daemon@...zilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=102731
>
> --- Comment #13 from Theodore Tso <tytso@....edu> ---
> So it's been 12 days, and previously when you were using the Debian 3.16
> kernel, it was triggering once every four days, right? Can I assume that your
> silence indicates that you haven't seen a problem to date?
I haven't seen the problem, but unfortunately I'm running 3.18.19 at the
moment (I screwed up on the last boot and let it boot the default
kernel). I haven't had time to reboot. So I'd like to give it a bit
more time.
>
> If so, then it really does seem that it might be an interaction between LVM/MD
> and KVM.
>
> So if that's the case, then the next thing to ask is to try to figure out what
> might be the triggering cause. A couple of things come to mind:
>
> 1) Some failure to properly handle a flush cache command being sent to the MD
> device. This, combined with either a power failure or a crash of the guest OS
> (depending on how KVM is configured), might explain a block update getting
> lost. The fact that the block bitmap is out of sync with the block group
> descriptor is consistent with this failure. However, if you were seeing
> failures once every four days, that would imply that the guest OS and/or host
> OS would be crashing at that or about that level of frequency, and you haven't
> reported that.
I haven't had any host or guest crashes.
>
> 2) Some kind of race between a 4k write and a RAID1 resync leading to a block
> write getting lost. Again, this reported data corruption is consistent with
> this theory --- but this also requires the guest OS crashing due to some kind
> of kernel crash or KVM/qemu shutdown and/or host OS crash / power failure, as
> in (1) above. If you weren't seeing these failures once every four days or so,
> then this isn't a likely explanation.
No crashes.
>
> 3) Some kind of corruption caused by the TRIM command being sent to the
> RAID/MD device, possibly racing with a block bitmap update. This could be
> caused either by the file system being mounted with the -o discard mount
> option, or by fstrim getting run out of cron, or by e2fsck explicitly being
> asked to discard unused blocks (with the "-E discard" option).
I'm not using "-o discard" or fstrim, and I've never used the "-E discard"
option to fsck.
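For anyone wanting to double-check the mount-option half of this, here is a
minimal Python sketch. It assumes the usual /proc/mounts layout (device,
mount point, filesystem type, comma-separated options) and that ext4 lists
"discard" among the options when it is active. The function name
discard_mounts is made up for illustration. It does not cover fstrim run
from cron or "-E discard" passed to e2fsck.

```python
def discard_mounts(mounts_text, fstypes=("ext4",)):
    """Return (device, mountpoint) pairs whose mount options include 'discard'.

    mounts_text is expected to be in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        # Compare whole option names so "nodiscard" does not match.
        if fstype in fstypes and "discard" in options.split(","):
            hits.append((device, mountpoint))
    return hits
```

On a running system the input would be the contents of /proc/mounts, read
inside the guest (and on the host, if the host file systems are of interest).
An empty result only means no matching file system is currently mounted with
the discard option.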
>
> 4) Some kind of bug which happens rarely either in qemu, the host kernel or
> the guest kernel depending on how it communicates with the virtual disk.
> (i.e., virtio, scsi, ide, etc.) Virtio is the most likely use case, and so
> trying to change to use scsi emulation might be interesting. (OTOH, if the
> problem is specific to the MD layer, then this possibility is less likely.)
>
> So as far as #3 is concerned, can you check to see if you had fstrim enabled,
> or are mounting the file system with -o discard?
>
I'm a bit overwhelmed with work at the moment, so I haven't had time to
read this message with the care it deserves. I'll get back to you with
more detail next week.
--
You are receiving this mail because:
You are watching the assignee of the bug.