Message-ID: <485FCC25.7090401@theshore.net>
Date: Mon, 23 Jun 2008 12:15:33 -0400
From: "Christopher S. Aker" <caker@...shore.net>
To: linux-kernel@...r.kernel.org
CC: xen devel <xen-devel@...ts.xensource.com>
Subject: ext3 directory corruption under Xen

We've been seeing a rash of ext3 directory corruption under Xen. All but
one of the reports involve filesystems formatted with a 1024-byte
blocksize. The remaining report, potentially the same bug, is on a
filesystem with a 4096-byte blocksize (either way, that case saw some
type of corruption). In all cases the filesystems were mounted with
ext3's default journaling mode -- no quotas or anything else beyond the
default ext3 mount options.
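
For anyone wanting to correlate their own reports, here's a minimal
sketch for checking a mounted filesystem's block size from inside a
guest (for ext3 on Linux, statfs() reports the filesystem block size in
f_bsize):

/* Print the block size of the filesystem holding the given path. */
#include <stdio.h>
#include <sys/vfs.h>

int main(int argc, char **argv)
{
	struct statfs st;

	if (argc < 2 || statfs(argv[1], &st) < 0) {
		perror("statfs");
		return 1;
	}
	printf("%s: block size %ld\n", argv[1], (long) st.f_bsize);
	return 0;
}
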
It's happened on a number of different hosts, all of the same hardware
and software configuration (Xen 3.2 64-bit, 32-bit PAE dom0, 32-bit PAE
domUs; LVM backend with 3ware hardware RAID-1). Some of those hosts
previously ran non-virtualized Linux and UML with the identical guest
images, and under that configuration never experienced this problem.

This has occurred under both the 2.6.18-xenbits and the more recent
pv_ops-based kernels (2.6.24, 2.6.25), which I presume all use the same
blkfront driver code.

The common workloads across the reports seem to be active maildirs and rsync.

The initial errors reported back are all from fs/ext3/dir.c, in
ext3_check_dir_entry(). Most commonly hit is the "rec_len % 4 != 0"
check. We've seen other checks trigger, but my assumption is that those
happen after more stuff gets whacked out.
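
For reference, the checks in question amount to the following (a
paraphrase of ext3_check_dir_entry() in fs/ext3/dir.c; names are
simplified here, this is not the exact kernel source):

/*
 * Sanity checks applied to each on-disk directory entry.
 * 12 is the minimal entry size, EXT3_DIR_REC_LEN(1); the kernel
 * also checks that the entry's inode number is in bounds.
 */
static int dir_entry_looks_sane(unsigned rec_len, unsigned name_len,
				unsigned offset_in_block,
				unsigned block_size)
{
	if (rec_len < 12)			/* smaller than minimal */
		return 0;
	if (rec_len % 4 != 0)			/* the check we hit most */
		return 0;
	if (rec_len < ((8 + name_len + 3) & ~3)) /* too small for name_len */
		return 0;
	if (offset_in_block + rec_len > block_size) /* crosses a block */
		return 0;
	return 1;
}
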
Eventually the fs will go read-only. In extreme cases, the fs is chewed
through enough that data is lost.

It's tricky to track down the trigger because the corruption is only
detectable after it's happened. Our attempts to reproduce it with
various filesystem-thrashing scripts haven't yielded a reliable
trigger; we have managed to hit it just twice in two weeks :(.
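
In case it's useful to anyone, this is the flavor of churn we've been
running (the real scripts are shell; this standalone C sketch of
maildir-style delivery is illustrative, not the exact workload):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	char tmpname[64], newname[64];
	long i;
	int fd;

	mkdir("tmp", 0755);	/* ignore EEXIST */
	mkdir("new", 0755);

	for (i = 0; ; i++) {
		snprintf(tmpname, sizeof tmpname, "tmp/%ld.msg", i % 5000);
		snprintf(newname, sizeof newname, "new/%ld.msg", i % 5000);

		/* "deliver": write a small file into tmp/ ... */
		fd = open(tmpname, O_CREAT | O_WRONLY | O_TRUNC, 0644);
		if (fd < 0) {
			perror("open");
			exit(1);
		}
		if (write(fd, "x", 1) != 1)
			perror("write");
		fsync(fd);
		close(fd);

		/* ... rename into new/, as maildir delivery does ... */
		if (rename(tmpname, newname) < 0) {
			perror("rename");
			exit(1);
		}

		/* ... and delete some messages as we go, to keep the
		 * directory blocks churning. */
		if (i % 3 == 0)
			unlink(newname);
	}
}
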
My hope is that this triggers an "a-ha" from someone in LKML or Xen
land who has experience with this code, or that this is a known issue
for which a fix already exists.

We're scared. Please help.

Thanks,
-Chris