linux-ext4 - Re: ext4 metadata corruption bug?

Message-ID: <20140410050428.GV10985@gradx.cs.jhu.edu>
Date:	Thu, 10 Apr 2014 01:04:28 -0400
From:	Nathaniel W Filardo <nwf@...jhu.edu>
To:	Theodore Tso <tytso@...gle.com>
Cc:	Mike Rubin <mrubin@...gle.com>, Frank Mayhar <fmayhar@...gle.com>,
	admins@....jhu.edu, linux-ext4@...r.kernel.org
Subject: Re: ext4 metadata corruption bug?

On Wed, Apr 09, 2014 at 10:55:48PM -0400, Theodore Tso wrote:
> Hi Nathaniel,
> 
> In general, it's best if you send these sorts of requests for help to the
> linux-ext4@...r.kernel.org mailing list.

Added to CC.

> The fact that we see the "error count" line early in the boot message
> suggests to me that your VM is not running fsck to fix up the errors before
> mounting the file system.  (Well, either that or you're using a really
> ancient version of e2fsck, but given that you're using a bleeding edge
> kernel, I'm guessing you're using a reasonably recent version of
> e2fsck.  But that would be good for you to check.)

e2fsck version is 1.42.9 using the same library version.
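
For checking whether the filesystem still carries recorded errors before it
is mounted, here is a minimal sketch that parses the header text printed by
`dumpe2fs -h`. The helper names (parse_dumpe2fs_header, needs_fsck) are
hypothetical, and the sample text is illustrative only, not output captured
from this machine.

```python
def parse_dumpe2fs_header(text):
    """Split 'Key:   value' lines from `dumpe2fs -h` into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def needs_fsck(fields):
    """True if the superblock state or error counter reports errors."""
    state = fields.get("Filesystem state", "")
    error_count = int(fields.get("FS Error count", "0"))
    return "errors" in state or error_count > 0


# Illustrative header text (assumed values, not from the affected system).
sample = """Filesystem state:         clean with errors
FS Error count:           3
"""

info = parse_dumpe2fs_header(sample)
print(needs_fsck(info))
```

If this reports errors on an unmounted device, the boot sequence is probably
not running e2fsck before the mount.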
 
> The ext4 error message is due to the file system getting corrupted.  How
> the file system got corrupted isn't 100% clear, but one potential cause is
> how the disk is configured with qemu.
>[snip]

We use QEMU directives like

        -drive format=raw,file=rbd:rbdafs-mirror/mirror-0,id=drive5,if=none,cache=writeback \
        -device driver=ide-hd,drive=drive5,discard_granularity=512,bus=ahci0.3

We've never had, so far as I know, an unexpected shutdown of the QEMU
process, so I don't think that unexpected loss of cache contents is to
blame.

Perhaps the dmesg I sent was not representative; some days ago, we saw, only
(comparatively!) late in the machine's uptime:

[309894.428685] EXT4-fs (sdd): pa ffff88000d9f9440: logic 832, phys.  957458972, len 192
[309894.430023] EXT4-fs error (device sdd): ext4_mb_release_inode_pa:3729: group 29219, free 192, pa_free 191
[309894.431822] Aborting journal on device sdd-8.
[309894.442913] EXT4-fs (sdd): Remounting filesystem read-only

with Debian kernel 3.13.5-1; sdd here is the same filesystem as in the
earlier dmesg.
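
To pull such events out of a longer log, a minimal sketch along these lines
may help. It matches only the ext4_mb_release_inode_pa error format quoted
above; the function name find_pa_mismatches is hypothetical, and the regex
is an assumption based on that single line rather than on every kernel
version's message format.

```python
import re

# Matches the ext4_mb_release_inode_pa error line as quoted in this message.
PA_ERROR = re.compile(
    r"EXT4-fs error \(device (?P<dev>\w+)\): ext4_mb_release_inode_pa:\d+: "
    r"group (?P<group>\d+), free (?P<free>\d+), pa_free (?P<pa_free>\d+)"
)


def find_pa_mismatches(log_text):
    """Return (device, group, free, pa_free) for each matching log line."""
    results = []
    for line in log_text.splitlines():
        m = PA_ERROR.search(line)
        if m:
            results.append(
                (m["dev"], int(m["group"]), int(m["free"]), int(m["pa_free"]))
            )
    return results


sample = (
    "[309894.430023] EXT4-fs error (device sdd): "
    "ext4_mb_release_inode_pa:3729: group 29219, free 192, pa_free 191"
)

for dev, group, free, pa_free in find_pa_mismatches(sample):
    print(f"{dev}: group {group}: free={free}, pa_free={pa_free}, "
          f"diff={free - pa_free}")
```

Feeding it the full dmesg output would list every block group where the two
counts disagree, which makes it easier to see whether the same group recurs.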

I'll capture any subsequent crashes and follow up.

Thanks much!
--nwf;
