linux-ext4 - Re: ext4 metadata corruption bug?

Message-ID: <20140410140316.GD15925@thunk.org>
Date:	Thu, 10 Apr 2014 10:03:16 -0400
From:	Theodore Ts'o <tytso@....edu>
To:	Nathaniel W Filardo <nwf@...jhu.edu>
Cc:	Theodore Tso <tytso@...gle.com>, Mike Rubin <mrubin@...gle.com>,
	Frank Mayhar <fmayhar@...gle.com>, admins@....jhu.edu,
	linux-ext4@...r.kernel.org
Subject: Re: ext4 metadata corruption bug?

On Thu, Apr 10, 2014 at 01:04:28AM -0400, Nathaniel W Filardo wrote:
> We use QEMU directives like
> 
>         -drive format=raw,file=rbd:rbdafs-mirror/mirror-0,id=drive5,if=none,cache=writeback \
>         -device driver=ide-hd,drive=drive5,discard_granularity=512,bus=ahci0.3
> 
> We've never had, so far as I know, an unexpected shutdown of the QEMU
> process, so I don't think that unexpected loss of cache contents is to
> blame.
> 
> Perhaps the dmesg I sent was not representative; some days ago, we saw, only
> (comparatively!) late in the machine's uptime:
> 
> [309894.428685] EXT4-fs (sdd): pa ffff88000d9f9440: logic 832, phys.  957458972, len 192
> [309894.430023] EXT4-fs error (device sdd): ext4_mb_release_inode_pa:3729: group 29219, free 192, pa_free 191
> [309894.431822] Aborting journal on device sdd-8.
> [309894.442913] EXT4-fs (sdd): Remounting filesystem read-only
> 
> with Debian kernel 3.13.5-1; sdd here is the same filesystem as in the
> earlier dmesg.

What is your workload?  Can you reproduce this easily?  And can you
try using a local disk to see if the problem goes away, so we can
start to bisect which software components might be at fault?

I'm not aware of any corruption problem with a 3.13-based kernel which
matches your signature, and the ext4 errors that you are showing
(minor accounting discrepancies in the number of free blocks and number
of free inodes between the allocation bitmap and the summary
statistics in the block group descriptors) very closely match the
signature of some part of the storage stack not honoring FLUSH CACHE
("barrier") operations, either by ignoring them completely, or by
reordering writes across a barrier / flush cache request.
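
To illustrate the failure mode described above, here is a toy Python model of a
device with a volatile write cache. It is only a sketch under stated
assumptions: ToyDisk, commit_allocation, recover, and the three "blocks"
(journal_bitmap, journal_count, commit_seq) are hypothetical names, not ext4's
or jbd2's real on-disk format, and the model assumes that cached writes are
lost at some point with an arbitrary subset having reached the media.

```python
import copy


class ToyDisk:
    """Toy device with a volatile write cache (hypothetical model, not a real API)."""

    def __init__(self, honor_flush):
        self.honor_flush = honor_flush
        # Durable state: 8 blocks, all free, summary count agrees.
        self.media = {"journal_bitmap": [0] * 8, "journal_count": 8, "commit_seq": 0}
        self.cache = {}  # writes acknowledged but not yet durable

    def write(self, key, value):
        self.cache[key] = copy.deepcopy(value)

    def flush(self):
        # A well-behaved device makes every cached write durable here.
        # A misbehaving one acknowledges the flush and does nothing.
        if self.honor_flush:
            self.media.update(self.cache)
            self.cache.clear()

    def lose_cache(self, survivors):
        # The cache is lost; only the writes named in `survivors` had
        # happened to reach the media, in whatever order the device chose.
        for key in survivors:
            if key in self.cache:
                self.media[key] = self.cache[key]
        self.cache.clear()


def commit_allocation(disk, survivors):
    """Allocate one block: journal the update, flush, then write the commit record."""
    bitmap = [1] + [0] * 7
    disk.write("journal_bitmap", bitmap)
    disk.write("journal_count", bitmap.count(0))
    disk.flush()  # barrier: journal contents must be durable before the commit
    disk.write("commit_seq", 1)
    disk.lose_cache(survivors)  # cache lost before the final flush completes


def recover(disk):
    """Return (free blocks per bitmap, free blocks per summary) after replay."""
    m = disk.media
    if m["commit_seq"] == 0:
        return (8, 8)  # no commit record: transaction discarded, old state kept
    return (m["journal_bitmap"].count(0), m["journal_count"])


good = ToyDisk(honor_flush=True)
commit_allocation(good, ["commit_seq", "journal_count"])
print("flush honored:", recover(good))

bad = ToyDisk(honor_flush=False)
commit_allocation(bad, ["commit_seq", "journal_count"])
print("flush ignored:", recover(bad))
```

With the flush honored, the bitmap and the summary count always agree after
recovery. With the flush ignored, the commit record can become durable ahead of
part of the data it covers, and the two counts then differ by one, which is the
kind of minor accounting discrepancy discussed above.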

Cheers,

					- Ted