lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Thu, 10 Apr 2014 10:03:16 -0400
From:	Theodore Ts'o <tytso@....edu>
To:	Nathaniel W Filardo <nwf@...jhu.edu>
Cc:	Theodore Tso <tytso@...gle.com>, Mike Rubin <mrubin@...gle.com>,
	Frank Mayhar <fmayhar@...gle.com>, admins@....jhu.edu,
	linux-ext4@...r.kernel.org
Subject: Re: ext4 metadata corruption bug?

On Thu, Apr 10, 2014 at 01:04:28AM -0400, Nathaniel W Filardo wrote:
> We use QEMU directives like
> 
>         -drive format=raw,file=rbd:rbdafs-mirror/mirror-0,id=drive5,if=none,cache=writeback \
>         -device driver=ide-hd,drive=drive5,discard_granularity=512,bus=ahci0.3
> 
> We've never had, so far as I know, an unexpected shutdown of the QEMU
> process, so I don't think that unexpected loss of cache contents is to
> blame.
> 
> Perhaps the dmesg I sent was not representative; some days ago, we saw, only
> (comparatively!) late in the machine's uptime:
> 
> [309894.428685] EXT4-fs (sdd): pa ffff88000d9f9440: logic 832, phys.  957458972, len 192
> [309894.430023] EXT4-fs error (device sdd): ext4_mb_release_inode_pa:3729: group 29219, free 192, pa_free 191
> [309894.431822] Aborting journal on device sdd-8.
> [309894.442913] EXT4-fs (sdd): Remounting filesystem read-only
> 
> with Debian kernel 3.13.5-1; sdd here is the same filesystem as in the
> earlier dmesg.

What is your workload?  Can you reproduce this easily?  And can you
try using a local disk to see if the problem goes away, so we can
start to bisect which software components might be at fault?

I'm not aware of any corruption problem with a 3.13 based kernel which
matches your signature, and the ext4 errors that you are showing
(minor accounting discrepancies in the number free blocks and number
of free inodes between the allocation bitmap and the summary
statistics in the block group descriptors) is very closely matches the
signature of some part of the storage stack not honoring FLUSH CACHE
("barrier") operations, either by ignoring them completely, or
reordring writes across a barrier / flush cache request.

Cheers,

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ