linux-ext4 - Re: Exciting :-( adventures in metadata checksumming

Message-ID: <20120808234239.4443.qmail@science.horizon.com>
Date:	8 Aug 2012 19:42:39 -0400
From:	"George Spelvin" <linux@...izon.com>
To:	linux-ext4@...r.kernel.org, tytso@....edu
Cc:	linux@...izon.com
Subject: Re: Exciting :-( adventures in metadata checksumming

> Can someone find a workaround QUICKLY?  I can't keep this FS read-only
> for long.

I thought I had figured out a great workaround: Use 1.42.4, which doesn't
know how to check checksums.

But then I discovered that it aborts and delivers a zero-length file
if there are filesystem inconsistencies, too!  So I get:

e2image 1.42.4 (12-Jun-2012)
Illegal block number passed to ext2fs_mark_block_bitmap #3571066296 for in-use block map
Illegal block number passed to ext2fs_mark_block_bitmap #2895243190 for in-use block map
Illegal block number passed to ext2fs_mark_block_bitmap #3276895043 for in-use block map
Illegal block number passed to ext2fs_mark_block_bitmap #2488200263 for in-use block map
Illegal block number passed to ext2fs_mark_block_bitmap #2556839855 for in-use block map
... snip... (2671 total "Illegal block number passed" messages)
Illegal block number passed to ext2fs_mark_block_bitmap #3421917394 for in-use block map
Illegal block number passed to ext2fs_mark_block_bitmap #3469830505 for in-use block map
e2image: Illegal indirect block found while iterating over inode 85800474

I'm not sure this is The Right Thing To Do for a debugging tool.

The file system is a RAID-6, and repeated verifications have failed to find
RAID mismatches.

I am starting to suspect motherboard/RAM on this machine.  Already the bad
magic number error patterns looked odd to me, and I was just reminded that
we had to swap the RAM when it was first built so memtest86 would pass.
We ran it for many hours, but it *is* a consumer Intel box with no ECC.

And 8 GiB of RAM, and acting primarily as a file server, so FS metadata can
sit and bit-rot in RAM for a very long time.

I'm going to play with "hdparm -f" and drop_caches to see if I can make
the file system problems go away with no repair other than re-reading
from disk.
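
A minimal sketch of the drop_caches half of that experiment, assuming a
Linux box and root privileges.  The helper name and its "path" parameter
are made up for illustration; the only real interface used is the
/proc/sys/vm/drop_caches file, where writing 1 drops the page cache,
2 drops dentries and inodes, and 3 drops both.  Only clean pages are
discarded, hence the sync first.

```python
import os


def drop_caches(mode=3, path="/proc/sys/vm/drop_caches"):
    """Ask the kernel to discard clean cached data (needs root on a real system).

    `path` is a parameter only so the helper can be tried against an
    ordinary file; that is an illustrative convenience, not a kernel feature.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    os.sync()  # write back dirty pages first so they become droppable
    with open(path, "w") as f:
        f.write(f"{mode}\n")
```

After calling drop_caches(), the next metadata access has to come from
disk, so an error that disappears afterwards points at RAM rather than
at the on-disk data.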

If so, that would confirm it as not ext4's problem.  Although it *would* be
a very cool debugging feature to re-check the checksum whenever a metadata
page is discarded from the buffer cache.

If the checksum matched when first read in, and doesn't when a supposedly
clean page is discarded, *something* is corrupting RAM.  (If you
assume that it's a single bit flip, then you can deduce the location
from the error syndrome.)
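
The syndrome idea can be sketched as follows.  This is an illustration
under stated assumptions, not ext4 code: it uses a plain bitwise CRC32C
over a whole buffer, keeps the known-good checksum separately from the
data, and ignores how ext4 actually seeds and stores its checksums.
The function names are hypothetical.  It relies on the CRC being linear
over XOR: the XOR of the good and bad checksums depends only on which
bit flipped, so walking backwards from the end of the block finds the
position without re-checksumming the block for every candidate bit.

```python
POLY = 0x82F63B78  # reflected CRC32C (Castagnoli) polynomial


def _step(state, byte):
    """Feed one byte into the raw CRC register (no init/final XOR)."""
    state ^= byte
    for _ in range(8):
        state = (state >> 1) ^ (POLY if state & 1 else 0)
    return state


def crc32c(data):
    """Standard CRC32C: init 0xFFFFFFFF, final XOR 0xFFFFFFFF."""
    state = 0xFFFFFFFF
    for byte in data:
        state = _step(state, byte)
    return state ^ 0xFFFFFFFF


def locate_single_bit_flip(block, good_crc):
    """Return (byte_offset, bit) of a single flipped bit, or None.

    None means either the checksum still matches or the damage is not
    explained by exactly one flipped bit.
    """
    syndrome = crc32c(block) ^ good_crc
    if syndrome == 0:
        return None
    # states[b] is the syndrome of flipping bit b in the byte at `offset`.
    states = [_step(0, 1 << b) for b in range(8)]
    for offset in range(len(block) - 1, -1, -1):
        for bit, s in enumerate(states):
            if s == syndrome:
                return offset, bit
        # Moving one byte earlier appends one more zero byte after the flip.
        states = [_step(s, 0) for s in states]
    return None


# Usage: checksum a block, flip one bit "in RAM", then find it again.
block = bytes(range(256)) * 2
good = crc32c(block)
damaged = bytearray(block)
damaged[100] ^= 0x10
location = locate_single_bit_flip(bytes(damaged), good)
```

If the result is None while the checksums differ, the single-bit
assumption does not hold for that block.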

Anyway, thanks for the help!