linux-ext4 - Re: fsck infinite loop on corrupt ext4 file system

Message-Id: <1250557822.23227.9.camel@bobble.smo.corp.google.com>
Date:	Mon, 17 Aug 2009 18:10:22 -0700
From:	Frank Mayhar <fmayhar@...gle.com>
To:	linux-ext4@...r.kernel.org
Cc:	tytso@....edu
Subject: Re: fsck infinite loop on corrupt ext4 file system

On Fri, 2009-08-14 at 16:55 -0700, Frank Mayhar wrote:
> Hello, folks.  We recently ran into a pretty severe ext4 crash (being
> worked on by someone else) that caused some seriously corrupted file
> systems, one of which in turn exposed an fsck problem.  We noticed this
> when fsck started looping endlessly trying to correct that file system.
> Basically, the group descriptors were mangled; fsck complains about
> invalid checksums, forces a full check and during pass 1 tries to
> allocate some inode bitmap blocks (apparently).  That allocation fails,
> pass 1 errors out and starts the check over.  Endlessly.

I've made a little more progress since Friday.  I had grabbed a dumpe2fs
dump of the corrupted file system and another of a newly created file
system on the same device.  Adjusting for normal variation (numbers of
free blocks, flags, etc.), there are no differences _except_ in the very
block groups that fsck complained about having bad checksums.  For those
(and only those), the locations of the block bitmap and inode table
differ.  I've attached the diff output.
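To make the checksum complaint concrete, here is a minimal sketch (Python, for illustration only) that decodes the first 32 bytes of an ext4 group descriptor and recomputes its checksum. It assumes the uninit_bg (GDT_CSUM) scheme: a reflected CRC-16 seeded with 0xFFFF over the file system UUID, the little-endian group number, and the descriptor with its bg_checksum field skipped. The helper names are hypothetical, not e2fsprogs functions, and file systems using metadata_csum checksum descriptors differently.

```python
import struct

# First 32 bytes of struct ext4_group_desc, little-endian: block bitmap,
# inode bitmap, inode table (low 32 bits each), free blocks, free inodes,
# used dirs, flags, 8 reserved bytes, itable_unused, checksum.
GD_FORMAT = "<IIIHHHH8xHH"
GD_CSUM_OFFSET = 30  # offset of bg_checksum within the descriptor


def parse_group_desc(raw: bytes) -> dict:
    """Decode the fields dumpe2fs prints for one group descriptor (hypothetical helper)."""
    fields = struct.unpack_from(GD_FORMAT, raw)
    names = ("block_bitmap", "inode_bitmap", "inode_table",
             "free_blocks", "free_inodes", "used_dirs", "flags",
             "itable_unused", "checksum")
    return dict(zip(names, fields))


def crc16(crc: int, data: bytes) -> int:
    """Reflected CRC-16 (poly 0x8005, processed LSB-first as 0xA001), no final XOR."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc


def group_desc_csum(uuid: bytes, group: int, raw: bytes) -> int:
    """Assumed uninit_bg checksum: UUID, le32 group number, descriptor minus bg_checksum."""
    crc = crc16(0xFFFF, uuid)
    crc = crc16(crc, struct.pack("<I", group))
    crc = crc16(crc, raw[:GD_CSUM_OFFSET])
    # Descriptors larger than 32 bytes (64bit feature) also cover the tail.
    crc = crc16(crc, raw[GD_CSUM_OFFSET + 2:])
    return crc


def group_desc_csum_ok(uuid: bytes, group: int, raw: bytes) -> bool:
    """True if the stored bg_checksum matches the recomputed value."""
    return parse_group_desc(raw)["checksum"] == group_desc_csum(uuid, group, raw)
```

Because the group number is part of the checksummed data, a descriptor whose location fields were overwritten fails the check even if the rest of the block looks plausible.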

In particular, block group 276 claims to have its inode table at blocks
0-204, which is clearly wrong.  This is the block group for which the
allocation failed, causing the original loop.
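A sanity check of the kind suggested could start as simply as the sketch below. It rejects an inode table that begins at or before the primary superblock (exactly the "blocks 0-204" case), runs past the end of the file system, or, without flex_bg, strays outside its own group. The function name and parameters are hypothetical; the geometry would come from the superblock.

```python
def itable_location_sane(group: int, itable_start: int, *,
                         blocks_count: int, first_data_block: int,
                         blocks_per_group: int, itable_blocks: int,
                         flex_bg: bool) -> bool:
    """Hypothetical plausibility test for one descriptor's inode table location."""
    itable_end = itable_start + itable_blocks  # exclusive end
    # Block s_first_data_block holds the primary superblock (block 0 with
    # 4 KiB blocks, block 1 with 1 KiB blocks); no inode table can start
    # there or earlier.
    if itable_start <= first_data_block:
        return False
    # The whole table must fit inside the file system.
    if itable_end > blocks_count:
        return False
    if not flex_bg:
        # Without flex_bg, a group's metadata lives inside its own group.
        group_start = first_data_block + group * blocks_per_group
        group_end = min(group_start + blocks_per_group, blocks_count)
        if itable_start < group_start or itable_end > group_end:
            return False
    return True
```

With flex_bg the inode table may legitimately live in another group, so only the file-system-wide bounds are checked in that case.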

It's clear that fsck is neither correcting the block groups nor
detecting the bad entries properly (a sanity check might be in order
here).  It's not even noticing that it's looping; it just keeps failing
the allocation and retrying.  While fsck may not be able to recover the
file system in this case, it should at least notice the loop and abort.
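One way to notice the loop, sketched around a hypothetical pass-1 driver (Pass1Restart, CheckAborted and run_pass1_guarded are illustrative names, not e2fsck's): cap the number of restarts, and abort as soon as the same failure recurs, since an identical error on the next attempt means the previous restart fixed nothing.

```python
class Pass1Restart(Exception):
    """Hypothetical: pass 1 must restart (e.g. after an allocation failure)."""


class CheckAborted(Exception):
    """Hypothetical: the checker gives up instead of looping forever."""


def run_pass1_guarded(run_pass1, max_restarts: int = 3):
    """Run pass 1, allowing at most max_restarts restarts.

    A restart whose reason was already seen means no progress was made
    since the last attempt, so abort immediately rather than retry.
    """
    seen_reasons = set()
    for _ in range(max_restarts + 1):
        try:
            return run_pass1()
        except Pass1Restart as exc:
            reason = str(exc)
            if reason in seen_reasons:
                raise CheckAborted(
                    f"pass 1 failed again with the same error ({reason}); "
                    "no progress is being made") from exc
            seen_reasons.add(reason)
    raise CheckAborted(f"pass 1 requested more than {max_restarts} restarts")
```

Comparing failure reasons is a cheap progress test; a real checker might instead compare the set of blocks it changed between attempts.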

My thinking is that the location of the inode tables should be invariant
over the life of the file system.  Certainly there's no place in ext4
itself that changes those fields (that I can see, anyway).  Why couldn't
fsck compute the proper values and compare those against what's there?
-- 
Frank Mayhar <fmayhar@...gle.com>
Google, Inc.

[Attachment: "dump-diff" (text/x-patch, 30384 bytes)]