linux-ext4 - Re: Recovering a damaged ext4 fs

Message-ID: <498CB68C.5030409@redhat.com>
Date:	Fri, 06 Feb 2009 17:15:40 -0500
From:	Ric Wheeler <rwheeler@...hat.com>
To:	"J.D. Bakker" <jdb@...tmaker.nl>
CC:	linux-ext4@...r.kernel.org
Subject: Re: Recovering a damaged ext4 fs - revisited.

J.D. Bakker wrote:
> Hi,
>
> My 4TB ext4 RAID-6 has just become damaged for the second time in two 
> months. While I do have backups for most of my data, it would be good 
> to know if there is a recovery procedure or a way to avoid these 
> crashes. The symptoms are massive group descriptor corruption, similar 
> to what was mentioned in 
> http://thread.gmane.org/gmane.comp.file-systems.ext4/10844 and 
> http://article.gmane.org/gmane.comp.file-systems.ext4/11195 .
What kind of RAID 6 device are you using? Is it MD RAID or a vendor
array?

Ric


>
> The bad news: on the first occurrence I didn't record any information 
> but decided to zero the partitions and restart from scratch. This 
> second time my kernel is tainted by the nvidia module (as I since 
> switched to an nVidia 8500-card from the Radeon X1300 I'd borrowed to 
> get the system up).
>
> The machine is an Intel i7 920 on an Asus P6T with 3GB RAM, running 
> 2.6.28 x86_64. /dev/md0 is a RAID-6 over six 1TB drives. Details:
>
> http://lartmaker.nl/ext4/kernel-config.txt
> http://lartmaker.nl/ext4/dmesg.txt
> http://lartmaker.nl/ext4/lspci.txt
> http://lartmaker.nl/ext4/proc-mdstat.txt
> http://lartmaker.nl/ext4/proc-partitions.txt
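
For readers checking the sizes quoted above: a rough sketch of the usable capacity of a RAID-6 set, assuming equal-sized members and ignoring metadata and partitioning overhead. The function name and the use of decimal terabytes are illustrative assumptions, not taken from the original message.

```python
def raid6_usable_bytes(drive_count, drive_bytes):
    """Usable capacity of a RAID-6 set: two members' worth of space holds parity."""
    if drive_count < 4:
        raise ValueError("RAID-6 needs at least four member devices")
    return (drive_count - 2) * drive_bytes

TB = 10**12  # assumption: drive vendors' decimal terabyte

usable = raid6_usable_bytes(6, 1 * TB)
print(usable / TB)               # 4.0  (matches the "4TB" in the report)
print(round(usable / 2**40, 2))  # 3.64 (the same capacity expressed in TiB)
```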
>
> This afternoon I issued an rm on a file a few hundred MB in size.
> The rm process kept running at 100% CPU for over a minute and
> could not be terminated through either CTRL-C or kill -9 (the process
> would remain in the 'R' state). The kernel reported a soft lockup,
> with the following call trace:
>
>   [<ffffffff8050f1b7>] ? _spin_lock+0x16/0x19
>   [<ffffffff80308a23>] ? ext4_mb_init_cache+0x6d2/0x876
>   [<ffffffff802754de>] ? __lru_cache_add+0x8a/0xb2
>   [<ffffffff80308cd6>] ? ext4_mb_load_buddy+0x10f/0x2f2
>   [<ffffffff80309d15>] ? ext4_mb_free_blocks+0x2b3/0x611
>   [<ffffffff802f0aa8>] ? ext4_free_blocks+0x75/0xa8
>   [<ffffffff80303839>] ? ext4_ext_truncate+0x3f9/0x832
>   [<ffffffff802f848e>] ? ext4_truncate+0x67/0x5bc
>   [<ffffffff80316279>] ? jbd2_journal_dirty_metadata+0x124/0x146
>   [<ffffffff80305ba6>] ? __ext4_journal_dirty_metadata+0x1e/0x46
>   [<ffffffff802f3e9b>] ? ext4_mark_iloc_dirty+0x3fa/0x463
>   [<ffffffff802f4a81>] ? ext4_mark_inode_dirty+0x134/0x147
>   [<ffffffff802f8b2b>] ? ext4_delete_inode+0x148/0x209
>   [<ffffffff802f89e3>] ? ext4_delete_inode+0x0/0x209
>   [<ffffffff802a7472>] ? generic_delete_inode+0x82/0x108
>   [<ffffffff8029ff76>] ? do_unlinkat+0xe2/0x13b
>   [<ffffffff8050f8ba>] ? error_exit+0x0/0x70
>   [<ffffffff8020bf5a>] ? system_call_fastpath+0x16/0x1b
>
> (full log at http://lartmaker.nl/ext4/softlock-log.txt).
>
> The system was otherwise still responsive, as long as processes didn't 
> access the ext4 fs on the RAID array. I tried to halt the system, 
> which did not work. Finally I powered the machine down manually.
>
> On reboot the system refused to auto-fsck /dev/md0. A manual
> e2fsck -nv /dev/md0 reported:
>
>   e2fsck 1.41.4 (27-Jan-2009)
>   ./e2fsck/e2fsck: Group descriptors look bad... trying backup blocks...
>   Group descriptor 0 checksum is invalid.  Fix? no
>   Group descriptor 1 checksum is invalid.  Fix? no
>   Group descriptor 2 checksum is invalid.  Fix? no
>   [...]
>   Group descriptor 29808 checksum is invalid.  Fix? no
>   newraidfs contains a file system with errors, check forced.
>   Pass 1: Checking inodes, blocks, and sizes
>   Pass 2: Checking directory structure
>   Pass 3: Checking directory connectivity
>   Pass 4: Checking reference counts
>   Pass 5: Checking group summary information
>   Block bitmap differences:  [...]
>   Fix? no
>   Free blocks count wrong for group #0 (23513, counted=464).
>   Fix? no
>   Free blocks count wrong for group #1 (31743, counted=509).
>   Fix? no
>   [...]
>   Free inodes count wrong for group #7748 (8192, counted=940).
>   Fix? no
>   Directories count wrong for group #7748 (0, counted=1).
>   Fix? no
>   Free inodes count wrong for group #7749 (8192, counted=8059).
>   Fix? no
>   Free inodes count wrong (244195317, counted=237646747).
>   Fix? no
>   newraidfs: ***** FILE SYSTEM WAS MODIFIED *****
>   newraidfs: ********** WARNING: Filesystem still has errors **********
>         11 inodes used (0.00%)
>      41796 non-contiguous files (379963.6%)
>       3002 non-contiguous directories (27290.9%)
>            # of inodes with ind/dind/tind blocks: 0/0/0
>            Extent depth histogram: 4423417/4694/3
>   15377150 blocks used (1.57%)
>          0 bad blocks
>        106 large files
>
>    3738164 regular files
>     685644 directories
>       3663 character device files
>       8709 block device files
>         19 fifos
>    2180635 links
>      47335 symbolic links (43028 fast symbolic links)
>         54 sockets
>   --------
>    6664223 files
>   Error writing block 1 (Attempt to write block from filesystem resulted in short write).  Ignore error? no
>   Error writing block 2 (Attempt to write block from filesystem resulted in short write).  Ignore error? no
>   Error writing block 3 (Attempt to write block from filesystem resulted in short write).  Ignore error? no
>   [...]
>   Error writing block 231 (Attempt to write block from filesystem resulted in short write).  Ignore error? no
>   Error writing block 232 (Attempt to write block from filesystem resulted in short write).  Ignore error? no
>
> (full log at http://lartmaker.nl/ext4/e2fsck-md0.txt)
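
The odd-looking percentages in the summary above can be reproduced from the other figures in the same output, which may help when reading the log. A small sketch: the group count, inodes per group and counters are copied from the quoted e2fsck run, while 32768 blocks per group and a 4096-byte block size are assumptions that are not stated in that output.

```python
# Figures copied from the e2fsck output quoted above.
groups = 29808 + 1             # group descriptors are numbered 0..29808
inodes_per_group = 8192        # "Free inodes count wrong for group #7748 (8192, ...)"
inodes_used = 11               # "11 inodes used (0.00%)"
free_inodes_reported = 244195317
noncontig_files = 41796
noncontig_dirs = 3002
blocks_used = 15377150

# Assumptions (not stated in the e2fsck output).
blocks_per_group = 32768
block_size = 4096

total_inodes = groups * inodes_per_group
# Upper bound: the last group of a filesystem may be shorter than the others.
total_blocks = groups * blocks_per_group

def pct(part, whole, digits=1):
    """Percentage of `part` relative to `whole`, rounded to `digits` places."""
    return round(100.0 * part / whole, digits)

# The reported free-inode total is every inode minus the 11 marked as used.
print(total_inodes - inodes_used == free_inodes_reported)  # True

# The huge percentages are the non-contiguous counts divided by those 11 inodes.
print(pct(noncontig_files, inodes_used))   # 379963.6
print(pct(noncontig_dirs, inodes_used))    # 27290.9

# Block usage and overall size under the assumed geometry.
print(pct(blocks_used, total_blocks, 2))               # 1.57
print(round(total_blocks * block_size / 10**12, 2))    # 4.0 (decimal TB)
```

In other words, the summary counters describe an almost empty filesystem (11 inodes in use is the count typically seen right after mkfs), while the scan itself found several million files.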
>
> As suggested in the earlier threads I ran dumpe2fs; once without the 
> -b option, once with -b 32768 and once with -b 98304:
>
> http://lartmaker.nl/ext4/dumpe2fs-md0.txt
> http://lartmaker.nl/ext4/dumpe2fs-md0-32768.txt
> http://lartmaker.nl/ext4/dumpe2fs-md0-98304.txt
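
The block numbers 32768 and 98304 tried above are the first two backup superblock locations for a filesystem with 32768 blocks per group. A minimal sketch of where the remaining backups would be expected, assuming the sparse_super feature is enabled (backups in group 1 and in groups that are powers of 3, 5 and 7), 32768 blocks per group, and a first data block of 0; the function names and the group count of 29809 used in the example are illustrative and should be checked against the actual dumpe2fs output.

```python
def backup_superblock_groups(group_count):
    """Groups that hold a superblock backup under sparse_super (group 0 holds the primary)."""
    groups = {1} if group_count > 1 else set()
    for base in (3, 5, 7):
        g = base
        while g < group_count:
            groups.add(g)
            g *= base
    return sorted(groups)

def backup_superblock_blocks(group_count, blocks_per_group=32768, first_data_block=0):
    """Block numbers of the backups, assuming every group has the same size."""
    return [first_data_block + g * blocks_per_group
            for g in backup_superblock_groups(group_count)]

# Assumption: 29809 groups, i.e. group descriptors 0..29808.
candidates = backup_superblock_blocks(29809)
print(candidates[:5])    # [32768, 98304, 163840, 229376, 294912]
print(len(candidates))   # 21
```

If the first two backups turn out to be damaged as well, the later entries in this list are further candidates to try.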
>
> Output of findsuper:
>
> http://lartmaker.nl/ext4/findsuper.txt
>
> Please let me know if you need more information.
>
> As I said, is there anything I can do to recover my data, or to make 
> sure this doesn't happen again?
>
> Thanks,
>
> JDB.
