linux-ext4 - ext4 corruption

Message-ID: <87pqmrobix.fsf@algae.riseup.net>
Date:	Sun, 05 Jun 2011 23:59:34 -0400
From:	Micah Anderson <micah@...eup.net>
To:	linux-ext4@...r.kernel.org
Subject: ext4 corruption

I previously wrote about a recent conversion from ext3 to ext4 (on
Debian Squeeze), which went well. However, I seem to be having problems
with the ext4 filesystem.

Yesterday, there was a file in /var/spool/postfix/defer that was giving
an I/O error:

Jun  3 15:00:14 willet postfix/qmgr[29108]: fatal: qmgr_message_alloc:
677AE298316F: remove defer 677AE298316F: Input/output error

Trying to stat it gave the same error. On the console, I noticed a lot
of these:

[6060479.296658] EXT4-fs error (device dm-4): ext4_lookup: deleted inode referenced: 169640807
[6060482.776087] JBD: Spotted dirty metadata buffer (dev = dm-4, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
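
For triage, messages like these can be tallied by device and reporting
function. The following is a minimal Python sketch, not part of the original
report; the function name tally_ext4_errors is made up for illustration, and
the sample lines are the two console messages quoted above.

```python
import re
from collections import Counter

# Matches lines such as:
#   EXT4-fs error (device dm-4): ext4_lookup: deleted inode referenced: ...
EXT4_ERROR = re.compile(r"EXT4-fs error \(device ([^)]+)\): (\w+):")


def tally_ext4_errors(lines):
    """Count EXT4-fs error lines per (device, reporting function)."""
    counts = Counter()
    for line in lines:
        match = EXT4_ERROR.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


sample_log = [
    "[6060479.296658] EXT4-fs error (device dm-4): ext4_lookup: "
    "deleted inode referenced: 169640807",
    "[6060482.776087] JBD: Spotted dirty metadata buffer "
    "(dev = dm-4, blocknr = 0). There's a risk of filesystem corruption "
    "in case of system crash.",
]

for (device, function), count in tally_ext4_errors(sample_log).items():
    print(device, function, count)
```

Only the EXT4-fs error line is counted here; the JBD warning uses a different
prefix and would need its own pattern.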

The system was clearly acting strange, so I decided it was best to touch
/forcefsck and restart to clean up the filesystem.

I got a couple of "Multiply-claimed block(s)" messages, including "(There
are 10 inodes containing multiply-claimed blocks.)", and then I was
required to run fsck again. I did, and the filesystem seemed fine after
the second run (these fscks took hours).

After things seemed clean, I started the system back up and it began to
operate fine. I then began to see the following on the console:

[ 3201.702997] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429952(bit 3456 in group 1722)
[ 3201.714348] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429953(bit 3457 in group 1722)
[ 3201.725665] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429954(bit 3458 in group 1722)
[ 3201.737028] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429955(bit 3459 in group 1722)
[ 3201.748721] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429956(bit 3460 in group 1722)
[ 3201.760021] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429957(bit 3461 in group 1722)
[ 3201.771489] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429958(bit 3462 in group 1722)
[ 3201.782908] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429959(bit 3463 in group 1722)
[ 3201.794281] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429960(bit 3464 in group 1722)
[ 3201.805664] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429961(bit 3465 in group 1722)
[ 3201.818936] JBD: Spotted dirty metadata buffer (dev = dm-4, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
[ 3202.289345] JBD: Spotted dirty metadata buffer (dev = dm-4, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
[ 3202.328925] JBD: Spotted dirty metadata buffer (dev = dm-4, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
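
The block, bit, and group numbers in the double-free messages can be checked
against each other. The sketch below is not part of the original report. It
assumes 32768 blocks per group (the usual value for 4 KiB blocks) and a first
data block of 0, neither of which is stated in this post; the real values
would come from the superblock. The name check_line is hypothetical.

```python
import re

# Assumptions, not taken from the post: 4 KiB blocks give 8 * 4096 = 32768
# blocks per group, and the first data block is 0.
BLOCKS_PER_GROUP = 32768
FIRST_DATA_BLOCK = 0

DOUBLE_FREE = re.compile(
    r"mb_free_blocks: double-free of inode (\d+)'s block (\d+)"
    r"\(bit (\d+) in group (\d+)\)"
)


def check_line(line):
    """Return (reported_block, computed_block) for a double-free line, else None."""
    match = DOUBLE_FREE.search(line)
    if match is None:
        return None
    _inode, block, bit, group = (int(value) for value in match.groups())
    computed = FIRST_DATA_BLOCK + group * BLOCKS_PER_GROUP + bit
    return block, computed


sample = (
    "[ 3201.702997] EXT4-fs error (device dm-4): mb_free_blocks: "
    "double-free of inode 0's block 56429952(bit 3456 in group 1722)"
)

reported, computed = check_line(sample)
print(reported, computed, reported == computed)  # 56429952 56429952 True
```

Under those assumptions, 1722 * 32768 + 3456 = 56429952, which matches the
first message. The ten messages therefore describe ten consecutive blocks in
block group 1722.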

I'm concerned that this happened so quickly after a fsck resolved
issues.

The filesystem is on top of a software RAID mirror, so I failed one set
and ran S.M.A.R.T. short/long tests on the device, re-added it to the
array, waited the 8 hours for the resync, and then did the same thing
with the other element of the array. All SMART tests completed without
error.

I took the machine down to add another disk to the system so I could
have more flexibility to run badblocks tests, and when the system came
back up a fsck of the partition was required. It's been running for
3 hours now, and so far it has only said "Duplicate or bad block in
use!" so I presume it is scanning the entire device for duplicate
blocks. This is what it did during the previous fsck.

Last time it took 8 hours to complete the first pass, and then it had to
do another pass after a reboot, which took 1.5-4 hours (I was sleeping
when it finished). So we've been down for a number of hours now, which
is quite bad.

It's certainly possible that this is not a filesystem issue but a
hardware one; the badblocks tests should give us more conclusive
information. I would love any additional suggestions for what we can do
to conclusively identify what the issue is.

Thanks for reading, and for any thoughts you might have!

micah
