linux-kernel - Re: EXT4-fs error, kernel BUG

Message-ID: <20140805125114.GG5263@thunk.org>
Date:	Tue, 5 Aug 2014 08:51:14 -0400
From:	Theodore Ts'o <tytso@....edu>
To:	linux kernel mailing list <linux-kernel@...r.kernel.org>
Cc:	martin f krafft <madduck@...duck.net>
Subject: Re: EXT4-fs error, kernel BUG

On Tue, Aug 05, 2014 at 12:34:36PM +0200, martin f krafft wrote:
> Dear kernel people,
> 
> Yesterday, I encountered something weird on one of our NAS machines:
> 
>   Aug  4 20:09:40 julia kernel: [342873.007709] EXT4-fs error (device dm-6): ext4_ext_check_inode:481: inode #30414321: comm du: pblk 0 bad header/extent: invalid extent entries - magic f30a, entries 1, max 4(4), depth 0(0)
> 
> but a fsck -f of the filesystem revealed no problems.

One likely cause of this issue is that the hardware hiccuped on a
read, and returned garbage, which is what triggered the "EXT4-fs
error" message (which is really a report of a detected file system
inconsistency).  A common cause of this is the block address getting
corrupted, so that the hard drive read the correct data from the wrong
location.
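
For anyone who wants to track these reports across several machines, here is
a minimal sketch that splits a message of the form quoted above into named
fields.  The regular expression is written against that one sample line only;
other ext4 error messages use different layouts and will simply not match.
The function name parse_ext4_error and the field names are made up for
illustration.  Note that f30a is the normal ext4 extent magic, and the values
in parentheses appear to be what the kernel expected, so the header itself
looks sane in this report.

```python
import re

# Pattern for the ext4_ext_check_inode message layout quoted in this thread.
# Assumption: other EXT4-fs error messages are formatted differently.
EXT4_ERR = re.compile(
    r"EXT4-fs error \(device (?P<device>[^)]+)\): "
    r"(?P<func>\w+):(?P<line>\d+): "
    r"inode #(?P<inode>\d+): comm (?P<comm>\S+): "
    r"pblk (?P<pblk>\d+) bad header/extent: (?P<reason>.+?) - "
    r"magic (?P<magic>[0-9a-f]+), entries (?P<entries>\d+), "
    r"max (?P<max>\d+)\((?P<max_expected>\d+)\), "
    r"depth (?P<depth>\d+)\((?P<depth_expected>\d+)\)"
)

INT_FIELDS = ("line", "inode", "pblk", "entries",
              "max", "max_expected", "depth", "depth_expected")


def parse_ext4_error(log_line):
    """Return a dict of fields from the log line, or None if it does not match."""
    match = EXT4_ERR.search(log_line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in INT_FIELDS:
        fields[key] = int(fields[key])
    # The magic is printed in hex without a 0x prefix.
    fields["magic"] = int(fields["magic"], 16)
    return fields
```

Grouping the parsed records by device and inode makes it easier to see whether
the same block keeps failing or whether the errors move around, which would
point more toward flaky hardware.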

The other likely cause is that you are using something like RAID1, and
one of the copies of the disk block really is corrupted, and the
kernel read the bad version of the block, but fsck managed to read the
good version of the block.
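
If the RAID1 in question is Linux md software RAID (an assumption; the report
only mentions dm-6), the md driver can compare the copies.  Writing "check" to
/sys/block/<array>/md/sync_action as root starts a scrub, and once it finishes
mismatch_cnt holds the amount of data found to differ between copies.  The
sketch below only reads that counter.  The array name md0 in the usage note
and the helper name read_mismatch_cnt are hypothetical.

```python
from pathlib import Path


def read_mismatch_cnt(md_name, sysfs_root="/sys/block"):
    """Read the md mismatch counter for one array, e.g. md_name="md0".

    Assumes a Linux md array; sysfs_root is a parameter only so the
    function can be pointed at a test directory.
    """
    path = Path(sysfs_root) / md_name / "md" / "mismatch_cnt"
    return int(path.read_text().strip())
```

Calling read_mismatch_cnt("md0") after a completed check gives a first hint: a
non-zero value means the copies disagree somewhere, although it does not say
which copy is the good one.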

It's possible that this was caused by a memory corruption, but it
wouldn't have been high on my suspect list.  Still, if this is a new
machine, it might not be a bad idea to run memtest86+ for 24-48 hours.

> So I set up another filesystem and tried to copy over the data from
> /dev/dm-6, using tar.
> 
> Shortly afterwards, there was a wall message like
> 
>   BUG: soft lockup - CPU#0 stuck for 23s! [kswapd0:28]

From the stack traces, it looks like the system was thrashing, trying
to free memory to make forward progress (i.e., it was under high memory
pressure).  Exactly why this happened is not something I can determine
from the stack traces, sorry.  It could be that when the soft lockup
happened, you had more processes running, or that some of the processes
(samba? apache?) were using more memory, and this was a factor.  Why the
OOM killer didn't kill the processes I can't tell you.

> Is there anything in the following back traces that would help me
> identify the source of the problem with greater confidence?

Sorry, that's about all that can be divined from your kernel stack
traces.

It might be worth checking the system logs for any suspicious error
messages beyond just the EXT4-fs error message, but you may have done
that already.
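
As a rough sketch of that kind of log check, the filter below keeps only lines
matching a handful of patterns that often accompany storage or hardware
trouble.  The pattern list is an illustrative guess, not a complete or
authoritative set, and the name suspicious_lines is made up; adjust it for the
controller and drivers actually in use.

```python
import re

# Illustrative patterns only; extend or trim for the hardware in question.
SUSPICIOUS = re.compile(
    r"I/O error|Medium Error|hard resetting link|Machine check"
    r"|EXT4-fs (?:error|warning)",
    re.IGNORECASE,
)


def suspicious_lines(lines):
    """Return the log lines (without trailing newline) that match a pattern."""
    return [line.rstrip("\n") for line in lines if SUSPICIOUS.search(line)]
```

It can be fed an open log file directly, for example
suspicious_lines(open("/var/log/kern.log")), where the path is an assumption
about where the distribution keeps kernel messages.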

Good luck,

						- Ted