linux-ext4 - Re: ext4: journal has aborted


Message-ID: <20140701163646.GA3126@wallace>
Date:	Tue, 1 Jul 2014 12:36:46 -0400
From:	Eric Whitney <enwlinux@...il.com>
To:	Theodore Ts'o <tytso@....edu>
Cc:	Jaehoon Chung <jh80.chung@...sung.com>,
	"Darrick J. Wong" <darrick.wong@...cle.com>,
	Matteo Croce <technoboy85@...il.com>,
	David Jander <david@...tonic.nl>, linux-ext4@...r.kernel.org
Subject: Re: ext4: journal has aborted

* Theodore Ts'o <tytso@....edu>:
> On Tue, Jul 01, 2014 at 09:07:27PM +0900, Jaehoon Chung wrote:
> > Hi,
> > 
> > I am interested in this problem, because I have run into the same thing.
> > Is it a journal problem?
> > 
> > I used the Linux version 3.16.0-rc3.
> > 
> > [    3.866449] EXT4-fs error (device mmcblk0p13): ext4_mb_generate_buddy:756: group 0, 20490 clusters in bitmap, 20488 in gd; block bitmap corrupt.
> > [    3.877937] Aborting journal on device mmcblk0p13-8.
> > [    3.885025] Kernel panic - not syncing: EXT4-fs (device mmcblk0p13): panic forced after error
> 
> This message means that the file system has detected an inconsistency
> --- specifically, that the number of blocks marked as in use in the
> allocation bitmap is different from what is in the block group
> descriptors.
> 
> The file system has been marked to force a panic after an error, at
> which point e2fsck will be able to repair the inconsistency.
> 
> What's not clear is *how* or why this happened.  It can happen simply
> because of a hardware problem.  (In particular, not all mmc flash
> devices handle power failures gracefully.)  Or it could be a cosmic
> ray, or it might be a kernel bug.
> 
> Normally I would chalk this up to a hardware bug, but it's possible
> that it is a kernel bug.  If people can reliably reproduce the problem
> where no power failures or other unclean shutdowns were involved
> (since the last time the file system was checked using e2fsck) then
> that would be really interesting.

Hi Ted:

I saw a similar failure during 3.16-rc3 (plus ext4 stable fixes plus msync
patch) regression testing on the Pandaboard this morning.  A generic/068 hang
on data_journal required a reboot for recovery (old bug, though rarer lately).
On reboot, the root filesystem - default 4K, and on an SD card - went ro
after the same sort of bad block bitmap / journal abort sequence.  Rebooting
forced a fsck that cleared up the problem.  The target test filesystem was on
a USB-attached disk, and it did not exhibit the same problems on recovery.

So, it looks like there might be more than just hardware involved here, 
although eMMC/flash might be a common denominator.  I'll see if I can come up
with a reliable reproducer once the regression pass is finished if someone
doesn't beat me to it.
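
When hunting for a reproducer, it may help to pull these errors out of a
captured kernel log automatically.  The following is only a rough sketch:
the pattern is copied from the message wording quoted above and may not
match other kernel versions, and find_bitmap_mismatches() is a made-up
helper name, not part of any existing tool.

```python
import re

# Pattern modeled on the ext4_mb_generate_buddy line quoted in this thread.
# Assumption: the wording is stable; other kernels may print it differently.
BUDDY_ERR = re.compile(
    r"EXT4-fs error \(device (?P<dev>[^)]+)\): ext4_mb_generate_buddy:\d+: "
    r"group (?P<group>\d+), (?P<bitmap>\d+) clusters in bitmap, "
    r"(?P<gd>\d+) in gd"
)


def find_bitmap_mismatches(log_text):
    """Return (device, group, bitmap_count, gd_count) for each matching line."""
    hits = []
    for line in log_text.splitlines():
        m = BUDDY_ERR.search(line)
        if m:
            hits.append(
                (m["dev"], int(m["group"]), int(m["bitmap"]), int(m["gd"]))
            )
    return hits
```

Feeding it saved dmesg output after each test run would show which device
and block group tripped the check, and by how many clusters the two counts
disagree.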

Eric
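
For readers who are not ext4 developers, the consistency check behind the
"clusters in bitmap ... in gd" message can be illustrated with a toy model.
This is not the kernel code.  It assumes the two numbers being compared are
free-cluster counts (for a fixed group size that is equivalent to comparing
in-use counts), that a clear bit means "free", and that bits are numbered
from the least significant bit of each byte.  Both function names are
invented for the example.

```python
def count_free_clusters(bitmap, clusters_per_group):
    """Count clear bits (free clusters) among the first clusters_per_group bits."""
    free = 0
    for i in range(clusters_per_group):
        byte = bitmap[i // 8]
        # Assumption: bit i lives at position (i % 8) within its byte.
        if not (byte >> (i % 8)) & 1:
            free += 1
    return free


def check_group(bitmap, clusters_per_group, gd_free_count):
    """Compare the bitmap's free count with the descriptor's recorded count.

    Returns (consistent, bitmap_free_count).
    """
    bitmap_free = count_free_clusters(bitmap, clusters_per_group)
    return bitmap_free == gd_free_count, bitmap_free
```

In this model, a group whose descriptor says 20488 while the bitmap yields
20490 would come back as inconsistent, which is the situation the quoted
error reports.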


> 
> We should probably also change the message so it is a bit more
> understandable to people who aren't ext4 developers.
> 
>      		      	     	 	- Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html