linux-ext4 - Re: ext4: journal has aborted


Message-ID: <20140704122022.GC10514@thunk.org>
Date:	Fri, 4 Jul 2014 08:20:22 -0400
From:	Theodore Ts'o <tytso@....edu>
To:	David Jander <david@...tonic.nl>
Cc:	Dmitry Monakhov <dmonakhov@...nvz.org>,
	Matteo Croce <technoboy85@...il.com>,
	"Darrick J. Wong" <darrick.wong@...cle.com>,
	linux-ext4@...r.kernel.org
Subject: Re: ext4: journal has aborted

On Fri, Jul 04, 2014 at 01:28:02PM +0200, David Jander wrote:
> 
> Here is the output I am getting... AFAICS no problems on the raw device. Is
> this sufficient testing, Ted?

I'm not sure what theory Dmitry was trying to pursue when he requested
that you run the fio test.  Dmitry?

Please note that at this point there may be multiple causes with
similar symptoms that are showing up.  So just because one person
reports one set of data points, such as someone claiming they've seen
this without a power drop to the storage device, it does not follow
that all of the problems were caused by flaky I/O to the device.

Right now, there are multiple theories floating around --- and it may
be that more than one of them is true (i.e., there may be multiple
bugs here).  Some of the possibilities, which again, may not be
mutually exclusive:

1) Some kind of eMMC driver bug, which is possibly causing the CACHE
FLUSH command not to be sent.

2) Some kind of hardware problem involving flash translation layers
not having durable transactions of their flash metadata across power
failures.

3) Some kind of ext4/jbd2 bug, recently introduced, where we are
modifying some ext4 metadata (either the block allocation bitmap or
block group summary statistics) outside of a valid transaction handle.

4) Some other kind of hard-to-reproduce race or wild pointer which is
sometimes corrupting fs data structures.

If someone has an easy-to-reproduce failure case, the first step is to
do a very rough bisection test.  Does the easy-to-reproduce failure go
away if you use 3.14?  3.12?  Also, if you can describe in great
detail your hardware and software configuration, and under what
circumstances the problem reproduces, and when it doesn't, that would
also be critical.  If an unclean shutdown is involved, whether you are
just doing a reset or a full power cycle might also be important.
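
As a rough illustration of the bisection idea (a sketch only, not
anything from this thread): given kernel versions ordered oldest to
newest, and assuming the failure persists in every version after the
one that introduced it, a binary search finds the first failing
version.  The function name first_bad and the reproduces callback are
hypothetical; in practice reproduces would be a manual step (boot that
kernel, run the workload, check for the journal abort), repeated often
enough to trust a "does not reproduce" answer.

```python
from typing import Callable, Optional, Sequence


def first_bad(versions: Sequence[str],
              reproduces: Callable[[str], bool]) -> Optional[str]:
    """Return the earliest version in which the failure reproduces.

    Assumes `versions` is ordered oldest to newest and that the failure,
    once introduced, shows up in every later version.  Returns None if
    the newest version does not fail (nothing to bisect).
    """
    if not versions or not reproduces(versions[-1]):
        return None

    lo, hi = 0, len(versions) - 1   # invariant: versions[hi] is known bad
    while lo < hi:
        mid = (lo + hi) // 2
        if reproduces(versions[mid]):
            hi = mid                # failure already present at mid
        else:
            lo = mid + 1            # failure introduced after mid
    return versions[lo]
```

If several independent bugs are in play, the persistence assumption may
not hold, so a result from this kind of search should be treated as a
hint rather than a conclusion.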

And at this point, because I'm getting very suspicious that there may
be more than one root cause, we should try to keep the debugging of
one person's reproduction, such as David's, separate from another's,
such as Matteo's.  It may be that they ultimately have the same root
cause, and so if one person is able to get an interesting reproduction
result, it would be great for the other person to try running the same
experiment on their hardware/software configuration.  But what we must
not do is assume that one person's experiment is automatically
applicable to other circumstances.

Cheers,

						- Ted