linux-ext4 - Re: ext4: journal has aborted

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20140704224645.GN9508@dastard>
Date:	Sat, 5 Jul 2014 08:46:45 +1000
From:	Dave Chinner <david@...morbit.com>
To:	Theodore Ts'o <tytso@....edu>
Cc:	David Jander <david@...tonic.nl>,
	Dmitry Monakhov <dmonakhov@...nvz.org>,
	Matteo Croce <technoboy85@...il.com>,
	"Darrick J. Wong" <darrick.wong@...cle.com>,
	linux-ext4@...r.kernel.org
Subject: Re: ext4: journal has aborted

On Fri, Jul 04, 2014 at 02:45:39PM -0400, Theodore Ts'o wrote:
> On Fri, Jul 04, 2014 at 03:45:59PM +0200, David Jander wrote:
> > > 1) Some kind of eMMC driver bug, which is possibly causing the CACHE
> > > FLUSH command not to be sent.
> > 
> > How can I investigate this? According to the fio tests I ran and the
> > explanation Dmitry gave, I conclude that incorrectly sending of CACHE-FLUSH
> > commands is the only thing left to be discarded on the eMMC driver front,
> > right?
> 
> Can you try using an older kernel?  The report that that I quoted from
> John Stultz (https://lkml.org/lkml/2014/6/12/19) indicated that it was
> a problem that showed up in "recent kernels", and a bisection search
> seemed to point towards an unknown problem in the eMMC driver.
> Quoting from https://lkml.org/lkml/2014/6/12/762:
> 
>     "However, despite many many reboots the last good commit in my
>     branch - bb5cba40dc7f079ea7ee3ae760b7c388b6eb5fc3 (mmc: block:
>     Fixup busy detection while...) doesn't ever show the issue. While
>     the immediately following commit which bisect found -
>     e7f3d22289e4307b3071cc18b1d8ecc6598c0be4 (mmc: mmci: Handle CMD
>     irq before DATA irq) always does.
> 
>     The immensely frustrating part is while backing that single change off
>     from its commit sha always makes the issue go away, reverting that
>     change from on top of v3.15 doesn't. The issue persists....."
> 
> > > 2) Some kind of hardware problem involving flash translation layers
> > > not having durable transactions of their flash metadata across power
> > > failures.
> > 
> > That would be like blaming Micron (the eMMC part manufacturer) for faulty
> > firmware... could be, but how can we test this?
> 
> The problem is that people who write these programs end up doing
> one-offs, as opposed to something that is well packaged and stands the
> test of time.  But basically what we want is a program that writes to
> sequential blocks in a block device with the following information:
> 
> *) a timestamp (seconds and microseconds from gettimeofday)
> *) a 64-bit generation number (which is randomly
>    generated and the same for each run of the progam)
> *) a 32-bit sequence number (starts at zero and
>    increments once per block
> *) a 32-bit "sync" number which is written after each time
>    fsync(2) is called while writing to the disk
> *) the sector number where the data was written
> *) a CRC of the above information
> *) some random pattern to fill the rest of the 512 or 4k block,
>    depending on the physical sector size

genstream + checkstream.

http://oss.sgi.com/projects/nfs/testtools/

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html