Date:	Sat, 05 Jul 2014 06:30:35 +0400
From:	Dmitry Monakhov <dmonakhov@...nvz.org>
To:	Theodore Ts'o <tytso@....edu>, David Jander <david@...tonic.nl>
Cc:	Matteo Croce <technoboy85@...il.com>,
	"Darrick J. Wong" <darrick.wong@...cle.com>,
	linux-ext4@...r.kernel.org
Subject: Re: ext4: journal has aborted

On Fri, 4 Jul 2014 14:45:39 -0400, Theodore Ts'o <tytso@....edu> wrote:
> On Fri, Jul 04, 2014 at 03:45:59PM +0200, David Jander wrote:
> > > 1) Some kind of eMMC driver bug, which is possibly causing the CACHE
> > > FLUSH command not to be sent.
> > 
> > How can I investigate this? According to the fio tests I ran and the
> > explanation Dmitry gave, I conclude that incorrect sending of CACHE-FLUSH
> > commands is the only thing left to rule out on the eMMC driver front,
> > right?
> 
> Can you try using an older kernel?  The report that I quoted from
> John Stultz (https://lkml.org/lkml/2014/6/12/19) indicated that it was
> a problem that showed up in "recent kernels", and a bisection search
> seemed to point towards an unknown problem in the eMMC driver.
> Quoting from https://lkml.org/lkml/2014/6/12/762:
> 
>     "However, despite many many reboots the last good commit in my
>     branch - bb5cba40dc7f079ea7ee3ae760b7c388b6eb5fc3 (mmc: block:
>     Fixup busy detection while...) doesn't ever show the issue. While
>     the immediately following commit which bisect found -
>     e7f3d22289e4307b3071cc18b1d8ecc6598c0be4 (mmc: mmci: Handle CMD
>     irq before DATA irq) always does.
> 
>     The immensely frustrating part is while backing that single change off
>     from its commit sha always makes the issue go away, reverting that
>     change from on top of v3.15 doesn't. The issue persists....."
> 
> > > 2) Some kind of hardware problem involving flash translation layers
> > > not having durable transactions of their flash metadata across power
> > > failures.
> > 
> > That would be like blaming Micron (the eMMC part manufacturer) for faulty
> > firmware... could be, but how can we test this?
> 
> The problem is that people who write these programs end up doing
> one-offs, as opposed to something that is well packaged and stands the
> test of time.  But basically what we want is a program that writes to
> sequential blocks in a block device with the following information:
> 
> *) a timestamp (seconds and microseconds from gettimeofday)
> *) a 64-bit generation number (which is randomly
>    generated and the same for each run of the program)
> *) a 32-bit sequence number (starts at zero and
>    increments once per block)
> *) a 32-bit "sync" number which is written after each time
>    fsync(2) is called while writing to the disk
> *) the sector number where the data was written
> *) a CRC of the above information
> *) some random pattern to fill the rest of the 512 or 4k block,
>    depending on the physical sector size
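
In C, one such on-disk record might look roughly like the sketch below
(the layout and field names are made up here; the rest of the 512-byte
or 4k block would be random filler):

#include <stdint.h>

/* Hypothetical per-block test record -- one per 512-byte or 4k block,
 * the remainder of the block filled with a random pattern. */
struct wtest_record {
        uint64_t tv_sec;        /* gettimeofday(): seconds */
        uint64_t tv_usec;       /* gettimeofday(): microseconds */
        uint64_t generation;    /* random 64-bit id, fixed for one run */
        uint32_t sequence;      /* starts at zero, +1 per block written */
        uint32_t sync_count;    /* bumped after each fsync(2) */
        uint64_t sector;        /* sector number this block was written to */
        uint32_t crc;           /* CRC32 of the fields above */
};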

I've posted a similar utility (hwflush-check) here:
http://oss.sgi.com/archives/xfs/2013-05/msg00443.html
We use it for testing power-failure tolerance of SSDs.
It has a client and a server:
# On the server side, start listening for incoming connections
# Stage 0:
hwflush-check -l
############
# On the tested node
# Stage 1: Create and allocate all files.
hwflush-check -s $SERVER_IP -d $SCRATCH_MNT --prep
# Stage 2: Start the test and let it work for a while
hwflush-check -s $SERVER_IP -d $SCRATCH_MNT & sleep 20; wall "it is time to cut the power cord"
# Stage 3: On the next boot, check that all data was successfully flushed.
hwflush-check -s $SERVER_IP -d $SCRATCH_MNT --check

It is reasonable to repeat steps 2 and 3 several times.
> 
> The program would then write this information to each 512 or 4k block,
> and periodically, say every N blocks (where N should be configurable),
> it would call fsync(), which on an open block device should result in
> a CACHE FLUSH command being sent.
> 
> Ideally, the sequence number and sync number would be output via a
> serial console, or maybe over an ssh connection, so you can see the
> last sequence number and sync number that were written to the disk
> before you yank the power cord out of the wall.
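
A minimal writer along those lines might look like this (just a sketch
with hypothetical field offsets matching the record above; error handling
and the CRC are stubbed out, and the fsync interval N is a placeholder):

/* Write a header + filler into every 4k block of the device, fsync()
 * every N blocks, and report progress on stderr so the last sequence
 * and sync numbers are visible before the power is cut. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

#define BLKSZ      4096
#define SYNC_EVERY 256          /* "N": blocks between fsync() calls */

int main(int argc, char **argv)
{
        if (argc < 2) {
                fprintf(stderr, "usage: %s <block-device>\n", argv[0]);
                return 1;
        }
        int fd = open(argv[1], O_WRONLY);
        if (fd < 0) { perror("open"); return 1; }

        srand((unsigned)time(NULL));
        uint64_t generation = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
        uint32_t seq = 0, sync_count = 0;
        static unsigned char block[BLKSZ];

        for (;;) {
                struct timeval tv;
                gettimeofday(&tv, NULL);
                uint64_t sec = (uint64_t)tv.tv_sec, usec = (uint64_t)tv.tv_usec;

                memset(block, 0xAA, sizeof(block));     /* filler pattern */
                memcpy(block +  0, &sec,        sizeof(sec));
                memcpy(block +  8, &usec,       sizeof(usec));
                memcpy(block + 16, &generation, sizeof(generation));
                memcpy(block + 24, &seq,        sizeof(seq));
                memcpy(block + 28, &sync_count, sizeof(sync_count));
                /* a real tool would also store the sector number and a CRC */

                if (write(fd, block, BLKSZ) != BLKSZ)
                        break;                          /* end of device */
                if (++seq % SYNC_EVERY == 0) {
                        fsync(fd);      /* should trigger a CACHE FLUSH */
                        sync_count++;
                        fprintf(stderr, "seq=%u sync=%u\n", seq, sync_count);
                }
        }
        fsync(fd);
        close(fd);
        return 0;
}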
> 
> Then you have to write another program which scans the disk and makes
> sure that all of the blocks up to the last sync number that was
> output were in fact preserved faithfully on the disk after the power
> outage.
> 
> If not, you know you have a problem.
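
The checking side would be the mirror image (again only a sketch; the
field offsets are the hypothetical ones from the writer above, and a real
tool would also verify the CRC):

/* Re-read the device and verify that every block up to the last synced
 * sequence number carries the expected generation and sequence number. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ 4096

int main(int argc, char **argv)
{
        if (argc < 4) {
                fprintf(stderr, "usage: %s <dev> <generation> <last-synced-seq>\n",
                        argv[0]);
                return 1;
        }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        uint64_t want_gen = strtoull(argv[2], NULL, 0);
        uint32_t last_seq = (uint32_t)strtoul(argv[3], NULL, 0);
        static unsigned char block[BLKSZ];

        for (uint32_t seq = 0; seq <= last_seq; seq++) {
                if (read(fd, block, BLKSZ) != BLKSZ) {
                        fprintf(stderr, "short read at seq %u\n", seq);
                        return 1;
                }
                uint64_t gen;
                uint32_t got_seq;
                memcpy(&gen,     block + 16, sizeof(gen));
                memcpy(&got_seq, block + 24, sizeof(got_seq));

                if (gen != want_gen || got_seq != seq)
                        printf("block %u: stale or corrupt (gen %llu, seq %u)\n",
                               seq, (unsigned long long)gen, got_seq);
        }
        close(fd);
        return 0;
}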
> 
> For bonus credit, you can also have a mode where the disk is
> pre-written with some known repeating pattern, such as 0xDEADBEEF, and
> then see if you have some sectors that do not contain either the
> expected 0xDEADBEEF, or the test pattern described above.
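
A small helper for that mode could classify each sector roughly like this
(hypothetical; it assumes the record layout from the sketches above):

#include <stdint.h>
#include <string.h>

enum sector_state { SECTOR_PREFILL, SECTOR_RECORD, SECTOR_GARBAGE };

/* A sector should contain either the 0xDEADBEEF fill pattern or a valid
 * test record; anything else is garbage written during the power loss. */
static enum sector_state classify_sector(const unsigned char *buf, size_t len,
                                         uint64_t generation)
{
        static const unsigned char pat[4] = { 0xDE, 0xAD, 0xBE, 0xEF };
        size_t i;

        for (i = 0; i + 4 <= len; i += 4)
                if (memcmp(buf + i, pat, 4) != 0)
                        break;
        if (i + 4 > len)
                return SECTOR_PREFILL;          /* entirely 0xDEADBEEF */

        uint64_t gen;
        memcpy(&gen, buf + 16, sizeof(gen));    /* offset from the writer sketch */
        if (gen == generation)
                return SECTOR_RECORD;           /* looks like our test record */

        return SECTOR_GARBAGE;                  /* neither pattern: suspect */
}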
> 
> That indicates a potential problem where, as power drops, the memory can
> sometimes go insane before the hard drive and the DMA engine manage to
> stop.  (Since, as the saying goes, death is not an event, but rather
> a process, and different components inside the computer can die at
> different times.)  SGI discovered this problem very early on, which is
> why SGI machines have extra large capacitors on their power supplies,
> plus a power fail interrupt, so that when they lose power, the system
> can abort DMA transfers before allowing the power to shut down.
> Otherwise, a file system which is doing logical journalling can end up
> with crap getting written on top of some critical file system data
> block, which replaying the journal won't help with, since the journal
> just contains things like "update the mtime to be this value".
> 
> Since ext3 and ext4 do physical block journalling, we don't suffer
> from this problem since we always write the journal first, and the
> journal is flushed to disk before we update the metadata blocks.  On a
> power failure, when we replay the journal, if there is a corrupted fs
> metadata block due to the power failure causing garbage to be written
> to the disk, the damage can get repaired automatically when the
> journal is replayed.  And this is a good thing, since PC class
> hardware generally doesn't have power fail interrupts.  OTOH, ext3/4 is
> not as efficient as file systems such as XFS that do logical
> journalling.  Life is full of tradeoffs....
> 
> > EXT4 was always something like this:
> > 
> > # fsck.ext4 /dev/mmcblk1p2
> > e2fsck 1.42.5 (29-Jul-2012)
> > rootfs: recovering journal
> > Setting free inodes count to 37692 (was 37695)
> > Setting free blocks count to 136285 (was 136291)
> > rootfs: clean, 7140/44832 files, 42915/179200 blocks
> 
> This is normal, and not an indication of a problem.  In ext4, we
> no longer update the free blocks and free inodes counts in the superblock,
> except when we unmount the file system.  Otherwise on large CPU
> systems, needing to grab the superblock lock to update these values on
> each block/inode allocation and deallocation becomes a massive
> scalability bottleneck.
> 
> Instead, at mount time, since we need to read in the block group
> descriptors *anyway*, we just sum the free blocks/inodes in each block
> group descriptor, and update the global counters then.  So the fact
> that the global free blocks and free inodes values in the superblock
> are not up to date after a crash is normal, and expected.  This is why
> e2fsck didn't ask for permission before fixing this issue, which it
> otherwise would have done when run without either the -p or -y option.
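
Purely as an illustration of that idea (this is not the actual ext4 code,
just the shape of it): summing the per-group counters at mount time is a
single pass over the group descriptors.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical: recompute the global free counts from per-group values
 * instead of trusting the (possibly stale) superblock fields. */
struct group_desc { uint32_t free_blocks; uint32_t free_inodes; };

static void recompute_free_counts(const struct group_desc *gd, size_t ngroups,
                                  uint64_t *free_blocks, uint64_t *free_inodes)
{
        *free_blocks = 0;
        *free_inodes = 0;
        for (size_t i = 0; i < ngroups; i++) {
                *free_blocks += gd[i].free_blocks;
                *free_inodes += gd[i].free_inodes;
        }
}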
> 
> > CASE 2: fsck on every other power-cycle:
> > 
> > Same as CASE 1 steps 1...5 and then:
> > 6.- Turn on power and boot again from dirty internal eMMC without running fsck.
> > 7.- Repeat steps 2...5 one more time
> > 8.- Perform steps 6...8 from CASE 1.
> > 
> > With this test, the following difference became apparent:
> > 
> > With EXT3: fsck.ext3 did the same as in CASE 1
> > 
> > With EXT4: I get a long list of errors that are being fixed.
> > It starts like this:
> > 
> > # fsck.ext4 /dev/mmcblk1p2
> > e2fsck 1.42.5 (29-Jul-2012)
> > rootfs: recovering journal
> > rootfs contains a file system with errors, check forced.
> > Pass 1: Checking inodes, blocks, and sizes
> > Inode 4591, i_blocks is 16, should be 8.  Fix<y>? yes
> > Inode 4594, i_blocks is 16, should be 8.  Fix<y>? yes
> > Inode 4595, i_blocks is 16, should be 8.  Fix<y>? yes
> > Inode 4596, i_blocks is 16, should be 8.  Fix<y>? yes
> > Inode 4597, i_blocks is 16, should be 8.  Fix<y>? yes
> > Inode 4598, i_blocks is 16, should be 8.  Fix<y>? yes
> > Inode 4599, i_blocks is 16, should be 8.  Fix<y>? yes
> > Inode 4600, i_blocks is 16, should be 8.  Fix<y>? yes
> > Inode 4601, i_blocks is 16, should be 8.  Fix<y>? yes
> > Inode 4602, i_blocks is 16, should be 8.  Fix<y>? yes
> > Inode 4603, i_blocks is 16, should be 8.  Fix<y>? yes
> 
> *This* is a problem.  And if you're seeing it for such a very large
> number of cases, it makes it seem unlikely to me that it is an ext4/jbd2
> bug.  For one thing, if it's so blatant, I should be able to replicate
> it very easily when using my "kill qemu" test.
> 
> What this makes me suspect is that you may be getting an older version
> of an inode table block.  That's why the above test procedure I
> outlined included the timestamp and the generation number in each
> block.  If after a power failure you start seeing an old generation
> number, then that's a hint that the FTL did something funky.
> 
> And the advantage of using a test program that writes to the raw
> eMMC flash is that it cuts the file system completely out of the
> picture....
> 
> Cheers,
> 
> 						- Ted