Date:	Fri, 3 Jan 2014 17:29:32 +0100
From:	"Juergens Dirk (CM-AI/ECO2)" <Dirk.Juergens@...bosch.com>
To:	Theodore Ts'o <tytso@....edu>,
	"Huang Weller (CM/ESW12-CN)" <Weller.Huang@...bosch.com>
CC:	"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>
Subject: Re: ext4 filesystem bad extent error review

On Thu, Jan 02, 2014 at 19:42, Theodore Ts'o [mailto:tytso@....edu]
wrote:
> On Thu, Jan 02, 2014 at 12:59:52PM +0800, Huang Weller (CM/ESW12-CN)
> wrote:
> >
> > We did more tests in which we backed up the journal blocks before
> > mounting the test partition. Before mounting, we ran fsck.ext4 with
> > the -n option to check whether there were any bad extent issues;
> > fsck.ext4 never found any such issue. So we can show that the bad
> > extents issue happens after journal replay.
> 
> Ok, so that implies that the failure is almost certainly due to
> corrupted blocks in the journal.  Hence, when we replay the journal, it
> causes the file system to become corrupted, because the "newer"
> (and presumably, "more correct") metadata blocks found in the blocks
> recorded in the journal are in fact corrupted.
> 
.....
> >
> > We searched for this error on the internet; others have reported the
> > same issue, but there is no solution. It may not be a big issue,
> > since it can be repaired easily by fsck.ext4, but we have the
> > following questions:
> > 1. Has this issue already been fixed in the latest kernel version?
> > 2. Based on the information provided in this mail, can you help to
> >    solve this issue?
> 
> Well, the question is how did the journal get corrupted?  It's possible
> that it's caused by a kernel bug, although I'm not aware of any such
> bugs being reported.
> 
> In my mind, the most likely cause is that the SD card is ignoring the
> CACHE FLUSH command, or is not properly saving the SD card's Flash
> Translation Layer (FTL) metadata on a power drop.  

Yes, this could be a possible reason, but we ran exactly the same test
not only with power drops but also with iMX watchdog resets only.
In the latter case there was no power drop for the eMMC, yet we
observed exactly the same kind of inode corruption.

During thousands of test loops with power drops or watchdog resets, while 
creating thousands of files with multiple threads, we did not observe any 
other kind of ext4 metadata damage or file content damage. 

So far, in the error case we have always found only a single damaged inode.
The other inodes before and after the damaged inode in the journal, in the
same logical 4096-byte block, appear to be intact and valid (examined with
a hex editor). And in all the failure cases - as far as we can tell from
the ext4 disk layout documentation - only the ee_len field, or the
ee_start_hi and ee_start_lo fields, are wrong (i.e. zeroed).
    
The eMMC has no "knowledge" about the logical meaning or the offset of 
ee_len or ee_start. Thus, it does not seem very likely that whatever kind of
internal failure or bug in the eMMC controller/firmware always and only
damages these few bytes.
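
For reference, this is the on-disk extent entry as described in the ext4
disk layout documentation (struct ext4_extent in the kernel's
fs/ext4/ext4_extents.h); the fields we find zeroed are ee_len, or
ee_start_hi/ee_start_lo, while ee_block still looks sane:

/* On-disk extent entry (12 bytes), from the ext4 disk layout. */
struct ext4_extent {
	__le32	ee_block;	/* first logical block covered by this extent */
	__le16	ee_len;		/* number of blocks covered */
	__le16	ee_start_hi;	/* high 16 bits of the physical block number */
	__le32	ee_start_lo;	/* low 32 bits of the physical block number */
};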

> What I tell people who are using flash devices is before they start
> using any flash device, to do power drop testing on a raw device,
> without any file system present.  The simplest way to do this is to
> write a program that writes consecutive 4k blocks that contain a
> timestamp, a sequence number, some random data, and a CRC-32 checksum
> over the contents of the timestamp, sequence number, a flags word, and
> random data.  As the program writes each such 4k block, it rolls the dice
> and once every 64 blocks or so (i.e., pick a random number, and see if
> it is divisible by 64), then set a bit in the flags word indicating
> that this block was forced out using a cache flush, and then when
> writing this block, follow up the write with a CACHE FLUSH command.
> It's also best if the test program prints the blocks which have been
> written with CACHE FLUSH to the serial console, and that this is saved
> by your test rig.

We did similar tests in the past, but not yet with this particular type
of eMMC. I think we should repeat them with it; a sketch of such a test
program is included below.
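
A minimal sketch of such a test program, following the description above.
The device path, the use of O_DIRECT for unbuffered 4k writes, and
fdatasync() on the block device fd to trigger the CACHE FLUSH are our
assumptions, not something prescribed in this thread. It runs against the
raw device (no file system); after a power drop one reads the blocks back
and checks sequence numbers and CRCs, using the serial log of flushed
blocks as the reference.

/* build (assumption): gcc -O2 -o flush_test flush_test.c -lz */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <zlib.h>		/* crc32() */

#define BLK 4096
#define FLAG_FLUSHED 0x1	/* block was followed by a CACHE FLUSH */

struct blk_hdr {
	uint64_t timestamp;	/* wall-clock time in ns */
	uint64_t seq;		/* consecutive sequence number */
	uint32_t flags;
	uint32_t crc;		/* CRC-32 over the whole block with crc = 0 */
};

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/mmcblk0"; /* assumption */
	int fd = open(dev, O_WRONLY | O_DIRECT);
	if (fd < 0) { perror("open"); return 1; }

	void *buf;
	if (posix_memalign(&buf, BLK, BLK)) return 1;
	srand(time(NULL));

	for (uint64_t seq = 0; ; seq++) {
		struct blk_hdr *h = buf;
		struct timespec ts;

		memset(buf, 0, BLK);
		clock_gettime(CLOCK_REALTIME, &ts);
		h->timestamp = (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
		h->seq = seq;
		h->flags = (rand() % 64 == 0) ? FLAG_FLUSHED : 0;

		/* fill the rest of the block with random data */
		for (unsigned char *p = (unsigned char *)buf + sizeof(*h);
		     p < (unsigned char *)buf + BLK; p++)
			*p = (unsigned char)rand();

		/* CRC covers header (crc field still 0) plus payload */
		h->crc = (uint32_t)crc32(0, buf, BLK);

		if (pwrite(fd, buf, BLK, (off_t)(seq * BLK)) != BLK) {
			perror("pwrite");	/* end of device or I/O error */
			break;
		}
		if (h->flags & FLAG_FLUSHED) {
			fdatasync(fd);		/* forces the CACHE FLUSH */
			printf("flushed through block %llu\n",
			       (unsigned long long)seq);
			fflush(stdout);		/* keep the serial log current */
		}
	}
	return 0;
}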

> 
> (This is what ext4's journal does before and after writing the commit
> block in the journal, and it guarantees that (a) all of the data in the
> journal written up to the commit block will be available after a power
> drop, and (b) that the commit block has been written to the storage
> device and again, will be available after a power drop.)
>

Well, we also ran the same tests with journal_checksum enabled. We were
still able to reproduce the failure without any checksumming error. So we
believe that the respective transaction (as well as all the others) was
complete and not corrupted by the eMMC.
Is this a valid assumption? If so, I would conclude that the corrupted
inode was really written to the eMMC in this state, rather than being
corrupted by the eMMC.

(BTW, we do know that journal_checksum is somewhat critical and might make
things worse, but for test purposes, and to rule out that the eMMC delivers
corrupted transactions when the data is read back, it seemed a reasonable
approach.)
 
So, I think there _might_ be a kernel bug, but it could also be a problem
related to this particular type of eMMC. We did not observe the same issue
in previous tests with another type of eMMC from another supplier, but that
was with an older kernel patch level and a different HW design.

Regarding a possible kernel bug: is there any chance that the invalid
ee_len or ee_start values are returned by, e.g., the block allocator?
If so, could we instrument the code to get suitable traces, just to
confirm or rule out that the corrupted inode is really written to the
eMMC in this state?
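
As a rough illustration of what we have in mind - purely a sketch, with
the helper name and the exact hook point in fs/ext4/extents.c being our
own assumptions - a check like the following in the extent insert path
could tell us whether the zeroed values already exist in memory before
the buffer is handed to jbd2:

/*
 * Hypothetical debug helper (sketch only).  Intended to be called from
 * the extent insert path in fs/ext4/extents.c, where the required
 * headers are already included.  ee_len == 0 or a physical start of 0
 * is never valid, so this is a cheap heuristic for the corruption we see.
 */
static void dbg_check_extent(struct inode *inode, struct ext4_extent *ex)
{
	if (le16_to_cpu(ex->ee_len) == 0 ||
	    (le16_to_cpu(ex->ee_start_hi) == 0 &&
	     le32_to_cpu(ex->ee_start_lo) == 0))
		pr_warn("ext4 debug: suspicious extent on inode %lu: lblk %u len %u start_hi %u start_lo %u\n",
			inode->i_ino,
			le32_to_cpu(ex->ee_block),
			le16_to_cpu(ex->ee_len),
			le16_to_cpu(ex->ee_start_hi),
			le32_to_cpu(ex->ee_start_lo));
}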


Best regards

Dirk Juergens

Robert Bosch Car Multimedia GmbH
