[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20111214011028.GA8233@tux1.beaverton.ibm.com>
Date: Tue, 13 Dec 2011 17:10:28 -0800
From: "Darrick J. Wong" <djwong@...ibm.com>
To: Andreas Dilger <adilger.kernel@...ger.ca>,
Theodore Tso <tytso@....edu>
Cc: Sunil Mushran <sunil.mushran@...cle.com>,
Martin K Petersen <martin.petersen@...cle.com>,
Greg Freemyer <greg.freemyer@...il.com>,
Amir Goldstein <amir73il@...il.com>,
linux-kernel <linux-kernel@...r.kernel.org>,
Andi Kleen <andi@...stfloor.org>,
Mingming Cao <cmm@...ibm.com>,
Joel Becker <jlbec@...lplan.org>,
linux-fsdevel <linux-fsdevel@...r.kernel.org>,
linux-ext4@...r.kernel.org, Coly Li <colyli@...il.com>
Subject: Re: [PATCH v2.2 00/23] ext4: Add metadata checksumming
On Tue, Dec 13, 2011 at 04:46:14PM -0800, Darrick J. Wong wrote:
> Hi all,
>
> This patchset adds crc32c checksums to most of the ext4 metadata objects. A
> full design document is on the ext4 wiki[1] but I will summarize that document here.
>
> As much as we wish our storage hardware was totally reliable, it is still
> quite possible for data to be corrupted on disk, corrupted during transfer over
> a wire, or written to the wrong places. To protect against this sort of
> non-hostile corruption, it is desirable to store checksums of metadata objects
> on the filesystem to prevent broken metadata from shredding the filesystem.
>
> The crc32c polynomial was chosen for its improved error detection capabilities
> over crc32 and crc16, and because of its hardware acceleration on current and
> upcoming Intel and Sparc chips.
>
> Each type of metadata object has been retrofitted to store a checksum as follows:
>
> - The superblock stores a crc32c of itself.
> - Each inode stores crc32c(fs_uuid + inode_num + inode_gen + inode +
> slack_space_after_inode)
> - Block and inode bitmaps each get their own crc32c(fs_uuid + group_num +
> bitmap), stored in the block group descriptor.
> - Each extent tree block stores a crc32c(fs_uuid + inode_num + inode_gen +
> extent_entries) in unused space at the end of the block.
> - Each directory leaf block has an unused-looking directory entry big enough to
> store a crc32c(fs_uuid + inode_num + inode_gen + block) at the end of the
> block.
> - Each directory htree block is shortened to contain a crc32c(fs_uuid +
> inode_num + inode_gen + block) at the end of the block.
> - Extended attribute blocks store crc32c(fs_uuid + id + ea_block) in the
> header, where id is, depending on the refcount, either the inode_num and
> inode_gen; or the block number.
> - MMP blocks store crc32c(fs_uuid + mmpblock) at the end of the MMP block.
> - Block groups can now use crc32c instead of crc16.
> - The journal now has a v2 checksum feature flag.
> - crc32c(j_uuid + block) checksums have been inserted into descriptor blocks,
> commit blocks, revoke blocks, and the journal superblock.
> - Each block tag in a descriptor block has a checksum of the related data block.
>
> The patchset for e2fsprogs will be sent under separate cover only to linux-ext4
> as it is quite lengthy (~48 patches).
>
> As far as performance impact goes, I see nearly no change with a standard mail
> server ffsb simulation. On a test that involves only file creation and
> deletion and extent tree modifications, I see a drop of about 50 percent with
> the current kernel crc32c implementation; this improves to a drop of about 20
> percent with the enclosed crc32c implementation. However, given that metadata
> is usually a small fraction of total IO, it doesn't seem like the cost of
> enabling this feature is unreasonable.
>
> There are of course unresolved issues:
>
> - I haven't fixed it up to checksum the exclude bitmap yet. I'll probably
> submit that as an add-on to the snapshot patchset.
>
> - Using the journal commit hooks to delay crc32c calculation until dirty
> buffers are actually being written to disk.
>
> - Interaction with online resize code. Yongqiang seems to be in the process of
> rewriting this not to use custom metadata block write functions, but I haven't
> looked at it very closely yet.
>
> Please have a look at the design document and patches, and please feel free to
> suggest any changes.
>
> v2: Checksum the MMP block, store the checksum type in the superblock, include
> the inode generation in file checksums, and finally solve the problem of limited
> space in block groups by splitting the checksum into two halves.
>
> v2.1: Checksum the reserved parts of the htree tail structure. Fix some flag
> handling bugs with the mb cache init routine wherein bitmaps could fail to be
> checksummed at read time.
>
> v2.2: Reincorporate the FS UUID in the bitmap checksum calcuations. Move all
> disk layout changes to the front and the feature flag enablement to the end of
> the patch set. Fail journal recovery if revoke block fails checksum.
>
> This patchset has been tested on 3.2.0-rc5 on x64, i386, ppc64, and ppc32. The
> patches seems to work fine on all four platforms.
OH COME ON!!!!
stgit helpfully changed the From lines, screwing everything up. Awesome. I
apologize, I wasn't expecting stgit to change the From: lines when I migrated
the disk format changes to the front of the set.
I don't really want to respam the list just to fix this one little thing. If
people want more code changes I'll gladly make them and re-send. If not, then
Ted, I can just send you the whole mess as a big mbox file, or post them on a
webserver somewhere, with correct attribution.
--D
>
> --D
>
> [1] https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists