linux-ext4 - Re: [RFC] Metadata Replication for Ext4

Message-ID: <20111021000959.GC14993@dastard>
Date:	Fri, 21 Oct 2011 11:09:59 +1100
From:	Dave Chinner <david@...morbit.com>
To:	Andreas Dilger <adilger@...ger.ca>
Cc:	Lukas Czerner <lczerner@...hat.com>,
	Aditya Kali <adityakali@...gle.com>,
	"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
	Nauman Rafique <nauman@...gle.com>,
	Theodore Tso <tytso@...gle.com>,
	Ric Wheeler <rwheeler@...hat.com>,
	"Alasdair G.Kergon" <agk@...hat.com>,
	Christoph Hellwig <hch@...radead.org>
Subject: Re: [RFC] Metadata Replication for Ext4

On Wed, Oct 19, 2011 at 10:19:15AM -0600, Andreas Dilger wrote:
> On 2011-10-19, at 8:10 AM, Lukas Czerner <lczerner@...hat.com>
> wrote:
> > On Tue, 18 Oct 2011, Aditya Kali wrote:
> > 
> >> This is a proposal for a new ext4 feature that replicates ext4
> >> metadata and provides recovery in cases where device blocks
> >> storing filesystem metadata go bad. When the filesystem
> >> encounters a bad block during read, it returns EIO to the user.
> >> If this is a data block for some inode then the user
> >> application can handle this error in many different ways. But
> >> if we fail reading a filesystem metadata block (bitmap block,
> >> inode table block, directory block, etc.), we could potentially
> >> lose access to a much larger amount of data and render the
> >> filesystem unusable. It is difficult (and not expected) for the
> >> user application to recover from such filesystem metadata loss.
> >> This problem is observed to be much more severe on SSDs which
> >> tend to show more frequent read errors when compared to disks
> >> over the same duration.
> >> 
> >> There are different ways in which block read errors in
> >> different metadata could be handled. For example, if the
> >> filesystem is unable to read a block/inode allocation bitmap
> >> then we could just assume that all the blocks/inodes in that
> >> block group are allocated and let fsck fix this later. For
> >> inode table and directory blocks, we could play some (possibly
> >> unreliable) tricks with fsck. In either case, the filesystem
> >> will be fully usable only after it’s fsck’d (which is a
> >> disruptive process on production systems). Darrick Wong’s
> >> recent patches for metadata checksumming will detect even more
> >> non-hardware failure related problems, but they don’t offer
> >> any recovery mechanism from the checksum failures.
> >> 
> >> Metadata replication is another approach that can allow the
> >> filesystem to recover from the device read errors or
> >> checksumming errors at runtime and allow continued usage of the
> >> filesystem. In case of read failures or checksum failures,
> >> reading from the replica can allow live recovery of the lost
> >> metadata. This document gives some details about how the Ext4
> >> metadata could be replicated and used by the filesystem.
> > 
> > Hi Aditya,
> > 
> > While reading those three paragraphs I found the idea
> > interesting, however it would be great to have a more generic
> > solution for this problem. One, which comes immediately to mind,
> > is mirroring, however this will, of course, mirror all data, not
> > just metadata.
> > 
> > But we are already marking metadata-read bios (REQ_META), so they
> > get better priority. What about doing the same thing on the write
> > side and having a metadata-mirroring dm target which will mirror
> > only those?
> 
> While I like the idea of metadata mirroring, I share Lukas's
> concern about the added complexity to the code. 
> 
> I've already done a bunch of experimentation with formatting the
> filesystem with flex_bg and storing the metadata in the first
> block group of every 256 block groups, and then allocating these
> block groups on flash via DM, and the other 255 block groups in
> the flex_bg are on RAID-6.  

I've tried playing these sorts of metadata location constraining and
DM mapping games with XFS, too, and came to the conclusion it was
just too fragile to be considered for production use.  Not to
mention it's difficult to configure and maintain from an admin point
of view as well.
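
For reference, the layout described above reduces to simple block-group
arithmetic. A minimal sketch, assuming 4 KiB blocks (so 32768 blocks per
group) and a flex_bg size of 256; the function names and the "flash" and
"raid6" labels are hypothetical and only model the mapping a DM table would
have to encode:

```python
BLOCKS_PER_GROUP = 32768   # assumption: 4 KiB blocks, 8 * 4096 blocks per group
FLEX_BG_SIZE = 256         # groups per flex_bg, as in the setup quoted above


def group_of_block(block: int) -> int:
    """Return the block group that contains a filesystem block number."""
    return block // BLOCKS_PER_GROUP


def backing_device(block: int) -> str:
    """Map a block to the (hypothetical) device its group would live on."""
    group = group_of_block(block)
    # The first group of each flex_bg holds the packed static metadata.
    return "flash" if group % FLEX_BG_SIZE == 0 else "raid6"


def dm_segments(total_groups: int):
    """Yield (start_block, length_in_blocks, device) runs for a DM-style table."""
    group = 0
    while group < total_groups:
        if group % FLEX_BG_SIZE == 0:
            run, dev = 1, "flash"
        else:
            # Remaining groups of this flex_bg, clipped to the filesystem end.
            run = min(FLEX_BG_SIZE - group % FLEX_BG_SIZE, total_groups - group)
            dev = "raid6"
        yield group * BLOCKS_PER_GROUP, run * BLOCKS_PER_GROUP, dev
        group += run
```

In this model every flex_bg contributes two segments to the table, so the
table grows with the size of the filesystem.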

> This locates all of the static metadata in 1 of 256 block groups,
> and ext4 will already prefer to allocate the dynamic metadata in
> the first group of a flex_bg.
> 
> It wouldn't be very difficult to set up the metadata groups as
> RAID-1 mirrors. 
>
> > This way we'll get a generic solution for all file systems; the
> > only thing that a file system should do in order to take
> > advantage of this is to mark its metadata writes accordingly.

This is similar to the conclusion I've come to for XFS - replicating
metadata inside the filesystem is simply too invasive and difficult
to implement sanely. To do it correctly, the filesystem has to know
exactly what areas of the filesystem address space are independent
failure domains to make the correct decision as to where to place
replicated metadata. That's not simple to communicate to the
filesystem from the lower layers of the storage stack.
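
To illustrate what that knowledge would have to look like, here is a minimal
sketch, assuming the lower layers could hand the filesystem a table of
address ranges tagged with a failure-domain ID. No such interface is
described in this thread; `DomainRange`, `domain_of` and
`pick_replica_range` are hypothetical names.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DomainRange:
    start: int    # first block of the range
    length: int   # number of blocks in the range
    domain: int   # hypothetical failure-domain ID (e.g. one physical device)


def domain_of(ranges: List[DomainRange], block: int) -> Optional[int]:
    """Look up the failure domain of a block; ranges must be sorted by start."""
    starts = [r.start for r in ranges]
    i = bisect_right(starts, block) - 1
    if i >= 0 and block < ranges[i].start + ranges[i].length:
        return ranges[i].domain
    return None


def pick_replica_range(ranges: List[DomainRange],
                       primary_block: int) -> Optional[DomainRange]:
    """Return the first range in a different failure domain than the primary."""
    primary = domain_of(ranges, primary_block)
    if primary is None:
        raise ValueError(f"block {primary_block} is not in any known range")
    for r in ranges:
        if r.domain != primary:
            return r
    return None  # only one failure domain: a replica would add no protection
```

The hard part is not the lookup but obtaining and keeping such a table
accurate, which is exactly what is difficult to communicate up the stack.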

The solution I'm looking at is to give XFS a separate "metadata
device" (like we have support for an external log device) and
allocate all metadata on that device. It is a much simpler
solution from a code, maintenance and administration point of view,
and doesn't require a special new device mapper target that
replicates only metadata writes.  All it requires is:

> Right, there needs to be some way for the upper layer to know
> which copy was read, so that in case of a checksum failure it can
> request the other copy.  For RAID-5/6 it would need to know which
> disks were used for parity (if any) and then request parity
> reconstruction with a different disk until it matches the
> checksum. 

...this. And if it can't be done, we get a hard failure. I suspect
the upper layer doesn't even need to care which copy it got
that was bad - if the underlying device has a concept of "primary
copy" for replication/recovery purposes, then all we need is a "read
alternate/secondary version" request, which could simply be a new
REQ_META_ALT request tag....
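
The read path this implies can be sketched as follows. This is a toy
userspace model, not kernel code: `REQ_META_ALT` is only the tag suggested
above, and `MirroredDevice`, `read_metadata` and the use of CRC32 as the
checksum are assumptions made for illustration.

```python
import zlib

REQ_META = 1 << 0       # stand-in for the existing metadata tag
REQ_META_ALT = 1 << 1   # hypothetical "read alternate/secondary copy" tag


class MirroredDevice:
    """Toy device holding a primary and a secondary copy of each block."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}

    def write(self, blkno, data):
        self.primary[blkno] = data
        self.secondary[blkno] = data

    def read(self, blkno, flags):
        # The device, not the caller, decides what "alternate" means.
        copy = self.secondary if flags & REQ_META_ALT else self.primary
        return copy[blkno]


def read_metadata(dev, blkno, expected_crc):
    """Read a metadata block, retrying from the alternate copy on checksum failure."""
    data = dev.read(blkno, REQ_META)
    if zlib.crc32(data) == expected_crc:
        return data
    data = dev.read(blkno, REQ_META | REQ_META_ALT)
    if zlib.crc32(data) == expected_crc:
        return data
    # Neither copy verifies: this is the hard-failure case.
    raise IOError(f"metadata block {blkno}: both copies failed verification")
```

Note that `read_metadata` never learns which physical copy was bad; it only
asks for "the other one", which keeps the filesystem side small.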

The use of an external device for metadata also allows admins to
easily separate data and metadata, grow the metadata space
separately from the data space, put metadata on SSDs instead of spinning
disks, etc. IOWs, this approach kills about 5 XFS feature request
birds with the one stone. :)

There are many, many benefits to this style of metadata replication
and error recovery, not the least being that it is filesystem
independent....

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com