Message-Id: <1183149414.12702.10.camel@kleikamp.austin.ibm.com>
Date: Fri, 29 Jun 2007 15:36:53 -0500
From: Dave Kleikamp <shaggy@...ux.vnet.ibm.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: "Theodore Ts'o" <tytso@....edu>,
Andreas Dilger <adilger@...sterfs.com>,
Mike Waychison <mikew@...gle.com>,
Sreenivasa Busam <sreenivasac@...gle.com>,
"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>
Subject: Re: fallocate support for bitmap-based files
On Fri, 2007-06-29 at 13:01 -0700, Andrew Morton wrote:
> Guys, Mike and Sreenivasa at google are looking into implementing
> fallocate() on ext2. Of course, any such implementation could and should
> also be portable to ext3 and ext4 bitmapped files.
>
> I believe that Sreenivasa will mainly be doing the implementation work.
>
>
> The basic plan is as follows:
>
> - Create (with tune2fs and mke2fs) a hidden file using one of the
> reserved inode numbers. That file will be sized to have one bit for each
> block in the partition. Let's call this the "unwritten block file".
>
> The unwritten block file will be initialised with all-zeroes
>
> - at fallocate()-time, allocate the blocks to the user's file (in some
> yet-to-be-determined fashion) and, for each one which is uninitialised,
> set its bit in the unwritten block file. The set bit means "this block
> is uninitialised and needs to be zeroed out on read".
>
> - truncate() would need to clear out set-bits in the unwritten blocks file.
That could be done by truncating the blocks file at the correct byte
offset, leaving only some bits in the last byte of the file to be zeroed.
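A minimal userspace sketch of that last-byte masking, assuming bit i of
the bitmap lives in byte i/8 at bit position i%8 (the helper name
bitmap_truncate is made up for illustration and is not an existing
ext2 function):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: keep only the first nbits bits of a bitmap.
 * Returns the new length in bytes (the offset to truncate the file to)
 * and zeroes the unused high bits of the last remaining byte.
 */
static size_t bitmap_truncate(uint8_t *map, size_t nbits)
{
    size_t nbytes = (nbits + 7) / 8;   /* round up to whole bytes */
    unsigned int tail = nbits % 8;     /* bits still in use in the last byte */

    if (tail)
        map[nbytes - 1] &= (uint8_t)((1u << tail) - 1);

    return nbytes;
}
```

Everything past the returned length goes away with the truncate itself,
so the only read-modify-write is on that one partial byte.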
> - When the fs comes to read a block from disk, it will need to consult
> the unwritten blocks file to see if that block should be zeroed by the
> CPU.
>
> - When the unwritten-block is written to, its bit in the unwritten blocks
> file gets zeroed.
>
> - An obvious efficiency concern: if a user file has no unwritten blocks
> in it, we don't need to consult the unwritten blocks file.
>
> Need to work out how to do this. An obvious solution would be to have
> a number-of-unwritten-blocks counter in the inode. But do we have space
> for that?
Would it be too expensive to test the blocks-file page each time a bit
is cleared to see if it is all-zero, and then free the page, making it a
hole? This test would stop if it finds any non-zero word, so it may not
be too bad. (This could further be done on a block basis if the block
size is less than a page.)
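A rough sketch of that test, in plain userspace C rather than kernel
code. The function names are hypothetical, and the buffer stands in for
one page (or one block) of the blocks file:

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Scan word by word and stop at the first non-zero word. */
static bool words_all_zero(const unsigned long *words, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        if (words[i] != 0)
            return false;
    return true;
}

/*
 * Clear one bit, then report whether the whole buffer is now zero.
 * A true result is the point where the caller could free the page
 * and leave a hole in the blocks file.
 */
static bool clear_bit_and_test_empty(unsigned long *words, size_t nwords,
                                     size_t bit)
{
    words[bit / BITS_PER_WORD] &= ~(1UL << (bit % BITS_PER_WORD));
    return words_all_zero(words, nwords);
}
```

The scan costs the most when the page really is empty, or when its only
set bits are near the end, because the loop then has to walk most or
all of the words before it can answer.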
> (I expect google and others would prefer that the on-disk format be
> compatible with legacy ext2!)
>
> - One concern is the following scenario:
>
> - Mount fs with "new" kernel, fallocate() some blocks to a file.
>
> - Now, mount the fs under "old" kernel (which doesn't understand the
> unwritten blocks file).
>
> - This kernel will be able to read uninitialised data from that
> fallocated-to file, which is a security concern.
>
> - Now, the "old" kernel writes some data to a fallocated block. But
> this kernel doesn't know that it needs to clear that block's flag in
> the unwritten blocks file!
>
> - Now mount that fs under the "new" kernel and try to read that file.
> The flag for the block is set, so this kernel will still zero out the
> data on a read, thus corrupting the user's data
>
> So how to fix this? Perhaps with a per-inode flag indicating "this
> inode has unwritten blocks". But to fix this problem, we'd require that
> the "old" kernel clear out that flag.
>
> Can anyone propose a solution to this?
>
> Ah, I can! Use the compatibility flags in such a way as to prevent the
> "old" kernel from mounting this filesystem at all. To mount this fs
> under an "old" kernel the user will need to run some tool which will
>
> - read the unwritten blocks file
>
> - for each set-bit in the unwritten blocks file, zero out the
> corresponding block
>
> - zero out the unwritten blocks file
>
> - rewrite the superblock to indicate that this fs may now be mounted
> by an "old" kernel.
>
> Sound sane?
Yeah. I think it would have to be done under a compatibility flag. Is
going back to an older kernel really that important? I think it's more
important to make sure it can't be mounted by an older kernel if bad
things can happen, and they can.
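To make the gating concrete, here is a small sketch over an in-memory
image. It assumes the usual incompat-mask rule (refuse the mount if the
filesystem sets any incompat bit the kernel does not know). The flag
name and value, and both function names, are invented for illustration
and are not real ext2 definitions:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical feature bit; not a real ext2 value. */
#define DEMO_INCOMPAT_UNWRITTEN 0x8000u

/* Mount gate: refuse if the fs sets any incompat bit we do not support. */
static bool can_mount(uint32_t fs_incompat, uint32_t kernel_supported)
{
    return (fs_incompat & ~kernel_supported) == 0;
}

/*
 * Downgrade pass, mirroring the steps quoted above: zero every block
 * whose bit is set, zero the unwritten blocks file, then drop the flag.
 * Assumes bit b of unwritten_map describes block b.
 */
static void downgrade(uint8_t *blocks, size_t block_size, size_t nblocks,
                      uint8_t *unwritten_map, uint32_t *fs_incompat)
{
    for (size_t b = 0; b < nblocks; b++)
        if (unwritten_map[b / 8] & (1u << (b % 8)))
            memset(blocks + b * block_size, 0, block_size);

    memset(unwritten_map, 0, (nblocks + 7) / 8);
    *fs_incompat &= ~DEMO_INCOMPAT_UNWRITTEN;
}
```

The flag is cleared last, so an interrupted downgrade leaves the fs
still unmountable by the "old" kernel rather than half-converted.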
Shaggy
--
David Kleikamp
IBM Linux Technology Center