Message-ID: <4685829D.2020401@google.com>
Date: Fri, 29 Jun 2007 18:07:25 -0400
From: Mike Waychison <mikew@...gle.com>
To: Andrew Morton <akpm@...ux-foundation.org>
CC: Theodore Tso <tytso@....edu>,
Andreas Dilger <adilger@...sterfs.com>,
Sreenivasa Busam <sreenivasac@...gle.com>,
"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>
Subject: Re: fallocate support for bitmap-based files
Andrew Morton wrote:
> On Fri, 29 Jun 2007 16:55:25 -0400
> Theodore Tso <tytso@....edu> wrote:
>
>
>>On Fri, Jun 29, 2007 at 01:01:20PM -0700, Andrew Morton wrote:
>>
>>>Guys, Mike and Sreenivasa at google are looking into implementing
>>>fallocate() on ext2. Of course, any such implementation could and should
>>>also be portable to ext3 and ext4 bitmapped files.
>>
>>What's the eventual goal of this work? Would it be for mainline use,
>>or just something that would be used internally at Google?
>
>
> Mainline, preferably.
>
>
>> I'm not
>>particularly enthused about supporting two ways of doing fallocate();
>>one for ext4 and one for bitmap-based files in ext2/3/4. Is the
>>benefit really worth it?
>
>
> umm, it's worth it if you don't want to wear the overhead of journalling,
> and/or if you don't want to wait on the, err, rather slow progress of ext4.
>
>
>>What I would suggest, which would make this much easier, is to make
>>this an incompatible extension (which, as you point out, is needed for
>>security reasons anyway) and then steal the high bit from the block
>>number field to indicate whether or not the block has been
>>initialized. That way you don't end up having to seek to a potentially
>>distant part of the disk to check out the bitmap. Also, you don't
>>have to worry about how to recover if the "block initialized bitmap"
>>inode gets smashed.
>>
>>The downside is that it reduces the maximum size of the filesystem
>>supported by ext2 by a factor of two. But, there are at least two
>>patch series floating about that promise to allow filesystem block
>>sizes larger than PAGE_SIZE, which would allow you to recover the
>>maximum size supported by the filesystem.
>>
>>Furthermore, I suspect (especially after listening to a very fascinating
>>Usenix Invited Talk by Jeffery Dean, a fellow from Google two weeks
>>ago) that for many of Google's workloads, using a filesystem blocksize
>>of 16K or 32K might not be a bad thing in any case.
>>
>>It would be a lot simpler....
>>
>
>
> Hadn't thought of that.
>
> Also, it's unclear to me why google is going this way rather than using
> (perhaps suitably-tweaked) ext2 reservations code.
>
> Because the stock ext2 block allocator sucks big-time.

The primary reason this is a problem is that our writers into these
files aren't necessarily coming from the same hosts in the cluster, so
their writes don't arrive in sequential order. It ends up looking to the
kernel like a random write workload, which in turn causes odd,
non-deterministic fragmentation patterns. That data is often eventually
streamed off the disk though, which is when the fragmentation hurts.

Currently, our clustered filesystem supports pre-allocation of the
target chunks of files, but this is implemented by effectively writing
zeroes to files, which in turn causes pagecache churn and a double
write-out of the blocks. Recently, we've changed the code to minimize
this pagecache churn and double write-out by performing an ftruncate to
extend files, but then we'll be back to square one in terms of
fragmentation for the random writes.
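
To make the trade-off between these approaches concrete, here is a minimal
userspace sketch, not the clustered filesystem's actual code. It uses
Python's os.ftruncate and os.posix_fallocate as stand-ins; the CHUNK size
and all helper names are assumptions made for illustration. Note that the C
library may emulate posix_fallocate by writing to the file when the
underlying filesystem has no native support, which is exactly the gap being
discussed in this thread.

```python
import os
import tempfile

CHUNK = 4 * 1024 * 1024  # hypothetical chunk size (4 MiB), not from the thread


def extend_zero_fill(fd, size):
    """Pre-allocate by writing zeroes: blocks get allocated, but every
    byte passes through the page cache and is written out to disk."""
    os.lseek(fd, 0, os.SEEK_SET)
    buf = bytes(64 * 1024)
    remaining = size
    while remaining > 0:
        remaining -= os.write(fd, buf[:min(len(buf), remaining)])
    os.fsync(fd)


def extend_sparse(fd, size):
    """Extend with ftruncate: the size changes, but block allocation is
    deferred until data is written, so random writers can still fragment."""
    os.ftruncate(fd, size)


def extend_preallocate(fd, size):
    """Ask the filesystem to allocate the whole range up front."""
    os.posix_fallocate(fd, 0, size)


def measure(extend):
    """Return (apparent size, bytes of backing storage) after extending."""
    with tempfile.TemporaryFile() as f:
        extend(f.fileno(), CHUNK)
        st = os.fstat(f.fileno())
        return st.st_size, st.st_blocks * 512  # st_blocks is in 512-byte units


if __name__ == "__main__":
    for fn in (extend_zero_fill, extend_sparse, extend_preallocate):
        size, backed = measure(fn)
        print(f"{fn.__name__}: size={size} backed={backed}")
```

All three leave the file with the same apparent size; they differ in how
much I/O they cost and in when the block allocator gets to choose the
layout. The exact backing-storage numbers depend on the filesystem in use.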

Relying on (a tweaked) reservations code is also somewhat limiting at
this stage given that reservations are lost on close(fd). Unless we
change the lifetime of the reservations (maybe for the lifetime of the
in-core inode?), crank up the reservation sizes and deal with the
overcommit issues, I can't think of any better way at this time to deal
with the problem.

Mike Waychison
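
For reference, the high-bit scheme suggested in the quoted text, and its
effect on maximum filesystem size, can be sketched as follows. This is an
illustration only, not ext2 code: the 32-bit width matches ext2's on-disk
block pointers, but the flag name, the helper functions, and the choice of
"bit set means uninitialized" are assumptions.

```python
BLOCK_NR_BITS = 32                      # ext2 block pointers are 32 bits wide
UNINIT_FLAG = 1 << (BLOCK_NR_BITS - 1)  # hypothetical: high bit marks "not yet initialized"
BLOCK_MASK = UNINIT_FLAG - 1            # low 31 bits hold the real block number


def pack(block_nr, initialized):
    """Encode a block number plus its initialized state in one 32-bit entry."""
    if not 0 <= block_nr <= BLOCK_MASK:
        raise ValueError("block number does not fit in 31 bits")
    return block_nr if initialized else block_nr | UNINIT_FLAG


def unpack(entry):
    """Return (block_nr, initialized) for an entry produced by pack()."""
    return entry & BLOCK_MASK, not (entry & UNINIT_FLAG)


def max_fs_bytes(block_size, flag_bits=0):
    """Largest addressable filesystem for a given block size."""
    return (1 << (BLOCK_NR_BITS - flag_bits)) * block_size


if __name__ == "__main__":
    TiB = 1 << 40
    print(unpack(pack(12345, initialized=False)))
    print(max_fs_bytes(4096) // TiB)               # all 32 bits used for addressing
    print(max_fs_bytes(4096, flag_bits=1) // TiB)  # one bit stolen: half the size
    print(max_fs_bytes(8192, flag_bits=1) // TiB)  # doubling the block size recovers it
```

With 4 KiB blocks the limit drops from 16 TiB to 8 TiB once one bit is
taken, and an 8 KiB block size brings it back to 16 TiB, which is why
support for block sizes larger than PAGE_SIZE matters to the proposal.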