Message-ID: <20140114012121.GF1214@kvack.org>
Date: Mon, 13 Jan 2014 20:21:21 -0500
From: Benjamin LaHaise <bcrl@...ck.org>
To: Theodore Ts'o <tytso@....edu>
Cc: Andreas Dilger <adilger@...ger.ca>,
Ext4 Developers List <linux-ext4@...r.kernel.org>
Subject: Re: high write latency bug in ext3 / jbd in 3.4

On Mon, Jan 13, 2014 at 05:52:19PM -0500, Theodore Ts'o wrote:
> On Mon, Jan 13, 2014 at 04:16:10PM -0500, Benjamin LaHaise wrote:
> >
> > I'm leaning towards doing this. The main reason for not doing so was
> > primarily that a few of the tweaks that I had made to ext3 would
> > have to be ported to ext4. Thankfully, I think we're still in an early
> > enough stage of release that I should be able to do so. The changes
> > are pretty specific, mostly allocator tweaks to improve the on-disk
> > layout for our specific use-case.
>
> We have been thinking about making some changes to the block
> allocator, so I'd be interested in hearing what tweaks you made and a
> bit more about your use case that drove the need for these allocator
> tweaks.

The main layout tweak is pretty specific to the ext2/3 style indirect /
double indirect block usage: instead of placing the ind/dind/tind blocks
throughout the file, they are all placed immediately before the first
data block at fallocate() time. With that change in place, all of the
metadata blocks are then read at the same time the first page of the file is
read. The reason for doing this is that our spoolfiles have a header at
the beginning of the file that must always be read before we can find where
the data needed from the file is. By pulling in the metadata at the same
time as the first data block, the number of seeks to get data elsewhere in
the file is reduced (as some requests are essentially random). It also has
a nice side effect of speeding up unlink and fsck times.
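
To give a rough sense of how much mapping metadata this groups together,
here is a minimal sketch (not the actual allocator patch) that counts the
ind/dind/tind blocks needed by an ext2/3 style file. It relies on the
standard ext2/3 layout of 12 direct pointers per inode and 4-byte block
numbers; the 4 KiB block size is an assumption, and the function name
indirect_block_count is hypothetical.

```python
def indirect_block_count(data_blocks, block_size=4096):
    """Count ind/dind/tind mapping blocks for an ext2/3 style file.

    Assumes 12 direct pointers in the inode and 4-byte block numbers;
    block_size=4096 is an illustrative default, not taken from the thread.
    """
    ptrs = block_size // 4           # block pointers held by one mapping block
    remaining = max(0, data_blocks - 12)
    total = 0

    # Single indirect: one block maps up to `ptrs` data blocks.
    if remaining == 0:
        return total
    total += 1
    remaining -= min(remaining, ptrs)

    # Double indirect: one dind block plus one ind block per `ptrs` data blocks.
    if remaining == 0:
        return total
    covered = min(remaining, ptrs * ptrs)
    total += 1 + -(-covered // ptrs)  # -(-a // b) is ceiling division
    remaining -= covered

    # Triple indirect: one tind block, plus the dind and ind blocks below it.
    if remaining == 0:
        return total
    covered = min(remaining, ptrs ** 3)
    ind = -(-covered // ptrs)
    dind = -(-ind // ptrs)
    total += 1 + dind + ind
    remaining -= covered

    if remaining:
        raise ValueError("file too large for ext2/3 block mapping")
    return total


if __name__ == "__main__":
    blocks_1gib = (1 << 30) // 4096
    print(indirect_block_count(blocks_1gib))
```

Under those assumptions a 1 GiB file needs 257 mapping blocks, roughly
1 MiB of metadata, which is small enough to pull in alongside the first
data block in one contiguous read.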

The other allocator change, which is more relevant to ext4, is to not use
Orlov on subdirectories of the filesystem. There is a notable performance
difference when inodes are spread out across the filesystem. Our usage
pattern tends to be somewhat close to FIFO for the files written and later
read & deleted.

There are some other bits I plan to post shortly as well, including a fully
async implementation of readpage for use with ext2/3 style metadata. It was
necessary to make async reads fully non-blocking in order to hit the
performance targets, as switching to helper threads incurred a significant
amount of overhead compared to having aio completions from the interrupt
handler of the block device. I also did async read and readahead
implementations tied into aio. Development on the release I'm working on
is mostly done now, so I should have the time over the next few weeks to
clean up and merge these changes to 3.13.

> > I had hoped to use ext4, but the recommended fsck after changing the
> > various feature bits is a non-starter during our upgrade process (a 22
> > minute outage isn't acceptable).
>
> You can move to ext4 without necessarily using those features which
> require an fsck after the upgrade process. That's how we handled the
> upgrade to ext4 at Google. New disks were formatted using ext4, but
> for legacy file systems, we enabled the extents feature (maybe one or two
> other ones, but that was the main one) and then remounted those file
> systems using ext4. We called file systems which were upgraded in
> this way "ext2-as-ext4", and our benchmarking indicated that, for our
> workload, "ext2-as-ext4" got roughly half the performance gain seen
> when comparing file systems still using ext2 with newly formatted file
> systems using ext4.

Another reason for not being able to migrate to extents is that it breaks
the ability of our system to be downgraded smoothly. The previous kernel
being used was of 2.6.18 vintage, so this is the first version of our
product that supports using ext4. There were also concerns about testing
both the extent and non-extent code paths -- regression tests take
months to complete, so adding a times 2 multiplier to everything is a hard
sell.

> Given that file systems on a server got reformatted when it needs some
> kind of hardware repairs, between hardware refresh and disks getting
> reformatted as part of the refresh, the percentage of file systems
> running in "ext2-as-ext4" dropped fairly quickly.

Our filesystems are, unfortunately, rather long-lived.

> Mike Rubin gave a presentation about this two years ago at the LF
> Collab Summit that went into a lot more detail about how ext4 was
> adopted by Google. That presentation is available here:
>
> http://www.youtube.com/watch?v=Wp5Ehw7ByuU

Thanks -- I'll pass that along to folks here at Solace.
-ben
> Cheers,
>
> - Ted
--
"Thought is the essence of where you are now."