Message-ID: <20140730144928.GA10295@kvack.org>
Date: Wed, 30 Jul 2014 10:49:28 -0400
From: Benjamin LaHaise <bcrl@...ck.org>
To: Andreas Dilger <adilger@...ger.ca>
Cc: Theodore Ts'o <tytso@....edu>,
"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>
Subject: Re: ext4: first write to large ext3 filesystem takes 96 seconds
Hi Andreas, Ted,
I've finally had some more time to dig into this problem, and it's worse
than I initially thought: it also occurs on native ext4 filesystems.
On Mon, Jul 07, 2014 at 11:11:58PM -0600, Andreas Dilger wrote:
...
> The main problem here is that reading all of the block bitmaps takes
> a huge amount of time for a large filesystem.
Very true.
...
>
> 7.8TB / 128MB/group ~= 8000 groups
> 8000 bitmaps / 100 seeks/sec = 80s
>
> So that is what is making things slow. Once the allocator has all the
> blocks in memory there are no problems. There are some heuristics
> to skip bitmaps that are totally full, but they don't work in your case.
>
> This is why the flex_bg feature was created - to allow the bitmaps
> to be read from disk without seeks. This also speeds up e2fsck by
> the same 96s that would otherwise be wasted waiting for the disk.
Unfortunately, that isn't the case.
> Backporting flex_bg to ext3 would be fairly trivial - just disable the checks
> for the location of the bitmaps at mount time. However, using it
> requires that you reformat your filesystem with "-O flex_bg" to
> get the improved layout.
flex_bg is not sufficient to resolve this issue. Using a native ext4
formatted filesystem initialized with mke4fs 1.41.12, this problem still
occurs. I created a 7.1TB filesystem and filled it to about 92% with
8MB files. The time to create a new 8MB file after a fresh mount ranges
from 0.017 seconds to 13.2 seconds. The outliers correlate with bitmaps
being read from disk. A copy of /proc/fs/ext4/dm-2/mb_groups from this
92% full fs is available at http://www.kvack.org/~bcrl/mb_groups.ext4-92
Note that it isn't necessarily the first allocating write to the filesystem
that shows the worst timing; it can end up being the 10th or even the
100th attempt.
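For concreteness, the measurement is roughly of this form (an illustrative
sketch only, not the actual test harness; the file name and chunk size are
placeholders):

/* Sketch: time how long it takes to create and write out one 8MB file. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define FILE_SIZE (8L * 1024 * 1024)
#define CHUNK     (1024 * 1024)

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "testfile";
	static char buf[CHUNK];
	struct timespec t0, t1;
	long written;
	int fd;

	memset(buf, 0xaa, sizeof(buf));

	clock_gettime(CLOCK_MONOTONIC, &t0);
	fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (written = 0; written < FILE_SIZE; written += CHUNK) {
		if (write(fd, buf, CHUNK) != CHUNK) {
			perror("write");
			return 1;
		}
	}
	fsync(fd);		/* force the allocation out to disk */
	close(fd);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%s: %.3f seconds\n", path,
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}

With delayed allocation the block allocation (and hence the bitmap reads)
happens at writeback time, which is why the fsync() is inside the timed
region.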
> The other option (if your runtime environment allows it) is to prefetch
> the block bitmaps using "dumpe2fs /dev/XXX > /dev/null" before the
> filesystem is in use. This still takes 90s, but can be started early in
> the boot process on each disk in parallel.
That isn't a solution. Prefetching is impossible in my particular use-case,
as the filesystem is being mounted after a failover from another node --
any data prefetched prior to switching active nodes is not guaranteed to be
valid.
This seems like a pretty serious regression relative to ext3. Why can't
ext4's mballoc pick better block groups to allocate from, based on the
free block counts in the block group summaries?
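To make that concrete, the kind of selection I have in mind is roughly the
following (a standalone illustrative sketch, not actual mballoc code; the
structure and function names are made up):

/* Illustrative sketch (not actual ext4/mballoc code): choose an allocation
 * group using only the per-group free block summaries, so a block bitmap
 * only has to be read for a group that can plausibly satisfy the request. */
#include <stdio.h>

struct group_summary {
	unsigned int free_blocks;	/* e.g. from the group descriptor */
	int bitmap_cached;		/* bitmap already in memory? */
};

/* Prefer a group whose bitmap is already cached; otherwise take the first
 * group whose summary says it has enough free blocks for the request. */
static int pick_group(const struct group_summary *g, int ngroups,
		      unsigned int request)
{
	int fallback = -1;
	int i;

	for (i = 0; i < ngroups; i++) {
		if (g[i].free_blocks < request)
			continue;
		if (g[i].bitmap_cached)
			return i;		/* no disk read needed */
		if (fallback < 0)
			fallback = i;		/* best uncached candidate */
	}
	return fallback;	/* -1: no group can hold the whole request */
}

int main(void)
{
	struct group_summary groups[] = {
		{ .free_blocks = 12,   .bitmap_cached = 0 },
		{ .free_blocks = 4096, .bitmap_cached = 0 },
		{ .free_blocks = 8192, .bitmap_cached = 1 },
	};

	printf("picked group %d\n",
	       pick_group(groups, 3, 2048));	/* prints "picked group 2" */
	return 0;
}

The group descriptors are already read in at mount time, so a candidate
group could be chosen this way without touching any on-disk bitmap until
an allocation is actually attempted in it.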
-ben
--
"Thought is the essence of where you are now."
--