Message-ID: <20140730144928.GA10295@kvack.org>
Date:	Wed, 30 Jul 2014 10:49:28 -0400
From:	Benjamin LaHaise <bcrl@...ck.org>
To:	Andreas Dilger <adilger@...ger.ca>
Cc:	Theodore Ts'o <tytso@....edu>,
	"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>
Subject: Re: ext4: first write to large ext3 filesystem takes 96 seconds

Hi Andreas, Ted,

I've finally had some more time to dig into this problem, and it's worse 
than I initially thought: it also occurs on normal ext4 filesystems.

On Mon, Jul 07, 2014 at 11:11:58PM -0600, Andreas Dilger wrote:
...
> The main problem here is that reading all of the block bitmaps takes
> a huge amount of time for a large filesystem.

Very true.
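
(As an aside, the number of bitmap reads involved can be read straight 
off the superblock.  Here's a rough userspace sketch of that; it assumes 
a little-endian host and only looks at the 32-bit fields, and the device 
path is just an example.)

/* Sketch: count the block groups (== block bitmaps the allocator may
 * have to read) from the primary ext2/3/4 superblock at offset 1024.
 * Little-endian host assumed; low 32-bit block count only. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/dm-2";  /* example path */
	unsigned char sb[1024];
	FILE *f = fopen(dev, "rb");

	if (!f || fseek(f, 1024, SEEK_SET) != 0 ||
	    fread(sb, 1, sizeof(sb), f) != sizeof(sb)) {
		perror(dev);
		return 1;
	}

	uint32_t blocks_count, first_data_block, log_block_size, blocks_per_group;
	memcpy(&blocks_count,     sb + 4,  4);   /* s_blocks_count_lo  */
	memcpy(&first_data_block, sb + 20, 4);   /* s_first_data_block */
	memcpy(&log_block_size,   sb + 24, 4);   /* s_log_block_size   */
	memcpy(&blocks_per_group, sb + 32, 4);   /* s_blocks_per_group */

	uint64_t groups = ((uint64_t)blocks_count - first_data_block +
			   blocks_per_group - 1) / blocks_per_group;

	printf("block size: %u\n", 1024u << log_block_size);
	printf("block groups / bitmaps: %llu\n", (unsigned long long)groups);
	fclose(f);
	return 0;
}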

...
> 
> 7.8TB / 128MB/group ~= 8000 groups
> 8000 bitmaps / 100 seeks/sec = 80s
> 
> So that is what is making things slow. Once the allocator has all the
> blocks in memory there are no problems. There are some heuristics
> to skip bitmaps that are totally full, but they don't work in your case. 
> 
> This is why the flex_bg feature was created - to allow the bitmaps
> to be read from disk without seeks.  This also speeds up e2fsck by
> the same 96s that would otherwise be wasted waiting for the disk.

Unfortunately, that isn't the case.

> Backporting flex_bg to ext3 would be fairly trivial - just disable the checks
> for the location of the bitmaps at mount time. However, using it
> requires that you reformat your filesystem with "-O flex_bg" to
> get the improved layout. 

flex_bg is not sufficient to resolve this issue.  The problem still 
occurs on a native ext4 filesystem formatted with mke4fs 1.41.12.  I 
created a 7.1TB filesystem and filled it to about 92% with 8MB files.  
The time to create a new 8MB file after a fresh mount ranges from 
0.017 seconds to 13.2 seconds.  The outliers correlate with bitmaps 
being read from disk.  A copy of /proc/fs/ext4/dm-2/mb_groups from this 
92% full fs is available at http://www.kvack.org/~bcrl/mb_groups.ext4-92 
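
For anyone who wants to reproduce this, the measurement I have in mind 
is along the lines of the sketch below: time a single open/write/fsync 
of an 8MB file right after mounting (the test path is just an example).

/* Sketch: time the creation of one 8MB file (open + write + fsync).
 * Run right after a fresh mount; the path is an example. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

int main(void)
{
	static char buf[8 << 20];		/* 8MB of data */
	struct timespec t0, t1;
	int fd;

	memset(buf, 0xaa, sizeof(buf));

	clock_gettime(CLOCK_MONOTONIC, &t0);
	fd = open("/mnt/test/newfile", O_CREAT | O_WRONLY | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf) ||
	    fsync(fd) != 0) {
		perror("write/fsync");
		return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("8MB create+write+fsync: %.3f seconds\n",
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	close(fd);
	return 0;
}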

Note that it isn't necessarily the first allocating write to the 
filesystem that is the worst in terms of timing; it can end up being 
the 10th or even the 100th attempt.

> The other option (if your runtime environment allows it) is to prefetch
> the block bitmaps using "dumpe2fs /dev/XXX > /dev/null" before the
> filesystem is in use. This still takes 90s, but can be started early in
> the boot process on each disk in parallel.

That isn't a solution.  Prefetching is impossible in my particular use-case, 
as the filesystem is being mounted after a failover from another node -- 
any data prefetched prior to switching active nodes is not guaranteed to be 
valid.

This seems like a pretty serious regression relative to ext3.  Why can't 
ext4's mballoc pick better block groups to attempt allocating from based 
on the free block counts in the block group summaries?
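
To make the question concrete, here's a toy userspace illustration of 
the policy I mean (this is not mballoc code, just a sketch): choose a 
candidate group from the per-group free block counts alone, and only 
then read that one group's bitmap.

/* Toy illustration only: the group descriptors already carry per-group
 * free block counts, so a candidate group can be picked from those
 * summaries without touching any block bitmap on disk. */
#include <stdint.h>
#include <stdio.h>

struct group_summary {
	uint32_t free_blocks;	/* from the group descriptor, no bitmap I/O */
};

/* Return the first group whose summary says it can hold the request;
 * only that group's bitmap would then need to be read. */
static int pick_group(const struct group_summary *g, int ngroups,
		      uint32_t want)
{
	for (int i = 0; i < ngroups; i++)
		if (g[i].free_blocks >= want)
			return i;
	return -1;
}

int main(void)
{
	/* made-up summaries for four groups */
	struct group_summary groups[] = { { 0 }, { 12 }, { 512 }, { 32768 } };
	int g = pick_group(groups, 4, 2048);	/* 8MB at 4K blocks */

	printf("candidate group: %d\n", g);	/* prints 3 */
	return 0;
}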

		-ben
-- 
"Thought is the essence of where you are now."
