Message-ID: <20140811205019.GB1695@birch.djwong.org>
Date: Mon, 11 Aug 2014 13:50:19 -0700
From: "Darrick J. Wong" <darrick.wong@...cle.com>
To: "Theodore Ts'o" <tytso@....edu>
Cc: linux-ext4@...r.kernel.org
Subject: Re: [PATCH 5/6] libext2fs/e2fsck: provide routines to read-ahead
metadata
On Mon, Aug 11, 2014 at 04:10:30PM -0400, Theodore Ts'o wrote:
> On Mon, Aug 11, 2014 at 11:55:32AM -0700, Darrick J. Wong wrote:
> > I was expecting 16 groups (32M readahead) to win, but as the observations in my
> > spreadsheet show, 2MB tends to win. I _think_ the reason is that if we
> > encounter indirect map blocks or ETB blocks, they tend to be fairly close to
> > the file blocks in the block group, and if we're trying to do a large readahead
> > at the same time, we end up with a largeish seek penalty (half the flexbg on
> > average) for every ETB/map block.
>
> Hmm, that might be an argument for not trying to increase the flex_bg
> size, since we want to keep seek distances within a flex_bg to be
> dominated by settling time, and not by the track-to-track
> acceleration/coasting/deceleration time.
It might not be too horrible of a regression, since the distance between tracks
has gotten shorter and cylinders themselves have gotten bigger. I suppose
you'd have to test a variety of flexbg sizes against a disk from, say, 5 years
ago. If you know the size of the files you'll be storing at mkfs time (such as
with the mk_hugefiles.c options), then increasing the flexbg size to avoid
fragmentation is probably ok.
But yes, I was sort of enjoying how stuff within a flexbg gets (sort of) faster
as disks get bigger. :)
> > I figured out what was going on with the 1TB SSD -- it has a huge RAM cache big
> > enough to store most of the metadata. At that point, reads are essentially
> > free, but readahead costs us ~1ms per fadvise call.
>
> Do we understand why fadvise() takes 1ms? Is that something we can fix?
>
> And readahead(2) was even worse, right?
From the readahead(2) manpage:
"readahead() blocks until the specified data has been read."
The fadvise time is pretty consistently 1ms, but with readahead you have to
wait for it to read everything off the disk. That's fine for threaded
readahead, but for our single-thread readahead it's not much better than
regular blocking reads. Letting the kernel do the readahead in the background
is way faster.
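Roughly, the difference looks like this (a minimal standalone sketch, not the
e2fsck code; the file argument and 2MB window size are just placeholders):

/* Hypothetical timing test: compare a POSIX_FADV_WILLNEED hint, which
 * returns once the readahead is queued, against readahead(2), which
 * blocks until the requested range is in the page cache. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double elapsed_ms(struct timespec a, struct timespec b)
{
	return (b.tv_sec - a.tv_sec) * 1000.0 +
	       (b.tv_nsec - a.tv_nsec) / 1000000.0;
}

int main(int argc, char **argv)
{
	struct timespec start, end;
	size_t len = 2 * 1024 * 1024;	/* 2MB window, as in the tests above */
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <device-or-file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Async hint: returns as soon as the readahead is kicked off. */
	clock_gettime(CLOCK_MONOTONIC, &start);
	posix_fadvise(fd, 0, len, POSIX_FADV_WILLNEED);
	clock_gettime(CLOCK_MONOTONIC, &end);
	printf("fadvise(WILLNEED): %.3f ms\n", elapsed_ms(start, end));

	/* Blocking: doesn't return until the data is in the page cache. */
	clock_gettime(CLOCK_MONOTONIC, &start);
	readahead(fd, len, len);   /* a different 2MB window, so the fadvise
				    * above doesn't satisfy it */
	clock_gettime(CLOCK_MONOTONIC, &end);
	printf("readahead(2):      %.3f ms\n", elapsed_ms(start, end));

	close(fd);
	return 0;
}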
I don't know why fadvise takes so long. I'll ftrace it to see where it goes.
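Something along these lines (a rough sketch; it assumes debugfs is mounted at
/sys/kernel/debug, root privileges, and a kernel built with the function_graph
tracer) ought to show where the time goes:

/* Sketch: bracket a single WILLNEED hint with the function_graph tracer
 * via the tracing debugfs files, then read .../tracing/trace afterwards.
 * Paths and tracer availability vary by kernel config. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void trace_write(const char *file, const char *val)
{
	char path[256];
	int fd;

	snprintf(path, sizeof(path), "/sys/kernel/debug/tracing/%s", file);
	fd = open(path, O_WRONLY);
	if (fd < 0) {
		perror(path);
		return;
	}
	if (write(fd, val, strlen(val)) < 0)
		perror(path);
	close(fd);
}

int main(int argc, char **argv)
{
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <device-or-file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);

	trace_write("current_tracer", "function_graph");
	trace_write("trace", "\n");		/* clear the trace buffer */
	trace_write("tracing_on", "1");

	posix_fadvise(fd, 0, 2 * 1024 * 1024, POSIX_FADV_WILLNEED);

	trace_write("tracing_on", "0");
	/* Now dump /sys/kernel/debug/tracing/trace to see the call graph. */
	close(fd);
	return 0;
}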
--D
>
> - Ted