Message-ID: <20140811205019.GB1695@birch.djwong.org>
Date:	Mon, 11 Aug 2014 13:50:19 -0700
From:	"Darrick J. Wong" <darrick.wong@...cle.com>
To:	"Theodore Ts'o" <tytso@....edu>
Cc:	linux-ext4@...r.kernel.org
Subject: Re: [PATCH 5/6] libext2fs/e2fsck: provide routines to read-ahead
 metadata

On Mon, Aug 11, 2014 at 04:10:30PM -0400, Theodore Ts'o wrote:
> On Mon, Aug 11, 2014 at 11:55:32AM -0700, Darrick J. Wong wrote:
> > I was expecting 16 groups (32M readahead) to win, but as the observations in my
> > spreadsheet show, 2MB tends to win.  I _think_ the reason is that if we
> > encounter indirect map blocks or ETB blocks, they tend to be fairly close to
> > the file blocks in the block group, and if we're trying to do a large readahead
> > at the same time, we end up with a largeish seek penalty (half the flexbg on
> > average) for every ETB/map block.
> 
> Hmm, that might be an argument for not trying to increase the flex_bg
> size, since we want to keep seek distances within a flex_bg to be
> dominated by settling time, and not by the track-to-track
> acceleration/coasting/deceleration time.

It might not be too horrible a regression, since the distance between tracks
has gotten shorter and cylinders themselves have gotten bigger.  I suppose
you'd have to test a variety of flexbg sizes against a disk from, say, 5 years
ago.  If you know the size of the files you'll be storing at mkfs time (such as
with the mk_hugefiles.c options), then increasing the flexbg size to avoid
fragmentation is probably ok.

But yes, I was sort of enjoying how stuff within a flexbg gets (sort of) faster
as disks get bigger. :)

> > I figured out what was going on with the 1TB SSD -- it has a huge RAM cache big
> > enough to store most of the metadata.  At that point, reads are essentially
> > free, but readahead costs us ~1ms per fadvise call. 
> 
> Do we understand why fadvise() takes 1ms?   Is that something we can fix?
> 
> And readahead(2) was even worse, right?

From the readahead(2) manpage:

"readahead() blocks until the specified data has been read."

The fadvise time is pretty consistently ~1ms, but with readahead(2) you have
to wait for it to read everything off the disk.  That's fine for threaded
readahead, but for our single-threaded readahead it's not much better than
regular blocking reads.  Letting the kernel do the readahead in the background
is way faster.
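
For what it's worth, the difference boils down to something like this -- a
standalone sketch against an arbitrary file or device, not the actual
libext2fs/e2fsck routines from this patch:

/*
 * Sketch: prefetch two 2MB windows of a file/device two ways.
 * posix_fadvise(POSIX_FADV_WILLNEED) queues the readahead and returns
 * right away; readahead(2) blocks until the data has actually been read.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define WINDOW (2 << 20)	/* 2MB, the window size that won in the tests */

int main(int argc, char **argv)
{
	int fd, err;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file-or-device>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * Asynchronous: queue readahead for the first window and return
	 * almost immediately; the kernel fills the page cache in the
	 * background while we go do other work.
	 */
	err = posix_fadvise(fd, 0, WINDOW, POSIX_FADV_WILLNEED);
	if (err)
		fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

	/*
	 * Synchronous: readahead(2) does not return until the second
	 * window has been read off the disk, so a single-threaded caller
	 * just sits there waiting for the I/O.
	 */
	if (readahead(fd, WINDOW, WINDOW) < 0)
		perror("readahead");

	close(fd);
	return 0;
}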

I don't know why fadvise takes so long.  I'll ftrace it to see where it goes.
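
As a quick userspace sanity check before the ftrace run, something like the
following (just timing posix_fadvise() with clock_gettime(); an assumed
harness, not a substitute for the trace data) should confirm the ~1ms figure:

/*
 * Rough userspace measurement of the per-call fadvise cost; ftrace would
 * show where the time actually goes inside the kernel.
 */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct timespec start, end;
	int fd, i;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file-or-device>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Issue 100 WILLNEED hints, 2MB apart, and report the mean latency. */
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < 100; i++)
		posix_fadvise(fd, (off_t)i * (2 << 20), 2 << 20,
			      POSIX_FADV_WILLNEED);
	clock_gettime(CLOCK_MONOTONIC, &end);

	double elapsed = (end.tv_sec - start.tv_sec) +
			 (end.tv_nsec - start.tv_nsec) / 1e9;
	printf("mean fadvise latency: %.3f ms\n", elapsed * 1000.0 / 100);

	close(fd);
	return 0;
}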

--D
> 
> 							- Ted