linux-ext4 - Re: Threaded readahead strawman

Message-ID: <470E603A.2080203@clusterfs.com>
Date:	Thu, 11 Oct 2007 20:41:14 +0300
From:	"Vladimir V. Saveliev" <vs@...sterfs.com>
To:	Andreas Dilger <adilger@...sterfs.com>
CC:	Valerie Henson <val.henson@...il.com>,
	Theodore Ts'o <tytso@....edu>, Ric Wheeler <ric@....com>,
	linux-ext4 <linux-ext4@...r.kernel.org>
Subject: Re: Threaded readahead strawman

Hello

Andreas Dilger wrote:
> On Oct 10, 2007  20:09 -0700, Valerie Henson wrote:
>> I need to get started on a mergeable version of the threaded readahead
>> patch for e2fsck.  I intend for it to be compatible with Andreas'
>> sys_readahead() for block devices that support it.  Here's a first
>> draft proposal - your thoughts?  Note that it's not really that
>> anything is being read *ahead* per se, but that it's being read
>> simultaneously.  Single-threaded readahead doesn't go any faster.
> 
> We've been fiddling with this as well.  I'd attach some patches but
> bugzilla is down as I write this :(.  I also asked Vladimir (working on
> these patches) to forward them to you and the linux-ext4 mailing list.
> 

The patch is attached.

If an application can foresee what it is going to read, it can call
io_channel_readahead for that data beforehand. Even if
io_channel_readahead is called right before the data is actually needed,
it may have a positive effect on multi-disk devices because of parallel
reading.

For example, using io_channel_readahead to read ahead the upcoming inode
tables in the done_group callback of ext2_inode_scan reduces the inode
table scan in my local quick test from 34 seconds to 26 (on a RAID0 of
two IDE disks).
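
As a rough illustration of the hint-then-read pattern described above, here
is a minimal sketch. It assumes the readahead is issued with
posix_fadvise(POSIX_FADV_WILLNEED), which is how the quoted text below says
the io_manager method is currently done. The helper names hint_blocks and
read_blocks are hypothetical and are not the io_channel_readahead code from
the attached patch.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600   /* exposes posix_fadvise() and pread() */
#endif
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to start fetching `count` blocks
 * beginning at `block`.  The call is advisory and returns without waiting
 * for the I/O.  Returns 0 on success or an error number on failure, as
 * posix_fadvise() does.
 */
static int hint_blocks(int fd, unsigned long long block,
                       unsigned int count, unsigned int blocksize)
{
    off_t offset = (off_t)block * blocksize;
    off_t length = (off_t)count * blocksize;

    return posix_fadvise(fd, offset, length, POSIX_FADV_WILLNEED);
}

/*
 * Hypothetical helper: the ordinary blocking read that happens later.
 * If the earlier hint was honoured, the data may already be in the
 * page cache by the time this runs.
 */
static ssize_t read_blocks(int fd, unsigned long long block,
                           unsigned int count, unsigned int blocksize,
                           void *buf)
{
    return pread(fd, buf, (size_t)count * blocksize,
                 (off_t)block * blocksize);
}
```

A caller would hint the next group's inode table with hint_blocks(), keep
processing the current group, and only then call read_blocks() for the
hinted range.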

> We added a "readahead" method to the io_manager interface (no-op for
> Win/DOS) that can be used generically.  This is currently done via
> posix_fadvise(POSIX_FADV_WILLNEED).  We haven't done any multi-threading
> yet, but there is some hope that the block layer could sort it out?
> It would still be beneficial to have multiple user-space threads do
> the reading of the data, to get parallel memcpy() into userspace.
> 
>> The major global parameters to the system are:
>>
>> 1. Optimal number of concurrent requests - number of underlying read
>> heads times some N of best number of outstanding requests.  Default to
>> one.
>>
>> 2. Stripe size, or more generally which areas can be read concurrently
>> and which cannot.
> 
> There are new parameters in the superblock (s_raid_stride and
> s_raid_stripe_width) but as yet only s_raid_stride is initialized by
> mke2fs.  There is a library in xfstools (libdisk or somesuch) that
> can get a lot more disk geometry info and it would be good to leverage
> that for mke2fs also.
> 
>> 3. Maximum memory to use.  We have to keep the readahead from
>> outrunning the actual processing (though so far, that hasn't been a
>> problem) and having bits of our buffer cache kicked out before they
>> are used.  This can be set to some percentage of available memory by
>> default.
> 
> Agreed.  I'd proposed in the past that fsck could call fsck.{fstype}
> with a parameter like --expected-memory to determine the expected memory
> usage of fsck.{fstype} based on the filesystem geometry, and it could
> also supply --max-memory so we don't have parallel fscks stomping on
> each other.
> 
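
To make the interaction of the three parameters above concrete, here is a
small sketch of how they might bound the reads in flight. The struct and
function names (ra_params, ra_window, ra_chunk_len) are hypothetical and
are not taken from e2fsprogs or from any patch in this thread. The policy
shown (one request per stripe, capped by both limits) is only one possible
choice.

```c
#include <stddef.h>

/* Hypothetical tunables mirroring the three parameters listed above. */
struct ra_params {
    unsigned int concurrent_requests; /* 1: reads allowed in flight */
    size_t stripe_size;               /* 2: bytes per stripe unit */
    size_t max_memory;                /* 3: cap on read-ahead bytes */
};

/*
 * Number of stripe-sized requests that may be outstanding at once.
 * A concurrency of 0 is treated as the default of one, and the result
 * is never below 1 so the scan always makes progress (which means a
 * single request may exceed max_memory when max_memory < stripe_size).
 */
static unsigned int ra_window(const struct ra_params *p)
{
    unsigned int limit = p->concurrent_requests ? p->concurrent_requests : 1;
    size_t by_memory;

    if (p->stripe_size == 0)
        return limit;
    by_memory = p->max_memory / p->stripe_size;
    if (by_memory == 0)
        return 1;
    return by_memory < limit ? (unsigned int)by_memory : limit;
}

/*
 * Length of the request starting at `offset` that stays inside one
 * stripe unit and does not run past `end`, so that consecutive
 * requests land on different stripe units and can proceed in parallel.
 */
static unsigned long long ra_chunk_len(const struct ra_params *p,
                                       unsigned long long offset,
                                       unsigned long long end)
{
    unsigned long long stripe_end;

    if (offset >= end)
        return 0;
    if (p->stripe_size == 0)
        return end - offset;
    stripe_end = (offset / p->stripe_size + 1) * p->stripe_size;
    return (stripe_end < end ? stripe_end : end) - offset;
}
```

With a 64 KiB stripe, four concurrent requests and a 128 KiB memory cap,
the memory limit is the tighter one and only two requests would be
outstanding at a time.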
>> I see two main ways to do this: One is a straightforward offset plus
>> size, telling it what to read.  The other is to make libext2 do all
>> the interpretation of ondisk format, and design the interface in terms
>> of kinds of metadata to read.  libext2 functions like
>> ext2fs_get_next_inode_full() should be aware of what's going on in
>> readahead, which argues for a metadata-aware, in-library
>> implementation.  Something like:
>>
>> /* Creates the threads, sets some variables.  Returns a handle. */
>> handle = ext2fs_readahead_init(concurrent_requests, stripe_size, max_memory);
>>
>> /* Readahead inode tables and inode indirect blocks - can't really be
>> separated */
>> ext2fs_readahead_inodes(handle, fs);
> 
> Well, there's something to be said for allowing the inode tables and
> corresponding bitmaps to be read in a single shot.  Also, not all users
> require the indirect blocks, so I would make that an option.
> 
>> /* Read the directory block list (pass 2) */
>> ext2fs_readahead_dblist(handle, fs);
> 
> We're working on this as part of e2scan (in bug 13108 above), not sure if
> there is a patch available or not.
> 
>> /* Read bitmaps (pass 5) */
>> ext2fs_readahead_bitmaps(handle, fs);
> 
> This is a big one, because of the many seeks for small data read.  Using
> the FLEX_BG feature (which is really a tiny kernel patch) could improve
> this many times.
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
> 
> 
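
The suggestion above to read inode tables and their bitmaps in a single
shot, with indirect blocks as an option, could be expressed as a flags
argument. The sketch below is only an illustration: every name in it
(RA_*, ra_group, ra_extent, ra_group_extents) is hypothetical and none of
it is an existing libext2fs interface. It builds the list of block ranges
to hint for one group and merges ranges that happen to be adjacent on disk.

```c
/* Hypothetical flags selecting which metadata to read ahead. */
#define RA_BITMAPS         0x01u  /* block and inode bitmaps */
#define RA_INODE_TABLES    0x02u  /* inode table blocks */
#define RA_INDIRECT_BLOCKS 0x04u  /* optional; needs parsed inodes, so it
                                     is not handled in this sketch */

/* Hypothetical per-group layout, in filesystem blocks. */
struct ra_group {
    unsigned long long block_bitmap;
    unsigned long long inode_bitmap;
    unsigned long long inode_table;
    unsigned int inode_table_blocks;
};

struct ra_extent {
    unsigned long long start;
    unsigned int count;
};

/* Append a range, extending the previous one if the new range abuts it. */
static int ra_add(struct ra_extent *out, int n,
                  unsigned long long start, unsigned int count)
{
    if (count == 0)
        return n;
    if (n > 0 && out[n - 1].start + out[n - 1].count == start) {
        out[n - 1].count += count;
        return n;
    }
    out[n].start = start;
    out[n].count = count;
    return n + 1;
}

/*
 * Fill `out` (room for 3 entries) with the ranges to read ahead for one
 * group and return how many entries were written.  When the bitmaps sit
 * directly in front of the inode table, the result is a single range.
 */
static int ra_group_extents(const struct ra_group *g, unsigned int flags,
                            struct ra_extent out[3])
{
    int n = 0;

    if (flags & RA_BITMAPS) {
        n = ra_add(out, n, g->block_bitmap, 1);
        n = ra_add(out, n, g->inode_bitmap, 1);
    }
    if (flags & RA_INODE_TABLES)
        n = ra_add(out, n, g->inode_table, g->inode_table_blocks);
    return n;
}
```

Each returned range could then be passed to whatever readahead call the
I/O layer provides. When a group's metadata is scattered, the function
returns up to three separate ranges instead of one.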


View attachment "e2fsprogs-add-io_channel_readahead.patch" of type "text/x-patch" (5137 bytes)