Message-ID: <20141210202756.GA21938@birch.djwong.org>
Date:	Wed, 10 Dec 2014 12:27:56 -0800
From:	"Darrick J. Wong" <darrick.wong@...cle.com>
To:	tytso@....edu
Cc:	linux-ext4@...r.kernel.org
Subject: Re: [PATCH 32/47] e2fsck: read-ahead metadata during passes 1, 2,
 and 4

On Fri, Nov 07, 2014 at 01:54:14PM -0800, Darrick J. Wong wrote:
> e2fsck pass1 is modified to use the block group data prefetch function
> to try to fetch the inode tables into the pagecache before they are
> needed.  We iterate through the block groups until we have enough inode
> tables that need reading to issue readahead; then we wait until the
> scan reaches the last inode table block of the last group in the batch
> before fetching the next batch.
> 
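For anyone skimming the thread, here is a rough standalone sketch of the pass1 batching rule described above.  The struct and function names are made up for illustration and are not part of the patch or of libext2fs; the arithmetic is meant to follow pass1_readahead(), except that the sketch clamps the count at the end of the group list.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-group summary; not the libext2fs group descriptor. */
struct group_info {
	int      inode_uninit;   /* nonzero if the inode table was never initialized */
	uint32_t itable_unused;  /* unused inodes at the end of the inode table */
};

/*
 * Starting at group `start`, count how many consecutive groups to hand
 * to one readahead request.  Groups are added until the inode table
 * blocks that need reading exceed a budget of readahead_kb KiB, or until
 * the groups run out.  Returns the number of groups in the batch.
 */
size_t groups_to_prefetch(const struct group_info *groups,
			  size_t ngroups, size_t start,
			  uint32_t inodes_per_group,
			  uint32_t inodes_per_block,
			  uint32_t blocksize,
			  uint64_t readahead_kb)
{
	uint64_t blocks_to_read = 0;
	size_t grp;

	if (readahead_kb == 0 || start >= ngroups)
		return 0;

	for (grp = start; grp < ngroups; grp++) {
		uint32_t in_use;

		if (groups[grp].inode_uninit)
			continue;	/* nothing to read for this group */
		in_use = inodes_per_group - groups[grp].itable_unused;
		/* Round up to whole inode table blocks. */
		blocks_to_read += (in_use + inodes_per_block - 1) /
				  inodes_per_block;
		if (blocks_to_read * blocksize > readahead_kb * 1024)
			return grp - start + 1;	/* budget exceeded: stop here */
	}
	return ngroups - start;	/* ran off the end of the group list */
}
```

With 8192 inodes per group, 16 inodes per 4KiB block, and a 2048KiB budget, one full group is exactly 2MiB and does not exceed the budget, so the batch grows to two groups before the request is issued.
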
> pass2 is modified to use the dirblock prefetching function to prefetch
> the list of directory blocks that are assembled in pass1.  We use the
> "iterate a subset of a dblist" function to avoid copying the dblist.
> Directory blocks are fetched incrementally as we walk through the
> directory block list.  In previous iterations of this patch we would
> free the directory blocks after processing, but the performance hit
> to e2fsck itself wasn't worth it.  Furthermore, it is anticipated that
> most users will then mount the FS and start using the directories, so
> they may as well remain in the page cache.
> 
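The incremental fetching in pass2 amounts to a sliding window over the dblist.  Below is a minimal sketch of that bookkeeping, assuming a caller-supplied prefetch callback; the type and function names are hypothetical, and only the 1/8 and 7/8 offsets are taken from check_dir_block2() in the patch.

```c
#include <stdint.h>

/* Hypothetical cursor state, mirroring the three counters the patch adds. */
struct ra_window {
	uint64_t list_offset;  /* index of the entry being checked */
	uint64_t ra_entries;   /* window size in entries; 0 disables readahead */
	uint64_t next_ra_off;  /* offset at which to issue the next request */
};

/* Hypothetical prefetch callback: fetch `count` entries from `start`. */
typedef int (*prefetch_fn)(uint64_t start, uint64_t count, void *arg);

/* Stand-in prefetcher for experiments: counts calls and always succeeds. */
int counting_prefetch(uint64_t start, uint64_t count, void *arg)
{
	(void)start;
	(void)count;
	(*(unsigned *)arg)++;
	return 0;
}

/*
 * Call once per directory block, before checking it.  Returns 1 if a
 * prefetch request was issued on this step, 0 otherwise.
 */
int ra_window_step(struct ra_window *w, prefetch_fn prefetch, void *arg)
{
	int issued = 0;

	if (w->ra_entries && w->list_offset >= w->next_ra_off) {
		/* Request a window starting an eighth of a window ahead. */
		if (prefetch(w->list_offset + w->ra_entries / 8,
			     w->ra_entries, arg))
			w->ra_entries = 0;	/* error: stop doing readahead */
		else
			issued = 1;
		/* Re-arm once 7/8 of this window has been walked. */
		w->next_ra_off = w->list_offset + w->ra_entries * 7 / 8;
	}
	w->list_offset++;
	return issued;
}
```

With a window of 8 entries, requests go out at offsets 0, 7, and 14, so each new request is issued slightly before the previous window is used up.
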
> pass4 is modified to prefetch the block and inode bitmaps in
> anticipation of pass 5, because pass4 is entirely CPU bound.
> 
> In general, these mechanisms can decrease fsck time by 10-40%, if the
> host system has sufficient memory and the storage system can provide a
> lot of IOPs.  Pretty much any storage system capable of handling
> multiple IOs in-flight at any time will see a fairly large performance
> boost.  (Single-issue USB mass storage disks seem to suffer badly.)

FWIW I finally got UAS working in Linux; with that hardware and the default
readahead size, I see about a 5% reduction in fsck runtime.  The same dock
plugged into a USB2 port (i.e. BOT) shows about a 2% increase in runtime.

--D

> 
> By default, the readahead buffer size will be set to the size of a block
> group's inode table (which is 2MiB for a regular ext4 FS).  The -E
> readahead_kb= option specifies the amount of memory to use for
> readahead, with zero disabling it entirely; the same setting can also
> be given in e2fsck.conf.
> 
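To spell out how the command-line and e2fsck.conf settings combine, here is a small sketch of the precedence as I read the PRS() hunk in the patch.  The function name and the use of -1 for "absent from the config file" are assumptions made for this illustration; the ~0ULL sentinel is the one the patch uses.

```c
#define RA_UNSET (~0ULL)	/* sentinel the patch uses for "not configured" */

/*
 * cmdline_kb: value from -E readahead_kb=, or RA_UNSET if not given.
 * conf_pct / conf_kb: values from e2fsck.conf, or -1 if absent.
 * Returns the readahead budget in KiB, or RA_UNSET meaning "fall back
 * to the inode-table-sized default chosen later in pass1".
 */
unsigned long long resolve_readahead_kb(unsigned long long cmdline_kb,
					int conf_pct, int conf_kb,
					unsigned long long phys_mem_kb)
{
	unsigned long long kb = cmdline_kb;

	if (kb != RA_UNSET)
		return kb;	/* command line wins; 0 disables readahead */
	if (conf_pct >= 0 && conf_pct <= 100)
		kb = phys_mem_kb * (unsigned long long)conf_pct / 100;
	if (conf_kb >= 0)
		kb = (unsigned long long)conf_kb;	/* overrides the percentage */
	if (kb != RA_UNSET && kb > phys_mem_kb)
		kb = phys_mem_kb;	/* never budget more than physical memory */
	return kb;
}
```

Note that in this reading a value given with -E is used as-is, while a value taken from e2fsck.conf is capped at the amount of physical memory.
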
> v2: Fix an off-by-one error in the pass1 readahead which made the
> readahead trigger one inode too late if the block groups are full.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@...cle.com>
> ---
>  e2fsck/e2fsck.8.in      |    7 +++++
>  e2fsck/e2fsck.conf.5.in |   15 +++++++++++
>  e2fsck/e2fsck.h         |    3 ++
>  e2fsck/pass1.c          |   65 +++++++++++++++++++++++++++++++++++++++++++++++
>  e2fsck/pass2.c          |   38 +++++++++++++++++++++++++++
>  e2fsck/pass4.c          |    9 +++++++
>  e2fsck/unix.c           |   28 ++++++++++++++++++++
>  lib/ext2fs/ext2fs.h     |    1 +
>  lib/ext2fs/inode.c      |    3 +-
>  9 files changed, 167 insertions(+), 2 deletions(-)
> 
> 
> diff --git a/e2fsck/e2fsck.8.in b/e2fsck/e2fsck.8.in
> index f5ed758..84ae50f 100644
> --- a/e2fsck/e2fsck.8.in
> +++ b/e2fsck/e2fsck.8.in
> @@ -207,6 +207,13 @@ option may prevent you from further manual data recovery.
>  .BI nodiscard
>  Do not attempt to discard free blocks and unused inode blocks. This option is
>  exactly the opposite of discard option. This is set as default.
> +.TP
> +.BI readahead_kb
> +Use this many KiB of memory to pre-fetch metadata in the hopes of reducing
> +e2fsck runtime.  By default, this is set to the size of a block group's inode
> +table (typically 2MiB on a regular ext4 filesystem); if this amount is more
> +than 1/100 of total physical memory, readahead is disabled.  Set this to zero
> +to disable readahead entirely.
>  .RE
>  .TP
>  .B \-f
> diff --git a/e2fsck/e2fsck.conf.5.in b/e2fsck/e2fsck.conf.5.in
> index 9ebfbbf..e1d0518 100644
> --- a/e2fsck/e2fsck.conf.5.in
> +++ b/e2fsck/e2fsck.conf.5.in
> @@ -205,6 +205,21 @@ of that type are squelched.  This can be useful if the console is slow
>  (i.e., connected to a serial port) and so a large amount of output could
>  end up delaying the boot process for a long time (potentially hours).
>  .TP
> +.I readahead_mem_pct
> +Use this percentage of memory to try to read in metadata blocks ahead of the
> +main e2fsck thread.  This should reduce run times, depending on the speed of
> +the underlying storage and the amount of free memory.  There is no default, but
> +see
> +.B readahead_kb
> +for more details.
> +.TP
> +.I readahead_kb
> +Use this amount of memory to read in metadata blocks ahead of the main checking
> +thread.  Setting this value to zero disables readahead entirely.  By default,
> +this is set to the size of one block group's inode table (typically 2MiB on a
> +regular ext4 filesystem); if this amount is more than 1/100th of total physical
> +memory, readahead is disabled.
> +.TP
>  .I report_features
>  If this boolean relation is true, e2fsck will print the file system
>  features as part of its verbose reporting (i.e., if the
> diff --git a/e2fsck/e2fsck.h b/e2fsck/e2fsck.h
> index 0252824..e359515 100644
> --- a/e2fsck/e2fsck.h
> +++ b/e2fsck/e2fsck.h
> @@ -378,6 +378,9 @@ struct e2fsck_struct {
>  	 */
>  	void *priv_data;
>  	ext2fs_block_bitmap block_metadata_map; /* Metadata blocks */
> +
> +	/* How much are we allowed to readahead? */
> +	unsigned long long readahead_kb;
>  };
>  
>  /* Used by the region allocation code */
> diff --git a/e2fsck/pass1.c b/e2fsck/pass1.c
> index 4cc58c4..a963849 100644
> --- a/e2fsck/pass1.c
> +++ b/e2fsck/pass1.c
> @@ -868,6 +868,60 @@ out:
>  	return 0;
>  }
>  
> +static void pass1_readahead(e2fsck_t ctx, dgrp_t *group, ext2_ino_t *next_ino)
> +{
> +	ext2_ino_t inodes_in_group = 0, inodes_per_block, inodes_per_buffer;
> +	dgrp_t start = *group, grp;
> +	blk64_t blocks_to_read = 0;
> +	errcode_t err = EXT2_ET_INVALID_ARGUMENT;
> +
> +	if (ctx->readahead_kb == 0)
> +		goto out;
> +
> +	/* Keep iterating groups until we have enough to readahead */
> +	inodes_per_block = EXT2_INODES_PER_BLOCK(ctx->fs->super);
> +	for (grp = start; grp < ctx->fs->group_desc_count; grp++) {
> +		if (ext2fs_bg_flags_test(ctx->fs, grp, EXT2_BG_INODE_UNINIT))
> +			continue;
> +		inodes_in_group = ctx->fs->super->s_inodes_per_group -
> +					ext2fs_bg_itable_unused(ctx->fs, grp);
> +		blocks_to_read += (inodes_in_group + inodes_per_block - 1) /
> +					inodes_per_block;
> +		if (blocks_to_read * ctx->fs->blocksize >
> +		    ctx->readahead_kb * 1024)
> +			break;
> +	}
> +
> +	err = e2fsck_readahead(ctx->fs, E2FSCK_READA_ITABLE, start,
> +			       grp - start + 1);
> +	if (err == EAGAIN) {
> +		ctx->readahead_kb /= 2;
> +		err = 0;
> +	}
> +
> +out:
> +	if (err) {
> +		/* Error; disable itable readahead */
> +		*group = ctx->fs->group_desc_count;
> +		*next_ino = ctx->fs->super->s_inodes_count;
> +	} else {
> +		/*
> +		 * Don't do more readahead until we've reached the first inode
> +		 * of the last inode scan buffer block for the last group.
> +		 */
> +		*group = grp + 1;
> +		inodes_per_buffer = (ctx->inode_buffer_blocks ?
> +				     ctx->inode_buffer_blocks :
> +				     EXT2_INODE_SCAN_DEFAULT_BUFFER_BLOCKS) *
> +				    ctx->fs->blocksize /
> +				    EXT2_INODE_SIZE(ctx->fs->super);
> +		inodes_in_group--;
> +		*next_ino = inodes_in_group -
> +			    (inodes_in_group % inodes_per_buffer) + 1 +
> +			    (grp * ctx->fs->super->s_inodes_per_group);
> +	}
> +}
> +
>  void e2fsck_pass1(e2fsck_t ctx)
>  {
>  	int	i;
> @@ -890,10 +944,19 @@ void e2fsck_pass1(e2fsck_t ctx)
>  	int		low_dtime_check = 1;
>  	int		inode_size;
>  	int		failed_csum = 0;
> +	ext2_ino_t	ino_threshold = 0;
> +	dgrp_t		ra_group = 0;
>  
>  	init_resource_track(&rtrack, ctx->fs->io);
>  	clear_problem_context(&pctx);
>  
> +	/* If we can do readahead, figure out how many groups to pull in. */
> +	if (!e2fsck_can_readahead(ctx->fs))
> +		ctx->readahead_kb = 0;
> +	else if (ctx->readahead_kb == ~0ULL)
> +		ctx->readahead_kb = e2fsck_guess_readahead(ctx->fs);
> +	pass1_readahead(ctx, &ra_group, &ino_threshold);
> +
>  	if (!(ctx->options & E2F_OPT_PREEN))
>  		fix_problem(ctx, PR_1_PASS_HEADER, &pctx);
>  
> @@ -1073,6 +1136,8 @@ void e2fsck_pass1(e2fsck_t ctx)
>  		old_op = ehandler_operation(_("getting next inode from scan"));
>  		pctx.errcode = ext2fs_get_next_inode_full(scan, &ino,
>  							  inode, inode_size);
> +		if (ino > ino_threshold)
> +			pass1_readahead(ctx, &ra_group, &ino_threshold);
>  		ehandler_operation(old_op);
>  		if (ctx->flags & E2F_FLAG_SIGNAL_MASK)
>  			return;
> diff --git a/e2fsck/pass2.c b/e2fsck/pass2.c
> index 7aaebce..cffaac4 100644
> --- a/e2fsck/pass2.c
> +++ b/e2fsck/pass2.c
> @@ -61,6 +61,9 @@
>   * Keeps track of how many times an inode is referenced.
>   */
>  static void deallocate_inode(e2fsck_t ctx, ext2_ino_t ino, char* block_buf);
> +static int check_dir_block2(ext2_filsys fs,
> +			   struct ext2_db_entry2 *dir_blocks_info,
> +			   void *priv_data);
>  static int check_dir_block(ext2_filsys fs,
>  			   struct ext2_db_entry2 *dir_blocks_info,
>  			   void *priv_data);
> @@ -77,6 +80,9 @@ struct check_dir_struct {
>  	struct problem_context	pctx;
>  	int	count, max;
>  	e2fsck_t ctx;
> +	unsigned long long list_offset;
> +	unsigned long long ra_entries;
> +	unsigned long long next_ra_off;
>  };
>  
>  void e2fsck_pass2(e2fsck_t ctx)
> @@ -96,6 +102,9 @@ void e2fsck_pass2(e2fsck_t ctx)
>  	int			i, depth;
>  	problem_t		code;
>  	int			bad_dir;
> +	int (*check_dir_func)(ext2_filsys fs,
> +			      struct ext2_db_entry2 *dir_blocks_info,
> +			      void *priv_data);
>  
>  	init_resource_track(&rtrack, ctx->fs->io);
>  	clear_problem_context(&cd.pctx);
> @@ -139,6 +148,9 @@ void e2fsck_pass2(e2fsck_t ctx)
>  	cd.ctx = ctx;
>  	cd.count = 1;
>  	cd.max = ext2fs_dblist_count2(fs->dblist);
> +	cd.list_offset = 0;
> +	cd.ra_entries = ctx->readahead_kb * 1024 / ctx->fs->blocksize;
> +	cd.next_ra_off = 0;
>  
>  	if (ctx->progress)
>  		(void) (ctx->progress)(ctx, 2, 0, cd.max);
> @@ -146,7 +158,8 @@ void e2fsck_pass2(e2fsck_t ctx)
>  	if (fs->super->s_feature_compat & EXT2_FEATURE_COMPAT_DIR_INDEX)
>  		ext2fs_dblist_sort2(fs->dblist, special_dir_block_cmp);
>  
> -	cd.pctx.errcode = ext2fs_dblist_iterate2(fs->dblist, check_dir_block,
> +	check_dir_func = cd.ra_entries ? check_dir_block2 : check_dir_block;
> +	cd.pctx.errcode = ext2fs_dblist_iterate2(fs->dblist, check_dir_func,
>  						 &cd);
>  	if (ctx->flags & E2F_FLAG_SIGNAL_MASK || ctx->flags & E2F_FLAG_RESTART)
>  		return;
> @@ -825,6 +838,29 @@ err:
>  	return retval;
>  }
>  
> +static int check_dir_block2(ext2_filsys fs,
> +			   struct ext2_db_entry2 *db,
> +			   void *priv_data)
> +{
> +	int err;
> +	struct check_dir_struct *cd = priv_data;
> +
> +	if (cd->ra_entries && cd->list_offset >= cd->next_ra_off) {
> +		err = e2fsck_readahead_dblist(fs,
> +					E2FSCK_RA_DBLIST_IGNORE_BLOCKCNT,
> +					fs->dblist,
> +					cd->list_offset + cd->ra_entries / 8,
> +					cd->ra_entries);
> +		if (err)
> +			cd->ra_entries = 0;
> +		cd->next_ra_off = cd->list_offset + (cd->ra_entries * 7 / 8);
> +	}
> +
> +	err = check_dir_block(fs, db, priv_data);
> +	cd->list_offset++;
> +	return err;
> +}
> +
>  static int check_dir_block(ext2_filsys fs,
>  			   struct ext2_db_entry2 *db,
>  			   void *priv_data)
> diff --git a/e2fsck/pass4.c b/e2fsck/pass4.c
> index 21d93f0..bc9a2c4 100644
> --- a/e2fsck/pass4.c
> +++ b/e2fsck/pass4.c
> @@ -106,6 +106,15 @@ void e2fsck_pass4(e2fsck_t ctx)
>  #ifdef MTRACE
>  	mtrace_print("Pass 4");
>  #endif
> +	/*
> +	 * Since pass4 is mostly CPU bound, start readahead of bitmaps
> +	 * ahead of pass 5 if we haven't already loaded them.
> +	 */
> +	if (ctx->readahead_kb &&
> +	    (fs->block_map == NULL || fs->inode_map == NULL))
> +		e2fsck_readahead(fs, E2FSCK_READA_BBITMAP |
> +				     E2FSCK_READA_IBITMAP,
> +				 0, fs->group_desc_count);
>  
>  	clear_problem_context(&pctx);
>  
> diff --git a/e2fsck/unix.c b/e2fsck/unix.c
> index 615d690..f3672c0 100644
> --- a/e2fsck/unix.c
> +++ b/e2fsck/unix.c
> @@ -650,6 +650,7 @@ static void parse_extended_opts(e2fsck_t ctx, const char *opts)
>  	char	*buf, *token, *next, *p, *arg;
>  	int	ea_ver;
>  	int	extended_usage = 0;
> +	unsigned long long reada_kb;
>  
>  	buf = string_copy(ctx, opts, 0);
>  	for (token = buf; token && *token; token = next) {
> @@ -678,6 +679,15 @@ static void parse_extended_opts(e2fsck_t ctx, const char *opts)
>  				continue;
>  			}
>  			ctx->ext_attr_ver = ea_ver;
> +		} else if (strcmp(token, "readahead_kb") == 0) {
> +			reada_kb = strtoull(arg, &p, 0);
> +			if (*p) {
> +				fprintf(stderr, "%s",
> +					_("Invalid readahead buffer size.\n"));
> +				extended_usage++;
> +				continue;
> +			}
> +			ctx->readahead_kb = reada_kb;
>  		} else if (strcmp(token, "fragcheck") == 0) {
>  			ctx->options |= E2F_OPT_FRAGCHECK;
>  			continue;
> @@ -717,6 +727,7 @@ static void parse_extended_opts(e2fsck_t ctx, const char *opts)
>  		fputs(("\tjournal_only\n"), stderr);
>  		fputs(("\tdiscard\n"), stderr);
>  		fputs(("\tnodiscard\n"), stderr);
> +		fputs(("\treadahead_kb=<buffer size>\n"), stderr);
>  		fputc('\n', stderr);
>  		exit(1);
>  	}
> @@ -750,6 +761,7 @@ static errcode_t PRS(int argc, char *argv[], e2fsck_t *ret_ctx)
>  #ifdef CONFIG_JBD_DEBUG
>  	char 		*jbd_debug;
>  #endif
> +	unsigned long long phys_mem_kb;
>  
>  	retval = e2fsck_allocate_context(&ctx);
>  	if (retval)
> @@ -777,6 +789,8 @@ static errcode_t PRS(int argc, char *argv[], e2fsck_t *ret_ctx)
>  	else
>  		ctx->program_name = "e2fsck";
>  
> +	phys_mem_kb = get_memory_size() / 1024;
> +	ctx->readahead_kb = ~0ULL;
>  	while ((c = getopt (argc, argv, "panyrcC:B:dE:fvtFVM:b:I:j:P:l:L:N:SsDk")) != EOF)
>  		switch (c) {
>  		case 'C':
> @@ -961,6 +975,20 @@ static errcode_t PRS(int argc, char *argv[], e2fsck_t *ret_ctx)
>  	if (c)
>  		verbose = 1;
>  
> +	if (ctx->readahead_kb == ~0ULL) {
> +		profile_get_integer(ctx->profile, "options",
> +				    "readahead_mem_pct", 0, -1, &c);
> +		if (c >= 0 && c <= 100)
> +			ctx->readahead_kb = phys_mem_kb * c / 100;
> +		profile_get_integer(ctx->profile, "options",
> +				    "readahead_kb", 0, -1, &c);
> +		if (c >= 0)
> +			ctx->readahead_kb = c;
> +		if (ctx->readahead_kb != ~0ULL &&
> +		    ctx->readahead_kb > phys_mem_kb)
> +			ctx->readahead_kb = phys_mem_kb;
> +	}
> +
>  	/* Turn off discard in read-only mode */
>  	if ((ctx->options & E2F_OPT_NO) &&
>  	    (ctx->options & E2F_OPT_DISCARD))
> diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
> index 34b9132..84e4e1f 100644
> --- a/lib/ext2fs/ext2fs.h
> +++ b/lib/ext2fs/ext2fs.h
> @@ -1418,6 +1418,7 @@ extern errcode_t ext2fs_get_next_inode_full(ext2_inode_scan scan,
>  					    ext2_ino_t *ino,
>  					    struct ext2_inode *inode,
>  					    int bufsize);
> +#define EXT2_INODE_SCAN_DEFAULT_BUFFER_BLOCKS	8
>  extern errcode_t ext2fs_open_inode_scan(ext2_filsys fs, int buffer_blocks,
>  				  ext2_inode_scan *ret_scan);
>  extern void ext2fs_close_inode_scan(ext2_inode_scan scan);
> diff --git a/lib/ext2fs/inode.c b/lib/ext2fs/inode.c
> index 4310b82..4b3e14e 100644
> --- a/lib/ext2fs/inode.c
> +++ b/lib/ext2fs/inode.c
> @@ -175,7 +175,8 @@ errcode_t ext2fs_open_inode_scan(ext2_filsys fs, int buffer_blocks,
>  	scan->bytes_left = 0;
>  	scan->current_group = 0;
>  	scan->groups_left = fs->group_desc_count - 1;
> -	scan->inode_buffer_blocks = buffer_blocks ? buffer_blocks : 8;
> +	scan->inode_buffer_blocks = buffer_blocks ? buffer_blocks :
> +				    EXT2_INODE_SCAN_DEFAULT_BUFFER_BLOCKS;
>  	scan->current_block = ext2fs_inode_table_loc(scan->fs,
>  						     scan->current_group);
>  	scan->inodes_left = EXT2_INODES_PER_GROUP(scan->fs->super);
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html