Message-ID: <20141210202756.GA21938@birch.djwong.org>
Date: Wed, 10 Dec 2014 12:27:56 -0800
From: "Darrick J. Wong" <darrick.wong@...cle.com>
To: tytso@....edu
Cc: linux-ext4@...r.kernel.org
Subject: Re: [PATCH 32/47] e2fsck: read-ahead metadata during passes 1, 2, and 4
On Fri, Nov 07, 2014 at 01:54:14PM -0800, Darrick J. Wong wrote:
> e2fsck pass1 is modified to use the block group data prefetch function
> to try to fetch the inode tables into the pagecache before it is
> needed. We iterate through the blockgroups until we have enough inode
> tables that need reading such that we can issue readahead; then we wait
> until the scan reaches the last inode table block of the last group
> read ahead before fetching the next batch.
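
To illustrate the pass1 batching described above, here is a standalone sketch
of the group-counting step. It is not code from the patch: struct group_info
and groups_to_prefetch() are hypothetical names, and the clamp at the end of
the group list is an assumption of this sketch rather than something the
patch is known to do.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-group data e2fsck consults. */
struct group_info {
	int      inode_uninit;  /* nonzero: inode table was never initialized */
	uint32_t itable_unused; /* unused inodes at the end of the inode table */
};

/*
 * Starting at group `start`, walk forward until the in-use inode table
 * blocks seen so far exceed the readahead budget.  Returns the number of
 * groups to hand to the prefetcher (start through the stopping group).
 */
static uint32_t groups_to_prefetch(const struct group_info *groups,
				   uint32_t group_count, uint32_t start,
				   uint32_t inodes_per_group,
				   uint32_t inodes_per_block,
				   uint32_t block_size,
				   uint64_t readahead_kb)
{
	uint64_t blocks = 0;
	uint32_t grp;

	assert(inodes_per_block > 0 && start < group_count);
	for (grp = start; grp < group_count; grp++) {
		uint32_t in_use;

		/* Uninitialized inode tables hold nothing worth reading. */
		if (groups[grp].inode_uninit)
			continue;
		in_use = inodes_per_group - groups[grp].itable_unused;
		/* Round up to whole inode table blocks. */
		blocks += (in_use + inodes_per_block - 1) / inodes_per_block;
		if (blocks * block_size > readahead_kb * 1024)
			break;
	}
	/* Assumption of this sketch: never count past the last group. */
	if (grp == group_count)
		grp = group_count - 1;
	return grp - start + 1;
}
```

With 8192 inodes per group, 16 inodes per 4KiB block and a 2048KiB budget,
one full group fills the budget exactly, so the walk stops only at the
second group.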
>
> pass2 is modified to use the dirblock prefetching function to prefetch
> the list of directory blocks that are assembled in pass1. We use the
> "iterate a subset of a dblist" function to avoid copying the dblist. Directory
> blocks are fetched incrementally as we walk through the directory
> block list. In previous iterations of this patch we would free the
> directory blocks after processing, but the performance hit to e2fsck
> itself wasn't worth it. Furthermore, it is anticipated that most
> users will then mount the FS and start using the directories, so they
> may as well remain in the page cache.
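
The incremental fetching in pass2 can be pictured as a sliding window over
the dblist. The sketch below isolates the window arithmetic that
check_dir_block2() in the diff uses (a request of ra_entries blocks starting
ra_entries/8 ahead, re-armed after 7/8 of a window). The struct and function
names are hypothetical, and no I/O is issued; the caller would pass the
request to the real prefetch function.

```c
#include <stdint.h>

/* Hypothetical container for the readahead window state. */
struct ra_window {
	uint64_t ra_entries;  /* window size in directory blocks; 0 = disabled */
	uint64_t next_ra_off; /* list offset at which to issue the next request */
};

/* A readahead request: `count` dblist entries starting at entry `start`. */
struct ra_request {
	uint64_t start;
	uint64_t count;
};

/*
 * Called once per directory block, with that block's offset in the dblist.
 * Returns 1 and fills *req when a new readahead request is due, else 0.
 */
static int ra_window_step(struct ra_window *w, uint64_t list_offset,
			  struct ra_request *req)
{
	if (w->ra_entries == 0 || list_offset < w->next_ra_off)
		return 0;
	/* Start slightly ahead of the checker and cover one full window. */
	req->start = list_offset + w->ra_entries / 8;
	req->count = w->ra_entries;
	/* Re-arm before the window is exhausted so the requests overlap. */
	w->next_ra_off = list_offset + w->ra_entries * 7 / 8;
	return 1;
}
```

For a 512-entry window, the first request covers entries 64 through 575 and
the next one is issued when the checker reaches entry 448.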
>
> pass4 is modified to prefetch the block and inode bitmaps in
> anticipation of pass 5, because pass4 is entirely CPU bound.
>
> In general, these mechanisms can decrease fsck time by 10-40%, if the
> host system has sufficient memory and the storage system can provide a
> lot of IOPS. Pretty much any storage system capable of handling
> multiple IOs in flight at any time will see a fairly large performance
> boost. (Single-issue USB mass storage disks seem to suffer badly.)
FWIW I finally got UAS working in Linux; with that hardware and the default
readahead size, I see about a 5% reduction in fsck runtime. The same dock
plugged into a USB2 port (i.e. BOT) shows about a 2% increase in runtime.
--D
>
> By default, the readahead buffer size will be set to the size of a block
> group's inode table (which is 2MiB for a regular ext4 FS). The -E
> readahead_kb= option can be given to specify the amount of memory to
> use for readahead or zero to disable it entirely; or an option can be
> given in e2fsck.conf.
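
As a rough sketch of how the command-line option and the two e2fsck.conf
settings combine, the function below mirrors the order of the checks in the
unix.c hunk of the diff. resolve_readahead_kb() and READAHEAD_UNSET are
hypothetical names; the config values are passed in as plain ints with -1
meaning "not set", which stands in for the profile lookups.

```c
#include <stdint.h>

/* Mirrors the patch's ~0ULL "not specified" sentinel. */
#define READAHEAD_UNSET UINT64_MAX

/*
 * cmdline_kb:  value from -E readahead_kb=, or READAHEAD_UNSET
 * conf_pct:    readahead_mem_pct from e2fsck.conf, or -1 if absent
 * conf_kb:     readahead_kb from e2fsck.conf, or -1 if absent
 * phys_mem_kb: total physical memory in KiB
 */
static uint64_t resolve_readahead_kb(uint64_t cmdline_kb, int conf_pct,
				     int conf_kb, uint64_t phys_mem_kb)
{
	uint64_t kb = cmdline_kb;

	/* A command-line value is used as given. */
	if (kb != READAHEAD_UNSET)
		return kb;
	/* A percentage of physical memory, if one was configured. */
	if (conf_pct >= 0 && conf_pct <= 100)
		kb = phys_mem_kb * (uint64_t)conf_pct / 100;
	/* An absolute size in the config file overrides the percentage. */
	if (conf_kb >= 0)
		kb = (uint64_t)conf_kb;
	/* Config-derived sizes are capped at physical memory. */
	if (kb != READAHEAD_UNSET && kb > phys_mem_kb)
		kb = phys_mem_kb;
	return kb;
}
```

If nothing is configured the sentinel is returned unchanged, which is the
case where pass1 in the diff falls back to e2fsck_guess_readahead().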
>
> v2: Fix an off-by-one error in the pass1 readahead which made the
> readahead trigger one inode too late if the block groups are full.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@...cle.com>
> ---
> e2fsck/e2fsck.8.in | 7 +++++
> e2fsck/e2fsck.conf.5.in | 15 +++++++++++
> e2fsck/e2fsck.h | 3 ++
> e2fsck/pass1.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++
> e2fsck/pass2.c | 38 +++++++++++++++++++++++++++
> e2fsck/pass4.c | 9 +++++++
> e2fsck/unix.c | 28 ++++++++++++++++++++
> lib/ext2fs/ext2fs.h | 1 +
> lib/ext2fs/inode.c | 3 +-
> 9 files changed, 167 insertions(+), 2 deletions(-)
>
>
> diff --git a/e2fsck/e2fsck.8.in b/e2fsck/e2fsck.8.in
> index f5ed758..84ae50f 100644
> --- a/e2fsck/e2fsck.8.in
> +++ b/e2fsck/e2fsck.8.in
> @@ -207,6 +207,13 @@ option may prevent you from further manual data recovery.
> .BI nodiscard
> Do not attempt to discard free blocks and unused inode blocks. This option is
> exactly the opposite of discard option. This is set as default.
> +.TP
> +.BI readahead_kb
> +Use this many KiB of memory to pre-fetch metadata in the hopes of reducing
> +e2fsck runtime. By default, this is set to the size of a block group's inode
> +table (typically 2MiB on a regular ext4 filesystem); if this amount is more
> +than 1/100 of total physical memory, readahead is disabled. Set this to zero
> +to disable readahead entirely.
> .RE
> .TP
> .B \-f
> diff --git a/e2fsck/e2fsck.conf.5.in b/e2fsck/e2fsck.conf.5.in
> index 9ebfbbf..e1d0518 100644
> --- a/e2fsck/e2fsck.conf.5.in
> +++ b/e2fsck/e2fsck.conf.5.in
> @@ -205,6 +205,21 @@ of that type are squelched. This can be useful if the console is slow
> (i.e., connected to a serial port) and so a large amount of output could
> end up delaying the boot process for a long time (potentially hours).
> .TP
> +.I readahead_mem_pct
> +Use this percentage of memory to try to read in metadata blocks ahead of the
> +main e2fsck thread. This should reduce run times, depending on the speed of
> +the underlying storage and the amount of free memory. There is no default, but
> +see
> +.B readahead_kb
> +for more details.
> +.TP
> +.I readahead_kb
> +Use this amount of memory to read in metadata blocks ahead of the main checking
> +thread. Setting this value to zero disables readahead entirely. By default,
> +this is set to the size of one block group's inode table (typically 2MiB on a
> +regular ext4 filesystem); if this amount is more than 1/100th of total physical
> +memory, readahead is disabled.
> +.TP
> .I report_features
> If this boolean relation is true, e2fsck will print the file system
> features as part of its verbose reporting (i.e., if the
> diff --git a/e2fsck/e2fsck.h b/e2fsck/e2fsck.h
> index 0252824..e359515 100644
> --- a/e2fsck/e2fsck.h
> +++ b/e2fsck/e2fsck.h
> @@ -378,6 +378,9 @@ struct e2fsck_struct {
> */
> void *priv_data;
> ext2fs_block_bitmap block_metadata_map; /* Metadata blocks */
> +
> + /* How much are we allowed to readahead? */
> + unsigned long long readahead_kb;
> };
>
> /* Used by the region allocation code */
> diff --git a/e2fsck/pass1.c b/e2fsck/pass1.c
> index 4cc58c4..a963849 100644
> --- a/e2fsck/pass1.c
> +++ b/e2fsck/pass1.c
> @@ -868,6 +868,60 @@ out:
> return 0;
> }
>
> +static void pass1_readahead(e2fsck_t ctx, dgrp_t *group, ext2_ino_t *next_ino)
> +{
> + ext2_ino_t inodes_in_group = 0, inodes_per_block, inodes_per_buffer;
> + dgrp_t start = *group, grp;
> + blk64_t blocks_to_read = 0;
> + errcode_t err = EXT2_ET_INVALID_ARGUMENT;
> +
> + if (ctx->readahead_kb == 0)
> + goto out;
> +
> + /* Keep iterating groups until we have enough to readahead */
> + inodes_per_block = EXT2_INODES_PER_BLOCK(ctx->fs->super);
> + for (grp = start; grp < ctx->fs->group_desc_count; grp++) {
> + if (ext2fs_bg_flags_test(ctx->fs, grp, EXT2_BG_INODE_UNINIT))
> + continue;
> + inodes_in_group = ctx->fs->super->s_inodes_per_group -
> + ext2fs_bg_itable_unused(ctx->fs, grp);
> + blocks_to_read += (inodes_in_group + inodes_per_block - 1) /
> + inodes_per_block;
> + if (blocks_to_read * ctx->fs->blocksize >
> + ctx->readahead_kb * 1024)
> + break;
> + }
> +
> + err = e2fsck_readahead(ctx->fs, E2FSCK_READA_ITABLE, start,
> + grp - start + 1);
> + if (err == EAGAIN) {
> + ctx->readahead_kb /= 2;
> + err = 0;
> + }
> +
> +out:
> + if (err) {
> + /* Error; disable itable readahead */
> + *group = ctx->fs->group_desc_count;
> + *next_ino = ctx->fs->super->s_inodes_count;
> + } else {
> + /*
> + * Don't do more readahead until we've reached the first inode
> + * of the last inode scan buffer block for the last group.
> + */
> + *group = grp + 1;
> + inodes_per_buffer = (ctx->inode_buffer_blocks ?
> + ctx->inode_buffer_blocks :
> + EXT2_INODE_SCAN_DEFAULT_BUFFER_BLOCKS) *
> + ctx->fs->blocksize /
> + EXT2_INODE_SIZE(ctx->fs->super);
> + inodes_in_group--;
> + *next_ino = inodes_in_group -
> + (inodes_in_group % inodes_per_buffer) + 1 +
> + (grp * ctx->fs->super->s_inodes_per_group);
> + }
> +}
> +
> void e2fsck_pass1(e2fsck_t ctx)
> {
> int i;
> @@ -890,10 +944,19 @@ void e2fsck_pass1(e2fsck_t ctx)
> int low_dtime_check = 1;
> int inode_size;
> int failed_csum = 0;
> + ext2_ino_t ino_threshold = 0;
> + dgrp_t ra_group = 0;
>
> init_resource_track(&rtrack, ctx->fs->io);
> clear_problem_context(&pctx);
>
> + /* If we can do readahead, figure out how many groups to pull in. */
> + if (!e2fsck_can_readahead(ctx->fs))
> + ctx->readahead_kb = 0;
> + else if (ctx->readahead_kb == ~0ULL)
> + ctx->readahead_kb = e2fsck_guess_readahead(ctx->fs);
> + pass1_readahead(ctx, &ra_group, &ino_threshold);
> +
> if (!(ctx->options & E2F_OPT_PREEN))
> fix_problem(ctx, PR_1_PASS_HEADER, &pctx);
>
> @@ -1073,6 +1136,8 @@ void e2fsck_pass1(e2fsck_t ctx)
> old_op = ehandler_operation(_("getting next inode from scan"));
> pctx.errcode = ext2fs_get_next_inode_full(scan, &ino,
> inode, inode_size);
> + if (ino > ino_threshold)
> + pass1_readahead(ctx, &ra_group, &ino_threshold);
> ehandler_operation(old_op);
> if (ctx->flags & E2F_FLAG_SIGNAL_MASK)
> return;
> diff --git a/e2fsck/pass2.c b/e2fsck/pass2.c
> index 7aaebce..cffaac4 100644
> --- a/e2fsck/pass2.c
> +++ b/e2fsck/pass2.c
> @@ -61,6 +61,9 @@
> * Keeps track of how many times an inode is referenced.
> */
> static void deallocate_inode(e2fsck_t ctx, ext2_ino_t ino, char* block_buf);
> +static int check_dir_block2(ext2_filsys fs,
> + struct ext2_db_entry2 *dir_blocks_info,
> + void *priv_data);
> static int check_dir_block(ext2_filsys fs,
> struct ext2_db_entry2 *dir_blocks_info,
> void *priv_data);
> @@ -77,6 +80,9 @@ struct check_dir_struct {
> struct problem_context pctx;
> int count, max;
> e2fsck_t ctx;
> + unsigned long long list_offset;
> + unsigned long long ra_entries;
> + unsigned long long next_ra_off;
> };
>
> void e2fsck_pass2(e2fsck_t ctx)
> @@ -96,6 +102,9 @@ void e2fsck_pass2(e2fsck_t ctx)
> int i, depth;
> problem_t code;
> int bad_dir;
> + int (*check_dir_func)(ext2_filsys fs,
> + struct ext2_db_entry2 *dir_blocks_info,
> + void *priv_data);
>
> init_resource_track(&rtrack, ctx->fs->io);
> clear_problem_context(&cd.pctx);
> @@ -139,6 +148,9 @@ void e2fsck_pass2(e2fsck_t ctx)
> cd.ctx = ctx;
> cd.count = 1;
> cd.max = ext2fs_dblist_count2(fs->dblist);
> + cd.list_offset = 0;
> + cd.ra_entries = ctx->readahead_kb * 1024 / ctx->fs->blocksize;
> + cd.next_ra_off = 0;
>
> if (ctx->progress)
> (void) (ctx->progress)(ctx, 2, 0, cd.max);
> @@ -146,7 +158,8 @@ void e2fsck_pass2(e2fsck_t ctx)
> if (fs->super->s_feature_compat & EXT2_FEATURE_COMPAT_DIR_INDEX)
> ext2fs_dblist_sort2(fs->dblist, special_dir_block_cmp);
>
> - cd.pctx.errcode = ext2fs_dblist_iterate2(fs->dblist, check_dir_block,
> + check_dir_func = cd.ra_entries ? check_dir_block2 : check_dir_block;
> + cd.pctx.errcode = ext2fs_dblist_iterate2(fs->dblist, check_dir_func,
> &cd);
> if (ctx->flags & E2F_FLAG_SIGNAL_MASK || ctx->flags & E2F_FLAG_RESTART)
> return;
> @@ -825,6 +838,29 @@ err:
> return retval;
> }
>
> +static int check_dir_block2(ext2_filsys fs,
> + struct ext2_db_entry2 *db,
> + void *priv_data)
> +{
> + int err;
> + struct check_dir_struct *cd = priv_data;
> +
> + if (cd->ra_entries && cd->list_offset >= cd->next_ra_off) {
> + err = e2fsck_readahead_dblist(fs,
> + E2FSCK_RA_DBLIST_IGNORE_BLOCKCNT,
> + fs->dblist,
> + cd->list_offset + cd->ra_entries / 8,
> + cd->ra_entries);
> + if (err)
> + cd->ra_entries = 0;
> + cd->next_ra_off = cd->list_offset + (cd->ra_entries * 7 / 8);
> + }
> +
> + err = check_dir_block(fs, db, priv_data);
> + cd->list_offset++;
> + return err;
> +}
> +
> static int check_dir_block(ext2_filsys fs,
> struct ext2_db_entry2 *db,
> void *priv_data)
> diff --git a/e2fsck/pass4.c b/e2fsck/pass4.c
> index 21d93f0..bc9a2c4 100644
> --- a/e2fsck/pass4.c
> +++ b/e2fsck/pass4.c
> @@ -106,6 +106,15 @@ void e2fsck_pass4(e2fsck_t ctx)
> #ifdef MTRACE
> mtrace_print("Pass 4");
> #endif
> + /*
> + * Since pass4 is mostly CPU bound, start readahead of bitmaps
> + * ahead of pass 5 if we haven't already loaded them.
> + */
> + if (ctx->readahead_kb &&
> + (fs->block_map == NULL || fs->inode_map == NULL))
> + e2fsck_readahead(fs, E2FSCK_READA_BBITMAP |
> + E2FSCK_READA_IBITMAP,
> + 0, fs->group_desc_count);
>
> clear_problem_context(&pctx);
>
> diff --git a/e2fsck/unix.c b/e2fsck/unix.c
> index 615d690..f3672c0 100644
> --- a/e2fsck/unix.c
> +++ b/e2fsck/unix.c
> @@ -650,6 +650,7 @@ static void parse_extended_opts(e2fsck_t ctx, const char *opts)
> char *buf, *token, *next, *p, *arg;
> int ea_ver;
> int extended_usage = 0;
> + unsigned long long reada_kb;
>
> buf = string_copy(ctx, opts, 0);
> for (token = buf; token && *token; token = next) {
> @@ -678,6 +679,15 @@ static void parse_extended_opts(e2fsck_t ctx, const char *opts)
> continue;
> }
> ctx->ext_attr_ver = ea_ver;
> + } else if (strcmp(token, "readahead_kb") == 0) {
> + reada_kb = strtoull(arg, &p, 0);
> + if (*p) {
> + fprintf(stderr, "%s",
> + _("Invalid readahead buffer size.\n"));
> + extended_usage++;
> + continue;
> + }
> + ctx->readahead_kb = reada_kb;
> } else if (strcmp(token, "fragcheck") == 0) {
> ctx->options |= E2F_OPT_FRAGCHECK;
> continue;
> @@ -717,6 +727,7 @@ static void parse_extended_opts(e2fsck_t ctx, const char *opts)
> fputs(("\tjournal_only\n"), stderr);
> fputs(("\tdiscard\n"), stderr);
> fputs(("\tnodiscard\n"), stderr);
> + fputs(("\treadahead_kb=<buffer size>\n"), stderr);
> fputc('\n', stderr);
> exit(1);
> }
> @@ -750,6 +761,7 @@ static errcode_t PRS(int argc, char *argv[], e2fsck_t *ret_ctx)
> #ifdef CONFIG_JBD_DEBUG
> char *jbd_debug;
> #endif
> + unsigned long long phys_mem_kb;
>
> retval = e2fsck_allocate_context(&ctx);
> if (retval)
> @@ -777,6 +789,8 @@ static errcode_t PRS(int argc, char *argv[], e2fsck_t *ret_ctx)
> else
> ctx->program_name = "e2fsck";
>
> + phys_mem_kb = get_memory_size() / 1024;
> + ctx->readahead_kb = ~0ULL;
> while ((c = getopt (argc, argv, "panyrcC:B:dE:fvtFVM:b:I:j:P:l:L:N:SsDk")) != EOF)
> switch (c) {
> case 'C':
> @@ -961,6 +975,20 @@ static errcode_t PRS(int argc, char *argv[], e2fsck_t *ret_ctx)
> if (c)
> verbose = 1;
>
> + if (ctx->readahead_kb == ~0ULL) {
> + profile_get_integer(ctx->profile, "options",
> + "readahead_mem_pct", 0, -1, &c);
> + if (c >= 0 && c <= 100)
> + ctx->readahead_kb = phys_mem_kb * c / 100;
> + profile_get_integer(ctx->profile, "options",
> + "readahead_kb", 0, -1, &c);
> + if (c >= 0)
> + ctx->readahead_kb = c;
> + if (ctx->readahead_kb != ~0ULL &&
> + ctx->readahead_kb > phys_mem_kb)
> + ctx->readahead_kb = phys_mem_kb;
> + }
> +
> /* Turn off discard in read-only mode */
> if ((ctx->options & E2F_OPT_NO) &&
> (ctx->options & E2F_OPT_DISCARD))
> diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
> index 34b9132..84e4e1f 100644
> --- a/lib/ext2fs/ext2fs.h
> +++ b/lib/ext2fs/ext2fs.h
> @@ -1418,6 +1418,7 @@ extern errcode_t ext2fs_get_next_inode_full(ext2_inode_scan scan,
> ext2_ino_t *ino,
> struct ext2_inode *inode,
> int bufsize);
> +#define EXT2_INODE_SCAN_DEFAULT_BUFFER_BLOCKS 8
> extern errcode_t ext2fs_open_inode_scan(ext2_filsys fs, int buffer_blocks,
> ext2_inode_scan *ret_scan);
> extern void ext2fs_close_inode_scan(ext2_inode_scan scan);
> diff --git a/lib/ext2fs/inode.c b/lib/ext2fs/inode.c
> index 4310b82..4b3e14e 100644
> --- a/lib/ext2fs/inode.c
> +++ b/lib/ext2fs/inode.c
> @@ -175,7 +175,8 @@ errcode_t ext2fs_open_inode_scan(ext2_filsys fs, int buffer_blocks,
> scan->bytes_left = 0;
> scan->current_group = 0;
> scan->groups_left = fs->group_desc_count - 1;
> - scan->inode_buffer_blocks = buffer_blocks ? buffer_blocks : 8;
> + scan->inode_buffer_blocks = buffer_blocks ? buffer_blocks :
> + EXT2_INODE_SCAN_DEFAULT_BUFFER_BLOCKS;
> scan->current_block = ext2fs_inode_table_loc(scan->fs,
> scan->current_group);
> scan->inodes_left = EXT2_INODES_PER_GROUP(scan->fs->super);
>
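
The next_ino threshold computed at the end of pass1_readahead() in the diff
above can be worked through in isolation. This is a standalone sketch with a
hypothetical function name; it assumes at least one in-use inode in the group
and takes the buffer geometry as plain parameters.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the number of the first inode in the last inode-scan buffer of
 * group `grp`.  Crossing this inode is the signal to start the next batch
 * of readahead.  Inode numbers start at 1.
 */
static uint32_t next_readahead_ino(uint32_t grp, uint32_t inodes_per_group,
				   uint32_t inodes_in_use,
				   uint32_t buffer_blocks,
				   uint32_t block_size, uint32_t inode_size)
{
	uint32_t inodes_per_buffer = buffer_blocks * block_size / inode_size;
	uint32_t last;

	assert(inodes_in_use > 0 && inodes_per_buffer > 0);
	/* Zero-based index of the last in-use inode within the group. */
	last = inodes_in_use - 1;
	/* Round down to a buffer boundary, then convert to an inode number. */
	return last - (last % inodes_per_buffer) + 1 +
	       grp * inodes_per_group;
}
```

With 8 buffer blocks of 4KiB and 256-byte inodes, a buffer holds 128 inodes,
so a full group 1 of 8192 inodes (inodes 8193 through 16384) gives a
threshold of 16257. Without the decrement, a full group would round to the
first inode of the following group, which appears to be the case the v2
note refers to.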
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html