[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87vbmkpm2p.fsf@openvz.org>
Date: Wed, 12 Nov 2014 16:47:42 +0300
From: Dmitry Monakhov <dmonlist@...il.com>
To: Theodore Ts'o <tytso@....edu>,
Ext4 Developers List <linux-ext4@...r.kernel.org>
Cc: Theodore Ts'o <tytso@....edu>
Subject: Re: [PATCH,RFC] ext4: add lazytime mount option
Theodore Ts'o <tytso@....edu> writes:
> Add a new mount option which enables a new "lazytime" mode. This mode
> causes atime, mtime, and ctime updates to only be made to the
> in-memory version of the inode. The on-disk times will only get
> updated when (a) when the inode table block for the inode needs to be
> updated for some non-time related change involving any inode in the
> block, (b) if userspace calls fsync(), or (c) the refcount on an
> undeleted inode goes to zero (in most cases, when the last file
> descriptor assoicated with the inode is closed).
>
> This is legal according to POSIX because there are no guarantees after
> a crash unless userspace explicitly requests via a fsync(2) call. So
> in fact, this a better way of reducing the disk traffic resulting from
> atime is use lazytime instead of relatime or noatime. Enabling
> lazytime and disabling the default realtime will result in fewer extra
> disk writes, and has the benefits of being POSIX-compliant --- since
> either noatime and relatime violates POSIX.
Indeed, in fact even current implementation can not guarantee correct
mtime in case of power-fail for example:
DIO_WRITE_TASK
->file_update_time:
journal_start
save mtime in journal in tid:B
journal_stop
->direct_IO(): modify files's content which may be flushed to non
volatile storage.
POWER_FAILURE
NEW_MOUNT: journal replay will find that tid:B has no commit record and
ignore it. As result file has old mtime, but new content
>
> The lazytime mode reduces pressure on the journal spinlocks, since
> time updates resulting from calls to file_update_time() are almost
> always done using separate jbd2 handles. For allocating writes, the
> inode will need to be updated anyway when i_blocks change, and so the
> mtime updates will be folded into jbd2 handle in the ext4 write path.
>
> In addition, for workloads feature a large number of random write to a
> preallocated file, the lazytime mount option significantly reduces
> writes to the inode table. The repeated 4k writes to a single block
> will result in undesirable stress on flash devices and SMR disk
> drives. Even on conventional HDD's, the repeated writes to the inode
> table block will trigger Adjacent Track Interference (ATI) remediation
> latencies, which very negatively impact 99.9 percentile latencies ---
> which is a very big deal for web serving tiers (for example).
Also sync mtime updates is a great pain for AIO submitter
because AIO submission may be blocked for a seconds (up to 5 second in my case)
if inode is part of current committing transaction see: do_get_write_access
>
> n.b.: because of the many wins of this mode, we may want to enable
> lazytime updates by default in the future. If you know of use cases
> where having inaccurate mtime values after a crash would be extremely
> problematic, please us know at linux-ext4@...r.kernel.org.
>
> Google-Bug-Id: 18297052
Yeah we also has ticket for that :)
https://jira.sw.ru/browse/PSBM-20411
>
> Signed-off-by: Theodore Ts'o <tytso@....edu>
Patch looks good with minor nodes, please see below.
> ---
> fs/ext4/ext4.h | 3 +++
> fs/ext4/file.c | 1 +
> fs/ext4/fsync.c | 3 +++
> fs/ext4/inode.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
> fs/ext4/namei.c | 2 ++
> fs/ext4/super.c | 14 ++++++++++++
> fs/ext4/symlink.c | 2 ++
> fs/inode.c | 36 +++++++++++++++++++++++++++++++
> include/linux/fs.h | 2 ++
> 9 files changed, 124 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index c55a1fa..494c504 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -970,6 +970,7 @@ struct ext4_inode_info {
> #define EXT4_MOUNT_ERRORS_MASK 0x00070
> #define EXT4_MOUNT_MINIX_DF 0x00080 /* Mimics the Minix statfs */
> #define EXT4_MOUNT_NOLOAD 0x00100 /* Don't use existing journal*/
> +#define EXT4_MOUNT_LAZYTIME 0x00200 /* Update the time lazily */
> #define EXT4_MOUNT_DATA_FLAGS 0x00C00 /* Mode for data writes: */
> #define EXT4_MOUNT_JOURNAL_DATA 0x00400 /* Write data to journal */
> #define EXT4_MOUNT_ORDERED_DATA 0x00800 /* Flush data before commit */
> @@ -1407,6 +1408,7 @@ enum {
> EXT4_STATE_MAY_INLINE_DATA, /* may have in-inode data */
> EXT4_STATE_ORDERED_MODE, /* data=ordered mode */
> EXT4_STATE_EXT_PRECACHED, /* extents have been precached */
> + EXT4_STATE_DIRTY_TIME, /* the time needs to be updated */
> };
>
> #define EXT4_INODE_BIT_FNS(name, field, offset) \
> @@ -2114,6 +2116,7 @@ extern int ext4_write_inode(struct inode *, struct writeback_control *);
> extern int ext4_setattr(struct dentry *, struct iattr *);
> extern int ext4_getattr(struct vfsmount *mnt, struct dentry *dentry,
> struct kstat *stat);
> +extern int ext4_update_time(struct inode *, struct timespec *, int);
> extern void ext4_evict_inode(struct inode *);
> extern void ext4_clear_inode(struct inode *);
> extern int ext4_sync_inode(handle_t *, struct inode *);
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 8131be8..2cf6aaf 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -603,6 +603,7 @@ const struct file_operations ext4_file_operations = {
> const struct inode_operations ext4_file_inode_operations = {
> .setattr = ext4_setattr,
> .getattr = ext4_getattr,
> + .update_time = ext4_update_time,
> .setxattr = generic_setxattr,
> .getxattr = generic_getxattr,
> .listxattr = ext4_listxattr,
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index a8bc47f..ba05c83 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -116,6 +116,9 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
> if (ret)
> return ret;
> +
> + if (!datasync && ext4_test_inode_state(inode, EXT4_STATE_DIRTY_TIME))
> + ext4_dirty_inode(inode, 0);
> /*
> * data=writeback,ordered:
> * The caller's filemap_fdatawrite()/wait will sync the data.
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 3356ab5..1b5e4bd 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4163,6 +4163,46 @@ static int ext4_inode_blocks_set(handle_t *handle,
> }
>
> /*
> + * Opportunistically update the other time fields for other inodes in
> + * the same inode table block.
> + */
> +static void ext4_update_other_inodes_time(struct super_block *sb,
> + unsigned long orig_ino, char *buf)
> +{
> + struct ext4_inode_info *ei;
> + struct ext4_inode *raw_inode;
> + unsigned long ino;
> + struct inode *inode;
> + int i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
> + int inode_size = EXT4_INODE_SIZE(sb);
> +
> + ino = orig_ino & ~(inodes_per_block - 1);
> + for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
> + if (ino == orig_ino)
> + continue;
> + inode = find_active_inode_nowait(sb, ino);
> + if (!inode ||
> + !ext4_test_inode_state(inode, EXT4_STATE_DIRTY_TIME)) {
> + iput(inode);
> + continue;
> + }
> + raw_inode = (struct ext4_inode *) buf;
> + ei = EXT4_I(inode);
> +
> + smp_mb__before_spinlock();
> + spin_lock(&ei->i_raw_lock);
> + ext4_clear_inode_state(inode, EXT4_STATE_DIRTY_TIME);
> + EXT4_INODE_GET_XTIME(i_ctime, inode, raw_inode);
> + EXT4_INODE_GET_XTIME(i_mtime, inode, raw_inode);
> + EXT4_INODE_GET_XTIME(i_atime, inode, raw_inode);
> + EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode);
> + ext4_inode_csum_set(inode, raw_inode, ei);
> + spin_unlock(&ei->i_raw_lock);
> + iput(inode);
> + }
> +}
> +
> +/*
> * Post the struct inode info into an on-disk inode location in the
> * buffer-cache. This gobbles the caller's reference to the
> * buffer_head in the inode location struct.
> @@ -4182,7 +4222,9 @@ static int ext4_do_update_inode(handle_t *handle,
> uid_t i_uid;
> gid_t i_gid;
>
> + smp_mb__before_spinlock();
> spin_lock(&ei->i_raw_lock);
> + ext4_clear_inode_state(inode, EXT4_STATE_DIRTY_TIME);
>
> /* For fields not tracked in the in-memory inode,
> * initialise them to zero for new inodes. */
> @@ -4273,8 +4315,8 @@ static int ext4_do_update_inode(handle_t *handle,
> }
>
> ext4_inode_csum_set(inode, raw_inode, ei);
> -
> spin_unlock(&ei->i_raw_lock);
> + ext4_update_other_inodes_time(inode->i_sb, inode->i_ino, bh->b_data);
>
> BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
> rc = ext4_handle_dirty_metadata(handle, NULL, bh);
> @@ -4622,6 +4664,24 @@ int ext4_getattr(struct vfsmount *mnt, struct dentry *dentry,
> return 0;
> }
>
> +int ext4_update_time(struct inode *inode, struct timespec *time, int flags)
> +{
> + if (flags & S_ATIME)
> + inode->i_atime = *time;
> + if (flags & S_VERSION)
> + inode_inc_iversion(inode);
> + if (flags & S_CTIME)
> + inode->i_ctime = *time;
> + if (flags & S_MTIME)
> + inode->i_mtime = *time;
> + if (test_opt(inode->i_sb, LAZYTIME)) {
> + smp_wmb();
> + ext4_set_inode_state(inode, EXT4_STATE_DIRTY_TIME);
Since we want update all in-memory data we also have to explicitly update inode->i_version
Which was previously updated implicitly here:
mark_inode_dirty_sync()
->__mark_inode_dirty
->ext4_dirty_inode
->ext4_mark_inode_dirty
->ext4_mark_iloc_dirty
->inode_inc_iversion(inode);
> + } else
> + mark_inode_dirty_sync(inode);
> + return 0;
> +}
> +
> static int ext4_index_trans_blocks(struct inode *inode, int lblocks,
> int pextents)
> {
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 4262118..f782040 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -3532,6 +3532,7 @@ const struct inode_operations ext4_dir_inode_operations = {
> .tmpfile = ext4_tmpfile,
> .rename2 = ext4_rename2,
> .setattr = ext4_setattr,
> + .update_time = ext4_update_time,
> .setxattr = generic_setxattr,
> .getxattr = generic_getxattr,
> .listxattr = ext4_listxattr,
> @@ -3545,6 +3546,7 @@ const struct inode_operations ext4_special_inode_operations = {
> .setattr = ext4_setattr,
> .setxattr = generic_setxattr,
> .getxattr = generic_getxattr,
> + .update_time = ext4_update_time,
> .listxattr = ext4_listxattr,
> .removexattr = generic_removexattr,
> .get_acl = ext4_get_acl,
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 2c9e686..16c9983 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -910,6 +910,14 @@ static int ext4_drop_inode(struct inode *inode)
> int drop = generic_drop_inode(inode);
>
> trace_ext4_drop_inode(inode, drop);
> + if (!drop && ext4_test_inode_state(inode, EXT4_STATE_DIRTY_TIME)) {
> + atomic_inc(&inode->i_count);
> + spin_unlock(&inode->i_lock);
> + ext4_dirty_inode(inode, 0);
> + spin_lock(&inode->i_lock);
> + if (atomic_dec_and_test(&inode->i_count))
> + drop = generic_drop_inode(inode);
> + }
> return drop;
> }
>
> @@ -1142,6 +1150,7 @@ enum {
> Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
> Opt_usrquota, Opt_grpquota, Opt_i_version,
> Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
> + Opt_lazytime, Opt_nolazytime,
> Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
> Opt_inode_readahead_blks, Opt_journal_ioprio,
> Opt_dioread_nolock, Opt_dioread_lock,
> @@ -1204,6 +1213,8 @@ static const match_table_t tokens = {
> {Opt_i_version, "i_version"},
> {Opt_stripe, "stripe=%u"},
> {Opt_delalloc, "delalloc"},
> + {Opt_lazytime, "lazytime"},
> + {Opt_nolazytime, "nolazytime"},
> {Opt_nodelalloc, "nodelalloc"},
> {Opt_removed, "mblk_io_submit"},
> {Opt_removed, "nomblk_io_submit"},
> @@ -1361,6 +1372,8 @@ static const struct mount_opts {
> MOPT_EXT4_ONLY | MOPT_SET | MOPT_EXPLICIT},
> {Opt_nodelalloc, EXT4_MOUNT_DELALLOC,
> MOPT_EXT4_ONLY | MOPT_CLEAR},
> + {Opt_lazytime, EXT4_MOUNT_LAZYTIME, MOPT_SET},
> + {Opt_nolazytime, EXT4_MOUNT_LAZYTIME, MOPT_CLEAR},
> {Opt_journal_checksum, EXT4_MOUNT_JOURNAL_CHECKSUM,
> MOPT_EXT4_ONLY | MOPT_SET},
> {Opt_journal_async_commit, (EXT4_MOUNT_JOURNAL_ASYNC_COMMIT |
> @@ -3514,6 +3527,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>
> /* Set defaults before we parse the mount options */
> def_mount_opts = le32_to_cpu(es->s_default_mount_opts);
> + set_opt(sb, LAZYTIME);
> set_opt(sb, INIT_INODE_TABLE);
> if (def_mount_opts & EXT4_DEFM_DEBUG)
> set_opt(sb, DEBUG);
> diff --git a/fs/ext4/symlink.c b/fs/ext4/symlink.c
> index ff37119..7c92b93 100644
> --- a/fs/ext4/symlink.c
> +++ b/fs/ext4/symlink.c
> @@ -35,6 +35,7 @@ const struct inode_operations ext4_symlink_inode_operations = {
> .follow_link = page_follow_link_light,
> .put_link = page_put_link,
> .setattr = ext4_setattr,
> + .update_time = ext4_update_time,
> .setxattr = generic_setxattr,
> .getxattr = generic_getxattr,
> .listxattr = ext4_listxattr,
> @@ -45,6 +46,7 @@ const struct inode_operations ext4_fast_symlink_inode_operations = {
> .readlink = generic_readlink,
> .follow_link = ext4_follow_link,
> .setattr = ext4_setattr,
> + .update_time = ext4_update_time,
> .setxattr = generic_setxattr,
> .getxattr = generic_getxattr,
> .listxattr = ext4_listxattr,
> diff --git a/fs/inode.c b/fs/inode.c
> index 26753ba..cde073a 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1280,6 +1280,42 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino)
> }
> EXPORT_SYMBOL(ilookup);
>
> +/**
> + * find_active_inode_nowait - find an active inode in the inode cache
> + * @sb: super block of file system to search
> + * @ino: inode number to search for
> + *
> + * Search for an active inode @ino in the inode cache, and if the
> + * inode is in the cache, the inode is returned with an incremented
> + * reference count. If the inode is being freed or is newly
> + * initialized, return nothing instead of trying to wait for the inode
> + * initialization or destruction to be complete.
> + */
> +struct inode *find_active_inode_nowait(struct super_block *sb,
> + unsigned long ino)
> +{
> + struct hlist_head *head = inode_hashtable + hash(sb, ino);
> + struct inode *inode, *ret_inode = NULL;
> +
> + spin_lock(&inode_hash_lock);
> + hlist_for_each_entry(inode, head, i_hash) {
> + if ((inode->i_ino != ino) ||
> + (inode->i_sb != sb))
> + continue;
> + spin_lock(&inode->i_lock);
> + if ((inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) == 0) {
> + __iget(inode);
> + ret_inode = inode;
> + }
> + spin_unlock(&inode->i_lock);
> + goto out;
> + }
> +out:
> + spin_unlock(&inode_hash_lock);
> + return ret_inode;
> +}
> +EXPORT_SYMBOL(find_active_inode_nowait);
> +
> int insert_inode_locked(struct inode *inode)
> {
> struct super_block *sb = inode->i_sb;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 9ab779e..b5e6b6b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2410,6 +2410,8 @@ extern struct inode *ilookup(struct super_block *sb, unsigned long ino);
>
> extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
> extern struct inode * iget_locked(struct super_block *, unsigned long);
> +extern struct inode *find_active_inode_nowait(struct super_block *,
> + unsigned long);
> extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struct inode *, void *), void *);
> extern int insert_inode_locked(struct inode *);
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> --
> 2.1.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Download attachment "signature.asc" of type "application/pgp-signature" (473 bytes)
Powered by blists - more mailing lists