lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Wed, 12 Nov 2014 16:47:42 +0300
From:	Dmitry Monakhov <dmonlist@...il.com>
To:	Theodore Ts'o <tytso@....edu>,
	Ext4 Developers List <linux-ext4@...r.kernel.org>
Cc:	Theodore Ts'o <tytso@....edu>
Subject: Re: [PATCH,RFC] ext4: add lazytime mount option

Theodore Ts'o <tytso@....edu> writes:

> Add a new mount option which enables a new "lazytime" mode.  This mode
> causes atime, mtime, and ctime updates to only be made to the
> in-memory version of the inode.  The on-disk times will only get
> updated when (a) when the inode table block for the inode needs to be
> updated for some non-time related change involving any inode in the
> block, (b) if userspace calls fsync(), or (c) the refcount on an
> undeleted inode goes to zero (in most cases, when the last file
> descriptor assoicated with the inode is closed).
>
> This is legal according to POSIX because there are no guarantees after
> a crash unless userspace explicitly requests via a fsync(2) call.  So
> in fact, this a better way of reducing the disk traffic resulting from
> atime is use lazytime instead of relatime or noatime.  Enabling
> lazytime and disabling the default realtime will result in fewer extra
> disk writes, and has the benefits of being POSIX-compliant --- since
> either noatime and relatime violates POSIX.
Indeed, in fact even current implementation can not guarantee correct
mtime in  case of power-fail for example:
DIO_WRITE_TASK
 ->file_update_time:
        journal_start
        save mtime in journal in tid:B
        journal_stop
 ->direct_IO(): modify files's content which may be flushed to non
                volatile storage.

POWER_FAILURE

NEW_MOUNT: journal replay will find that tid:B has no commit record and
ignore it. As result file has old mtime, but new content
>
> The lazytime mode reduces pressure on the journal spinlocks, since
> time updates resulting from calls to file_update_time() are almost
> always done using separate jbd2 handles.  For allocating writes, the
> inode will need to be updated anyway when i_blocks change, and so the
> mtime updates will be folded into jbd2 handle in the ext4 write path.
>
> In addition, for workloads feature a large number of random write to a
> preallocated file, the lazytime mount option significantly reduces
> writes to the inode table.  The repeated 4k writes to a single block
> will result in undesirable stress on flash devices and SMR disk
> drives.  Even on conventional HDD's, the repeated writes to the inode
> table block will trigger Adjacent Track Interference (ATI) remediation
> latencies, which very negatively impact 99.9 percentile latencies ---
> which is a very big deal for web serving tiers (for example).
Also sync mtime updates is a great pain for AIO submitter
because AIO submission may be blocked for a seconds (up to 5 second in my case)
if inode is part of current committing transaction see: do_get_write_access
>
> n.b.: because of the many wins of this mode, we may want to enable
> lazytime updates by default in the future.  If you know of use cases
> where having inaccurate mtime values after a crash would be extremely
> problematic, please us know at linux-ext4@...r.kernel.org.
>
> Google-Bug-Id: 18297052
Yeah we also has ticket for that :)
https://jira.sw.ru/browse/PSBM-20411
>
> Signed-off-by: Theodore Ts'o <tytso@....edu>
Patch looks good with minor nodes, please see below.
> ---
>  fs/ext4/ext4.h     |  3 +++
>  fs/ext4/file.c     |  1 +
>  fs/ext4/fsync.c    |  3 +++
>  fs/ext4/inode.c    | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  fs/ext4/namei.c    |  2 ++
>  fs/ext4/super.c    | 14 ++++++++++++
>  fs/ext4/symlink.c  |  2 ++
>  fs/inode.c         | 36 +++++++++++++++++++++++++++++++
>  include/linux/fs.h |  2 ++
>  9 files changed, 124 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index c55a1fa..494c504 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -970,6 +970,7 @@ struct ext4_inode_info {
>  #define EXT4_MOUNT_ERRORS_MASK		0x00070
>  #define EXT4_MOUNT_MINIX_DF		0x00080	/* Mimics the Minix statfs */
>  #define EXT4_MOUNT_NOLOAD		0x00100	/* Don't use existing journal*/
> +#define EXT4_MOUNT_LAZYTIME		0x00200 /* Update the time lazily */
>  #define EXT4_MOUNT_DATA_FLAGS		0x00C00	/* Mode for data writes: */
>  #define EXT4_MOUNT_JOURNAL_DATA		0x00400	/* Write data to journal */
>  #define EXT4_MOUNT_ORDERED_DATA		0x00800	/* Flush data before commit */
> @@ -1407,6 +1408,7 @@ enum {
>  	EXT4_STATE_MAY_INLINE_DATA,	/* may have in-inode data */
>  	EXT4_STATE_ORDERED_MODE,	/* data=ordered mode */
>  	EXT4_STATE_EXT_PRECACHED,	/* extents have been precached */
> +	EXT4_STATE_DIRTY_TIME,		/* the time needs to be updated */
>  };
>  
>  #define EXT4_INODE_BIT_FNS(name, field, offset)				\
> @@ -2114,6 +2116,7 @@ extern int  ext4_write_inode(struct inode *, struct writeback_control *);
>  extern int  ext4_setattr(struct dentry *, struct iattr *);
>  extern int  ext4_getattr(struct vfsmount *mnt, struct dentry *dentry,
>  				struct kstat *stat);
> +extern int ext4_update_time(struct inode *, struct timespec *, int);
>  extern void ext4_evict_inode(struct inode *);
>  extern void ext4_clear_inode(struct inode *);
>  extern int  ext4_sync_inode(handle_t *, struct inode *);
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 8131be8..2cf6aaf 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -603,6 +603,7 @@ const struct file_operations ext4_file_operations = {
>  const struct inode_operations ext4_file_inode_operations = {
>  	.setattr	= ext4_setattr,
>  	.getattr	= ext4_getattr,
> +	.update_time	= ext4_update_time,
>  	.setxattr	= generic_setxattr,
>  	.getxattr	= generic_getxattr,
>  	.listxattr	= ext4_listxattr,
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index a8bc47f..ba05c83 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -116,6 +116,9 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  	ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
>  	if (ret)
>  		return ret;
> +
> +	if (!datasync && ext4_test_inode_state(inode, EXT4_STATE_DIRTY_TIME))
> +		ext4_dirty_inode(inode, 0);
>  	/*
>  	 * data=writeback,ordered:
>  	 *  The caller's filemap_fdatawrite()/wait will sync the data.
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 3356ab5..1b5e4bd 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4163,6 +4163,46 @@ static int ext4_inode_blocks_set(handle_t *handle,
>  }
>  
>  /*
> + * Opportunistically update the other time fields for other inodes in
> + * the same inode table block.
> + */
> +static void ext4_update_other_inodes_time(struct super_block *sb,
> +					  unsigned long orig_ino, char *buf)
> +{
> +	struct ext4_inode_info	*ei;
> +	struct ext4_inode	*raw_inode;
> +	unsigned long		ino;
> +	struct inode		*inode;
> +	int		i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
> +	int		inode_size = EXT4_INODE_SIZE(sb);
> +
> +	ino = orig_ino & ~(inodes_per_block - 1);
> +	for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
> +		if (ino == orig_ino)
> +			continue;
> +		inode = find_active_inode_nowait(sb, ino);
> +		if (!inode ||
> +		    !ext4_test_inode_state(inode, EXT4_STATE_DIRTY_TIME)) {
> +			iput(inode);
> +			continue;
> +		}
> +		raw_inode = (struct ext4_inode *) buf;
> +		ei = EXT4_I(inode);
> +
> +		smp_mb__before_spinlock();
> +		spin_lock(&ei->i_raw_lock);
> +		ext4_clear_inode_state(inode, EXT4_STATE_DIRTY_TIME);
> +		EXT4_INODE_GET_XTIME(i_ctime, inode, raw_inode);
> +		EXT4_INODE_GET_XTIME(i_mtime, inode, raw_inode);
> +		EXT4_INODE_GET_XTIME(i_atime, inode, raw_inode);
> +		EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode);
> +		ext4_inode_csum_set(inode, raw_inode, ei);
> +		spin_unlock(&ei->i_raw_lock);
> +		iput(inode);
> +	}
> +}
> +
> +/*
>   * Post the struct inode info into an on-disk inode location in the
>   * buffer-cache.  This gobbles the caller's reference to the
>   * buffer_head in the inode location struct.
> @@ -4182,7 +4222,9 @@ static int ext4_do_update_inode(handle_t *handle,
>  	uid_t i_uid;
>  	gid_t i_gid;
>  
> +	smp_mb__before_spinlock();
>  	spin_lock(&ei->i_raw_lock);
> +	ext4_clear_inode_state(inode, EXT4_STATE_DIRTY_TIME);
>  
>  	/* For fields not tracked in the in-memory inode,
>  	 * initialise them to zero for new inodes. */
> @@ -4273,8 +4315,8 @@ static int ext4_do_update_inode(handle_t *handle,
>  	}
>  
>  	ext4_inode_csum_set(inode, raw_inode, ei);
> -
>  	spin_unlock(&ei->i_raw_lock);
> +	ext4_update_other_inodes_time(inode->i_sb, inode->i_ino, bh->b_data);
>  
>  	BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
>  	rc = ext4_handle_dirty_metadata(handle, NULL, bh);
> @@ -4622,6 +4664,24 @@ int ext4_getattr(struct vfsmount *mnt, struct dentry *dentry,
>  	return 0;
>  }
>  
> +int ext4_update_time(struct inode *inode, struct timespec *time, int flags)
> +{
> +	if (flags & S_ATIME)
> +		inode->i_atime = *time;
> +	if (flags & S_VERSION)
> +		inode_inc_iversion(inode);
> +	if (flags & S_CTIME)
> +		inode->i_ctime = *time;
> +	if (flags & S_MTIME)
> +		inode->i_mtime = *time;
> +	if (test_opt(inode->i_sb, LAZYTIME)) {
> +		smp_wmb();
> +		ext4_set_inode_state(inode, EXT4_STATE_DIRTY_TIME);
Since we want update all in-memory data we also have to explicitly update inode->i_version
Which was previously updated implicitly here:
mark_inode_dirty_sync()
->__mark_inode_dirty
  ->ext4_dirty_inode
    ->ext4_mark_inode_dirty
      ->ext4_mark_iloc_dirty
        ->inode_inc_iversion(inode);
                
> +	} else
> +		mark_inode_dirty_sync(inode);
> +	return 0;
> +}

> +
>  static int ext4_index_trans_blocks(struct inode *inode, int lblocks,
>  				   int pextents)
>  {
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 4262118..f782040 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -3532,6 +3532,7 @@ const struct inode_operations ext4_dir_inode_operations = {
>  	.tmpfile	= ext4_tmpfile,
>  	.rename2	= ext4_rename2,
>  	.setattr	= ext4_setattr,
> +	.update_time	= ext4_update_time,
>  	.setxattr	= generic_setxattr,
>  	.getxattr	= generic_getxattr,
>  	.listxattr	= ext4_listxattr,
> @@ -3545,6 +3546,7 @@ const struct inode_operations ext4_special_inode_operations = {
>  	.setattr	= ext4_setattr,
>  	.setxattr	= generic_setxattr,
>  	.getxattr	= generic_getxattr,
> +	.update_time	= ext4_update_time,
>  	.listxattr	= ext4_listxattr,
>  	.removexattr	= generic_removexattr,
>  	.get_acl	= ext4_get_acl,
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 2c9e686..16c9983 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -910,6 +910,14 @@ static int ext4_drop_inode(struct inode *inode)
>  	int drop = generic_drop_inode(inode);
>  
>  	trace_ext4_drop_inode(inode, drop);
> +	if (!drop && ext4_test_inode_state(inode, EXT4_STATE_DIRTY_TIME)) {
> +		atomic_inc(&inode->i_count);
> +		spin_unlock(&inode->i_lock);
> +		ext4_dirty_inode(inode, 0);
> +		spin_lock(&inode->i_lock);
> +		if (atomic_dec_and_test(&inode->i_count))
> +			drop = generic_drop_inode(inode);
> +	}
>  	return drop;
>  }
>  
> @@ -1142,6 +1150,7 @@ enum {
>  	Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
>  	Opt_usrquota, Opt_grpquota, Opt_i_version,
>  	Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
> +	Opt_lazytime, Opt_nolazytime,
>  	Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
>  	Opt_inode_readahead_blks, Opt_journal_ioprio,
>  	Opt_dioread_nolock, Opt_dioread_lock,
> @@ -1204,6 +1213,8 @@ static const match_table_t tokens = {
>  	{Opt_i_version, "i_version"},
>  	{Opt_stripe, "stripe=%u"},
>  	{Opt_delalloc, "delalloc"},
> +	{Opt_lazytime, "lazytime"},
> +	{Opt_nolazytime, "nolazytime"},
>  	{Opt_nodelalloc, "nodelalloc"},
>  	{Opt_removed, "mblk_io_submit"},
>  	{Opt_removed, "nomblk_io_submit"},
> @@ -1361,6 +1372,8 @@ static const struct mount_opts {
>  	 MOPT_EXT4_ONLY | MOPT_SET | MOPT_EXPLICIT},
>  	{Opt_nodelalloc, EXT4_MOUNT_DELALLOC,
>  	 MOPT_EXT4_ONLY | MOPT_CLEAR},
> +	{Opt_lazytime, EXT4_MOUNT_LAZYTIME, MOPT_SET},
> +	{Opt_nolazytime, EXT4_MOUNT_LAZYTIME, MOPT_CLEAR},
>  	{Opt_journal_checksum, EXT4_MOUNT_JOURNAL_CHECKSUM,
>  	 MOPT_EXT4_ONLY | MOPT_SET},
>  	{Opt_journal_async_commit, (EXT4_MOUNT_JOURNAL_ASYNC_COMMIT |
> @@ -3514,6 +3527,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  
>  	/* Set defaults before we parse the mount options */
>  	def_mount_opts = le32_to_cpu(es->s_default_mount_opts);
> +	set_opt(sb, LAZYTIME);
>  	set_opt(sb, INIT_INODE_TABLE);
>  	if (def_mount_opts & EXT4_DEFM_DEBUG)
>  		set_opt(sb, DEBUG);
> diff --git a/fs/ext4/symlink.c b/fs/ext4/symlink.c
> index ff37119..7c92b93 100644
> --- a/fs/ext4/symlink.c
> +++ b/fs/ext4/symlink.c
> @@ -35,6 +35,7 @@ const struct inode_operations ext4_symlink_inode_operations = {
>  	.follow_link	= page_follow_link_light,
>  	.put_link	= page_put_link,
>  	.setattr	= ext4_setattr,
> +	.update_time	= ext4_update_time,
>  	.setxattr	= generic_setxattr,
>  	.getxattr	= generic_getxattr,
>  	.listxattr	= ext4_listxattr,
> @@ -45,6 +46,7 @@ const struct inode_operations ext4_fast_symlink_inode_operations = {
>  	.readlink	= generic_readlink,
>  	.follow_link	= ext4_follow_link,
>  	.setattr	= ext4_setattr,
> +	.update_time	= ext4_update_time,
>  	.setxattr	= generic_setxattr,
>  	.getxattr	= generic_getxattr,
>  	.listxattr	= ext4_listxattr,
> diff --git a/fs/inode.c b/fs/inode.c
> index 26753ba..cde073a 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1280,6 +1280,42 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino)
>  }
>  EXPORT_SYMBOL(ilookup);
>  
> +/**
> + * find_active_inode_nowait - find an active inode in the inode cache
> + * @sb:		super block of file system to search
> + * @ino:	inode number to search for
> + *
> + * Search for an active inode @ino in the inode cache, and if the
> + * inode is in the cache, the inode is returned with an incremented
> + * reference count.  If the inode is being freed or is newly
> + * initialized, return nothing instead of trying to wait for the inode
> + * initialization or destruction to be complete.
> + */
> +struct inode *find_active_inode_nowait(struct super_block *sb,
> +				       unsigned long ino)
> +{
> +	struct hlist_head *head = inode_hashtable + hash(sb, ino);
> +	struct inode *inode, *ret_inode = NULL;
> +
> +	spin_lock(&inode_hash_lock);
> +	hlist_for_each_entry(inode, head, i_hash) {
> +		if ((inode->i_ino != ino) ||
> +		    (inode->i_sb != sb))
> +			continue;
> +		spin_lock(&inode->i_lock);
> +		if ((inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) == 0) {
> +			__iget(inode);
> +			ret_inode = inode;
> +		}
> +		spin_unlock(&inode->i_lock);
> +		goto out;
> +	}
> +out:
> +	spin_unlock(&inode_hash_lock);
> +	return ret_inode;
> +}
> +EXPORT_SYMBOL(find_active_inode_nowait);
> +
>  int insert_inode_locked(struct inode *inode)
>  {
>  	struct super_block *sb = inode->i_sb;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 9ab779e..b5e6b6b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2410,6 +2410,8 @@ extern struct inode *ilookup(struct super_block *sb, unsigned long ino);
>  
>  extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
>  extern struct inode * iget_locked(struct super_block *, unsigned long);
> +extern struct inode *find_active_inode_nowait(struct super_block *,
> +					      unsigned long);
>  extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struct inode *, void *), void *);
>  extern int insert_inode_locked(struct inode *);
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
> -- 
> 2.1.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Download attachment "signature.asc" of type "application/pgp-signature" (473 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ