[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20141125015239.GD27262@dastard>
Date: Tue, 25 Nov 2014 12:52:39 +1100
From: Dave Chinner <david@...morbit.com>
To: Theodore Ts'o <tytso@....edu>
Cc: linux-fsdevel@...r.kernel.org,
Ext4 Developers List <linux-ext4@...r.kernel.org>,
linux-btrfs@...r.kernel.org, xfs@....sgi.com
Subject: Re: [PATCH 2/4] vfs: add support for a lazytime mount option
On Fri, Nov 21, 2014 at 02:59:22PM -0500, Theodore Ts'o wrote:
> Add a new mount option which enables a new "lazytime" mode. This mode
> causes atime, mtime, and ctime updates to only be made to the
> in-memory version of the inode. The on-disk times will only get
> updated when (a) if the inode needs to be updated for some non-time
> related change, (b) if userspace calls fsync(), syncfs() or sync(), or
> (c) just before an undeleted inode is evicted from memory.
>
> This is OK according to POSIX because there are no guarantees after a
> crash unless userspace explicitly requests via a fsync(2) call.
>
> For workloads which feature a large number of random write to a
> preallocated file, the lazytime mount option significantly reduces
> writes to the inode table. The repeated 4k writes to a single block
> will result in undesirable stress on flash devices and SMR disk
> drives. Even on conventional HDD's, the repeated writes to the inode
> table block will trigger Adjacent Track Interference (ATI) remediation
> latencies, which very negatively impact 99.9 percentile latencies ---
> which is a very big deal for web serving tiers (for example).
>
> Google-Bug-Id: 18297052
>
> Signed-off-by: Theodore Ts'o <tytso@....edu>
> ---
> fs/fs-writeback.c | 38 +++++++++++++++++++++++++++++++++++++-
> fs/inode.c | 18 ++++++++++++++++++
> fs/proc_namespace.c | 1 +
> fs/sync.c | 7 +++++++
> include/linux/fs.h | 1 +
> include/uapi/linux/fs.h | 1 +
> 6 files changed, 65 insertions(+), 1 deletion(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index ef9bef1..ce7de22 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -483,7 +483,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
> if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
> inode->i_state &= ~I_DIRTY_PAGES;
> dirty = inode->i_state & I_DIRTY;
> - inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
> + inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_TIME);
> spin_unlock(&inode->i_lock);
> /* Don't write the inode if only I_DIRTY_PAGES was set */
> if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
> @@ -1277,6 +1277,41 @@ static void wait_sb_inodes(struct super_block *sb)
> iput(old_inode);
> }
>
> +/*
> + * This works like wait_sb_inodes(), but it is called *before* we kick
> + * the bdi so the inodes can get written out.
> + */
> +static void flush_sb_dirty_time(struct super_block *sb)
> +{
> + struct inode *inode, *old_inode = NULL;
> +
> + WARN_ON(!rwsem_is_locked(&sb->s_umount));
> + spin_lock(&inode_sb_list_lock);
> + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> + int dirty_time;
> +
> + spin_lock(&inode->i_lock);
> + if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> + spin_unlock(&inode->i_lock);
> + continue;
> + }
> + dirty_time = inode->i_state & I_DIRTY_TIME;
> + __iget(inode);
> + spin_unlock(&inode->i_lock);
> + spin_unlock(&inode_sb_list_lock);
> +
> + iput(old_inode);
> + old_inode = inode;
> +
> + if (dirty_time)
> + mark_inode_dirty(inode);
> + cond_resched();
> + spin_lock(&inode_sb_list_lock);
> + }
> + spin_unlock(&inode_sb_list_lock);
> + iput(old_inode);
> +}
This just seems wrong to me, not to mention extremely expensive when we have
millions of cached inodes on the superblock.
Why can't we just add a function like mark_inode_dirty_time() which
puts the inode on the dirty inode list with a writeback time 24
hours in the future rather than 30s in the future?
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -534,6 +534,18 @@ static void evict(struct inode *inode)
> BUG_ON(!(inode->i_state & I_FREEING));
> BUG_ON(!list_empty(&inode->i_lru));
>
> + if (inode->i_nlink && inode->i_state & I_DIRTY_TIME) {
> + if (inode->i_op->write_time)
> + inode->i_op->write_time(inode);
> + else if (inode->i_sb->s_op->write_inode) {
> + struct writeback_control wbc = {
> + .sync_mode = WB_SYNC_NONE,
> + };
> + mark_inode_dirty(inode);
> + inode->i_sb->s_op->write_inode(inode, &wbc);
> + }
> + }
> +
Eviction is too late for this. I'm pretty sure that it won't get
this far as iput_final() should catch the I_DIRTY_TIME in the !drop
case via write_inode_now().
> int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
> {
> + struct inode *inode = file->f_mapping->host;
> +
> if (!file->f_op->fsync)
> return -EINVAL;
> + if (!datasync && inode->i_state & I_DIRTY_TIME) {
FWIW, I'm surprised gcc isn't throwing warnings about that. Please
make sure there isn't any ambiguity in the code by writing those
checks like this:
if (!datasync && (inode->i_state & I_DIRTY_TIME)) {
> + spin_lock(&inode->i_lock);
> + inode->i_state |= I_DIRTY_SYNC;
> + spin_unlock(&inode->i_lock);
> + }
> return file->f_op->fsync(file, start, end, datasync);
When we mark the inode I_DIRTY_TIME, we should also be marking it
I_DIRTY_SYNC so that all the sync operations know that they should
be writing this inode. That's partly why I also think these inodes
should be tracked on the dirty inode list....
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3633239..489b2f2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1721,6 +1721,7 @@ struct super_operations {
> #define __I_DIO_WAKEUP 9
> #define I_DIO_WAKEUP (1 << I_DIO_WAKEUP)
> #define I_LINKABLE (1 << 10)
> +#define I_DIRTY_TIME (1 << 11)
>
> #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
Shouldn't I_DIRTY also include I_DIRTY_TIME now?
--
Dave Chinner
david@...morbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists