[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Zn1/FVS4NrAwEBwz@tissot.1015granger.net>
Date: Thu, 27 Jun 2024 11:02:45 -0400
From: Chuck Lever <chuck.lever@...cle.com>
To: Jeff Layton <jlayton@...nel.org>
Cc: Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>,
Steven Rostedt <rostedt@...dmis.org>,
Masami Hiramatsu <mhiramat@...nel.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Chandan Babu R <chandan.babu@...cle.com>,
"Darrick J. Wong" <djwong@...nel.org>, Theodore Ts'o <tytso@....edu>,
Andreas Dilger <adilger.kernel@...ger.ca>, Chris Mason <clm@...com>,
Josef Bacik <josef@...icpanda.com>, David Sterba <dsterba@...e.com>,
Hugh Dickins <hughd@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>, kernel-team@...com,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-trace-kernel@...r.kernel.org, linux-xfs@...r.kernel.org,
linux-ext4@...r.kernel.org, linux-btrfs@...r.kernel.org,
linux-mm@...ck.org, linux-nfs@...r.kernel.org
Subject: Re: [PATCH 04/10] fs: add infrastructure for multigrain timestamps
On Wed, Jun 26, 2024 at 09:00:24PM -0400, Jeff Layton wrote:
> The VFS always uses coarse-grained timestamps when updating the ctime
> and mtime after a change. This has the benefit of allowing filesystems
> to optimize away a lot metadata updates, down to around 1 per jiffy,
> even when a file is under heavy writes.
>
> Unfortunately, this has always been an issue when we're exporting via
> NFSv3, which relies on timestamps to validate caches. A lot of changes
> can happen in a jiffy, so timestamps aren't sufficient to help the
> client decide to invalidate the cache. Even with NFSv4, a lot of
> exported filesystems don't properly support a change attribute and are
> subject to the same problems with timestamp granularity. Other
> applications have similar issues with timestamps (e.g backup
> applications).
>
> If we were to always use fine-grained timestamps, that would improve the
> situation, but that becomes rather expensive, as the underlying
> filesystem would have to log a lot more metadata updates.
>
> What we need is a way to only use fine-grained timestamps when they are
> being actively queried. Now that the ctime is stored as a ktime_t, we
> can sacrifice the lowest bit in the word to act as a flag marking
> whether the current timestamp has been queried via stat() or the like.
>
> This solves the problem of being able to distinguish the timestamp
> between updates, but introduces a new problem: it's now possible for a
> file being changed to get a fine-grained timestamp and then a file that
> was altered later to get a coarse-grained one that appears older than
> the earlier fine-grained time. To remedy this, keep a global ktime_t
> value that acts as a timestamp floor.
>
> When we go to stamp a file, we first get the latter of the current floor
> value and the current coarse-grained time (call this "now"). If the
> current inode ctime hasn't been queried then we just attempt to stamp it
> with that value using a cmpxchg() operation.
>
> If it has been queried, then first see whether the current coarse time
> appears later than what we have. If it does, then we accept that value.
> If it doesn't, then we get a fine-grained time and try to swap that into
> the global floor. Whether that succeeds or fails, we take the resulting
> floor time and try to swap that into the ctime.
>
> There is still one remaining problem:
>
> All of this works as long as the realtime clock is monotonically
> increasing. If the clock ever jumps backwards, then we could end up in a
> situation where the floor value is "stuck" far in advance of the clock.
>
> To remedy this, sanity check the floor value and if it's more than 6ms
> (~2 jiffies) ahead of the current coarse-grained clock, disregard the
> floor value, and just accept the current coarse-grained clock.
>
> Filesystems opt into this by setting the FS_MGTIME fstype flag. One
> caveat: those that do will always present ctimes that have the lowest
> bit unset, even when the on-disk ctime has it set.
>
> Signed-off-by: Jeff Layton <jlayton@...nel.org>
> ---
> fs/inode.c | 168 +++++++++++++++++++++++++++++++++------
> fs/stat.c | 39 ++++++++-
> include/linux/fs.h | 30 +++++++
> include/trace/events/timestamp.h | 97 ++++++++++++++++++++++
> 4 files changed, 306 insertions(+), 28 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 5d2b0dfe48c3..12790a26102c 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -62,6 +62,8 @@ static unsigned int i_hash_shift __ro_after_init;
> static struct hlist_head *inode_hashtable __ro_after_init;
> static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
>
> +/* Don't send out a ctime lower than this (modulo backward clock jumps). */
> +static __cacheline_aligned_in_smp ktime_t ctime_floor;
This is piece of memory that will be hit pretty hard (and you
obviously recognize that because of the alignment attribute).
Would it be of any benefit to keep a distinct ctime_floor in each
super block instead?
> /*
> * Empty aops. Can be used for the cases where the user does not
> * define any of the address_space operations.
> @@ -2077,19 +2079,86 @@ int file_remove_privs(struct file *file)
> }
> EXPORT_SYMBOL(file_remove_privs);
>
> +/*
> + * The coarse-grained clock ticks once per jiffy (every 2ms or so). If the
> + * current floor is >6ms in the future, assume that the clock has jumped
> + * backward.
> + */
> +#define CTIME_FLOOR_MAX_NS 6000000
> +
> +/**
> + * coarse_ctime - return the current coarse-grained time
> + * @floor: current ctime_floor value
> + *
> + * Get the coarse-grained time, and then determine whether to
> + * return it or the current floor value. Returns the later of the
> + * floor and coarse grained time, unless the floor value is too
> + * far into the future. If that happens, assume the clock has jumped
> + * backward, and that the floor should be ignored.
> + */
> +static ktime_t coarse_ctime(ktime_t floor)
> +{
> + ktime_t now = ktime_get_coarse_real() & ~I_CTIME_QUERIED;
> +
> + /* If coarse time is already newer, return that */
> + if (ktime_before(floor, now))
> + return now;
> +
> + /* Ensure floor is not _too_ far in the future */
> + if (ktime_after(floor, now + CTIME_FLOOR_MAX_NS))
> + return now;
> +
> + return floor;
> +}
> +
> +/**
> + * current_time - Return FS time (possibly fine-grained)
> + * @inode: inode.
> + *
> + * Return the current time truncated to the time granularity supported by
> + * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
> + * as having been QUERIED, get a fine-grained timestamp.
> + */
> +struct timespec64 current_time(struct inode *inode)
> +{
> + ktime_t ctime, floor = smp_load_acquire(&ctime_floor);
> + ktime_t now = coarse_ctime(floor);
> + struct timespec64 now_ts = ktime_to_timespec64(now);
> +
> + if (!is_mgtime(inode))
> + goto out;
> +
> + /* If nothing has queried it, then coarse time is fine */
> + ctime = smp_load_acquire(&inode->__i_ctime);
> + if (ctime & I_CTIME_QUERIED) {
> + /*
> + * If there is no apparent change, then
> + * get a fine-grained timestamp.
> + */
> + if ((now | I_CTIME_QUERIED) == ctime) {
> + ktime_get_real_ts64(&now_ts);
> + now_ts.tv_nsec &= ~I_CTIME_QUERIED;
> + }
> + }
> +out:
> + return timestamp_truncate(now_ts, inode);
> +}
> +EXPORT_SYMBOL(current_time);
> +
> static int inode_needs_update_time(struct inode *inode)
> {
> + struct timespec64 now, ts;
> int sync_it = 0;
> - struct timespec64 now = current_time(inode);
> - struct timespec64 ts;
>
> /* First try to exhaust all avenues to not sync */
> if (IS_NOCMTIME(inode))
> return 0;
>
> + now = current_time(inode);
> +
> ts = inode_get_mtime(inode);
> if (!timespec64_equal(&ts, &now))
> - sync_it = S_MTIME;
> + sync_it |= S_MTIME;
>
> ts = inode_get_ctime(inode);
> if (!timespec64_equal(&ts, &now))
> @@ -2485,25 +2554,6 @@ struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode)
> }
> EXPORT_SYMBOL(timestamp_truncate);
>
> -/**
> - * current_time - Return FS time
> - * @inode: inode.
> - *
> - * Return the current time truncated to the time granularity supported by
> - * the fs.
> - *
> - * Note that inode and inode->sb cannot be NULL.
> - * Otherwise, the function warns and returns time without truncation.
> - */
> -struct timespec64 current_time(struct inode *inode)
> -{
> - struct timespec64 now;
> -
> - ktime_get_coarse_real_ts64(&now);
> - return timestamp_truncate(now, inode);
> -}
> -EXPORT_SYMBOL(current_time);
> -
> /**
> * inode_get_ctime - fetch the current ctime from the inode
> * @inode: inode from which to fetch ctime
> @@ -2518,12 +2568,18 @@ struct timespec64 inode_get_ctime(const struct inode *inode)
> {
> ktime_t ctime = inode->__i_ctime;
>
> + if (is_mgtime(inode))
> + ctime &= ~I_CTIME_QUERIED;
> return ktime_to_timespec64(ctime);
> }
> EXPORT_SYMBOL(inode_get_ctime);
>
> struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts)
> {
> + trace_inode_set_ctime_to_ts(inode, &ts);
> +
> + if (is_mgtime(inode))
> + ts.tv_nsec &= ~I_CTIME_QUERIED;
> inode->__i_ctime = ktime_set(ts.tv_sec, ts.tv_nsec);
> trace_inode_set_ctime_to_ts(inode, &ts);
> return ts;
> @@ -2535,14 +2591,74 @@ EXPORT_SYMBOL(inode_set_ctime_to_ts);
> * @inode: inode
> *
> * Set the inode->i_ctime to the current value for the inode. Returns
> - * the current value that was assigned to i_ctime.
> + * the current value that was assigned to i_ctime. If this is a not
> + * multigrain inode, then we just set it to whatever the coarse time is.
> + *
> + * If it is multigrain, then we first see if the coarse-grained
> + * timestamp is distinct from what we have. If so, then we'll just use
> + * that. If we have to get a fine-grained timestamp, then do so, and
> + * try to swap it into the floor. We accept the new floor value
> + * regardless of the outcome of the cmpxchg. After that, we try to
> + * swap the new value into __i_ctime. Again, we take the resulting
> + * ctime, regardless of the outcome of the swap.
> */
> struct timespec64 inode_set_ctime_current(struct inode *inode)
> {
> - struct timespec64 now = current_time(inode);
> + ktime_t ctime, now, cur, floor = smp_load_acquire(&ctime_floor);
> +
> + now = coarse_ctime(floor);
>
> - inode_set_ctime_to_ts(inode, now);
> - return now;
> + /* Just return that if this is not a multigrain fs */
> + if (!is_mgtime(inode)) {
> + inode->__i_ctime = now;
> + goto out;
> + }
> +
> + /*
> + * We only need a fine-grained time if someone has queried it,
> + * and the current coarse grained time isn't later than what's
> + * already there.
> + */
> + ctime = smp_load_acquire(&inode->__i_ctime);
> + if ((ctime & I_CTIME_QUERIED) && !ktime_after(now, ctime & ~I_CTIME_QUERIED)) {
> + ktime_t old;
> +
> + /* Get a fine-grained time */
> + now = ktime_get_real() & ~I_CTIME_QUERIED;
> +
> + /*
> + * If the cmpxchg works, we take the new floor value. If
> + * not, then that means that someone else changed it after we
> + * fetched it but before we got here. That value is just
> + * as good, so keep it.
> + */
> + old = cmpxchg(&ctime_floor, floor, now);
> + trace_ctime_floor_update(inode, floor, now, old);
> + if (old != floor)
> + now = old;
> + }
> +retry:
> + /* Try to swap the ctime into place. */
> + cur = cmpxchg(&inode->__i_ctime, ctime, now);
> + trace_ctime_inode_update(inode, ctime, now, cur);
> +
> + /* If swap occurred, then we're done */
> + if (cur != ctime) {
> + /*
> + * Was the change due to someone marking the old ctime QUERIED?
> + * If so then retry the swap. This can only happen once since
> + * the only way to clear I_CTIME_QUERIED is to stamp the inode
> + * with a new ctime.
> + */
> + if (!(ctime & I_CTIME_QUERIED) && (ctime | I_CTIME_QUERIED) == cur) {
> + ctime = cur;
> + goto retry;
> + }
> + /* Otherwise, take the new ctime */
> + now = cur & ~I_CTIME_QUERIED;
> + }
> +out:
> + return timestamp_truncate(ktime_to_timespec64(now), inode);
> }
> EXPORT_SYMBOL(inode_set_ctime_current);
>
> diff --git a/fs/stat.c b/fs/stat.c
> index 6f65b3456cad..7e9bd16b553b 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -22,10 +22,39 @@
>
> #include <linux/uaccess.h>
> #include <asm/unistd.h>
> +#include <trace/events/timestamp.h>
>
> #include "internal.h"
> #include "mount.h"
>
> +/**
> + * fill_mg_cmtime - Fill in the mtime and ctime and flag ctime as QUERIED
> + * @stat: where to store the resulting values
> + * @request_mask: STATX_* values requested
> + * @inode: inode from which to grab the c/mtime
> + *
> + * Given @inode, grab the ctime and mtime out if it and store the result
> + * in @stat. When fetching the value, flag it as queried so the next write
> + * will ensure a distinct timestamp.
> + */
> +void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode)
> +{
> + atomic_long_t *pc = (atomic_long_t *)&inode->__i_ctime;
> +
> + /* If neither time was requested, then don't report them */
> + if (!(request_mask & (STATX_CTIME|STATX_MTIME))) {
> + stat->result_mask &= ~(STATX_CTIME|STATX_MTIME);
> + return;
> + }
> +
> + stat->mtime.tv_sec = inode->i_mtime_sec;
> + stat->mtime.tv_nsec = inode->i_mtime_nsec;
> + stat->ctime = ktime_to_timespec64(atomic_long_fetch_or(I_CTIME_QUERIED, pc) &
> + ~I_CTIME_QUERIED);
> + trace_fill_mg_cmtime(inode, atomic_long_read(pc));
> +}
> +EXPORT_SYMBOL(fill_mg_cmtime);
> +
> /**
> * generic_fillattr - Fill in the basic attributes from the inode struct
> * @idmap: idmap of the mount the inode was found from
> @@ -58,8 +87,14 @@ void generic_fillattr(struct mnt_idmap *idmap, u32 request_mask,
> stat->rdev = inode->i_rdev;
> stat->size = i_size_read(inode);
> stat->atime = inode_get_atime(inode);
> - stat->mtime = inode_get_mtime(inode);
> - stat->ctime = inode_get_ctime(inode);
> +
> + if (is_mgtime(inode)) {
> + fill_mg_cmtime(stat, request_mask, inode);
> + } else {
> + stat->ctime = inode_get_ctime(inode);
> + stat->mtime = inode_get_mtime(inode);
> + }
> +
> stat->blksize = i_blocksize(inode);
> stat->blocks = inode->i_blocks;
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 4b10db12725d..5694cb6c4dc2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1608,6 +1608,23 @@ static inline struct timespec64 inode_set_mtime(struct inode *inode,
> return inode_set_mtime_to_ts(inode, ts);
> }
>
> +/*
> + * Multigrain timestamps
> + *
> + * Conditionally use fine-grained ctime and mtime timestamps when there
> + * are users actively observing them via getattr. The primary use-case
> + * for this is NFS clients that use the ctime to distinguish between
> + * different states of the file, and that are often fooled by multiple
> + * operations that occur in the same coarse-grained timer tick.
> + *
> + * We use the least significant bit of the ktime_t to track the QUERIED
> + * flag. This means that filesystems with multigrain timestamps effectively
> + * have 2ns resolution for the ctime, even if they advertise 1ns s_time_gran.
> + */
> +#define I_CTIME_QUERIED (1LL)
> +
> +static inline bool is_mgtime(const struct inode *inode);
> +
> struct timespec64 inode_get_ctime(const struct inode *inode);
> struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts);
>
> @@ -2477,6 +2494,7 @@ struct file_system_type {
> #define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
> #define FS_DISALLOW_NOTIFY_PERM 16 /* Disable fanotify permission events */
> #define FS_ALLOW_IDMAP 32 /* FS has been updated to handle vfs idmappings. */
> +#define FS_MGTIME 64 /* FS uses multigrain timestamps */
> #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
> int (*init_fs_context)(struct fs_context *);
> const struct fs_parameter_spec *parameters;
> @@ -2500,6 +2518,17 @@ struct file_system_type {
>
> #define MODULE_ALIAS_FS(NAME) MODULE_ALIAS("fs-" NAME)
>
> +/**
> + * is_mgtime: is this inode using multigrain timestamps
> + * @inode: inode to test for multigrain timestamps
> + *
> + * Return true if the inode uses multigrain timestamps, false otherwise.
> + */
> +static inline bool is_mgtime(const struct inode *inode)
> +{
> + return inode->i_sb->s_type->fs_flags & FS_MGTIME;
> +}
> +
> extern struct dentry *mount_bdev(struct file_system_type *fs_type,
> int flags, const char *dev_name, void *data,
> int (*fill_super)(struct super_block *, void *, int));
> @@ -3234,6 +3263,7 @@ extern void page_put_link(void *);
> extern int page_symlink(struct inode *inode, const char *symname, int len);
> extern const struct inode_operations page_symlink_inode_operations;
> extern void kfree_link(void *);
> +void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode);
> void generic_fillattr(struct mnt_idmap *, u32, struct inode *, struct kstat *);
> void generic_fill_statx_attr(struct inode *inode, struct kstat *stat);
> extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int);
> diff --git a/include/trace/events/timestamp.h b/include/trace/events/timestamp.h
> index 35ff875d3800..1f71738aa38c 100644
> --- a/include/trace/events/timestamp.h
> +++ b/include/trace/events/timestamp.h
> @@ -8,6 +8,78 @@
> #include <linux/tracepoint.h>
> #include <linux/fs.h>
>
> +TRACE_EVENT(ctime_floor_update,
> + TP_PROTO(struct inode *inode,
> + ktime_t old,
> + ktime_t new,
> + ktime_t cur),
> +
> + TP_ARGS(inode, old, new, cur),
> +
> + TP_STRUCT__entry(
> + __field(dev_t, dev)
> + __field(ino_t, ino)
> + __field(ktime_t, old)
> + __field(ktime_t, new)
> + __field(ktime_t, cur)
> + ),
> +
> + TP_fast_assign(
> + __entry->dev = inode->i_sb->s_dev;
> + __entry->ino = inode->i_ino;
> + __entry->old = old;
> + __entry->new = new;
> + __entry->cur = cur;
> + ),
> +
> + TP_printk("ino=%d:%d:%lu old=%llu.%lu new=%llu.%lu cur=%llu.%lu swp=%c",
> + MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino,
> + ktime_to_timespec64(__entry->old).tv_sec,
> + ktime_to_timespec64(__entry->old).tv_nsec,
> + ktime_to_timespec64(__entry->new).tv_sec,
> + ktime_to_timespec64(__entry->new).tv_nsec,
> + ktime_to_timespec64(__entry->cur).tv_sec,
> + ktime_to_timespec64(__entry->cur).tv_nsec,
> + (__entry->old == __entry->cur) ? 'Y' : 'N'
> + )
> +);
> +
> +TRACE_EVENT(ctime_inode_update,
> + TP_PROTO(struct inode *inode,
> + ktime_t old,
> + ktime_t new,
> + ktime_t cur),
> +
> + TP_ARGS(inode, old, new, cur),
> +
> + TP_STRUCT__entry(
> + __field(dev_t, dev)
> + __field(ino_t, ino)
> + __field(ktime_t, old)
> + __field(ktime_t, new)
> + __field(ktime_t, cur)
> + ),
> +
> + TP_fast_assign(
> + __entry->dev = inode->i_sb->s_dev;
> + __entry->ino = inode->i_ino;
> + __entry->old = old;
> + __entry->new = new;
> + __entry->cur = cur;
> + ),
> +
> + TP_printk("ino=%d:%d:%ld old=%llu.%ld new=%llu.%ld cur=%llu.%ld swp=%c",
> + MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino,
> + ktime_to_timespec64(__entry->old).tv_sec,
> + ktime_to_timespec64(__entry->old).tv_nsec,
> + ktime_to_timespec64(__entry->new).tv_sec,
> + ktime_to_timespec64(__entry->new).tv_nsec,
> + ktime_to_timespec64(__entry->cur).tv_sec,
> + ktime_to_timespec64(__entry->cur).tv_nsec,
> + (__entry->old == __entry->cur ? 'Y' : 'N')
> + )
> +);
> +
> TRACE_EVENT(inode_needs_update_time,
> TP_PROTO(struct inode *inode,
> struct timespec64 *now,
> @@ -70,6 +142,31 @@ TRACE_EVENT(inode_set_ctime_to_ts,
> __entry->ts_sec, __entry->ts_nsec
> )
> );
> +
> +TRACE_EVENT(fill_mg_cmtime,
> + TP_PROTO(struct inode *inode,
> + ktime_t ctime),
> +
> + TP_ARGS(inode, ctime),
> +
> + TP_STRUCT__entry(
> + __field(dev_t, dev)
> + __field(ino_t, ino)
> + __field(ktime_t, ctime)
> + ),
> +
> + TP_fast_assign(
> + __entry->dev = inode->i_sb->s_dev;
> + __entry->ino = inode->i_ino;
> + __entry->ctime = ctime;
> + ),
> +
> + TP_printk("ino=%d:%d:%ld ctime=%llu.%lu",
> + MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino,
> + ktime_to_timespec64(__entry->ctime).tv_sec,
> + ktime_to_timespec64(__entry->ctime).tv_nsec
> + )
> +);
> #endif /* _TRACE_TIMESTAMP_H */
>
> /* This part must be outside protection */
>
> --
> 2.45.2
>
>
--
Chuck Lever
Powered by blists - more mailing lists