linux-kernel - Re: [PATCH v8 02/12] fs: add infrastructure for multigrain timestamps

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7761de29d15df87a29575de57554b56a91ae55a0.camel@kernel.org>
Date: Tue, 01 Oct 2024 09:34:18 -0400
From: Jeff Layton <jlayton@...nel.org>
To: Jan Kara <jack@...e.cz>
Cc: John Stultz <jstultz@...gle.com>, Thomas Gleixner <tglx@...utronix.de>, 
 Stephen Boyd <sboyd@...nel.org>, Alexander Viro <viro@...iv.linux.org.uk>,
 Christian Brauner <brauner@...nel.org>, Steven Rostedt
 <rostedt@...dmis.org>, Masami Hiramatsu <mhiramat@...nel.org>, Mathieu
 Desnoyers <mathieu.desnoyers@...icios.com>,  Jonathan Corbet
 <corbet@....net>, Randy Dunlap <rdunlap@...radead.org>, Chandan Babu R
 <chandan.babu@...cle.com>, "Darrick J. Wong" <djwong@...nel.org>, Theodore
 Ts'o <tytso@....edu>, Andreas Dilger <adilger.kernel@...ger.ca>, Chris
 Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>, David Sterba
 <dsterba@...e.com>,  Hugh Dickins <hughd@...gle.com>, Andrew Morton
 <akpm@...ux-foundation.org>, Chuck Lever <chuck.lever@...cle.com>, Vadim
 Fedorenko <vadim.fedorenko@...ux.dev>,  linux-kernel@...r.kernel.org,
 linux-fsdevel@...r.kernel.org,  linux-trace-kernel@...r.kernel.org,
 linux-doc@...r.kernel.org,  linux-xfs@...r.kernel.org,
 linux-ext4@...r.kernel.org,  linux-btrfs@...r.kernel.org,
 linux-nfs@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v8 02/12] fs: add infrastructure for multigrain
 timestamps

On Tue, 2024-10-01 at 15:20 +0200, Jan Kara wrote:
> On Tue 01-10-24 06:58:56, Jeff Layton wrote:
> > The VFS has always used coarse-grained timestamps when updating the
> > ctime and mtime after a change. This has the benefit of allowing
> > filesystems to optimize away a lot metadata updates, down to around 1
> > per jiffy, even when a file is under heavy writes.
> > 
> > Unfortunately, this has always been an issue when we're exporting via
> > NFSv3, which relies on timestamps to validate caches. A lot of changes
> > can happen in a jiffy, so timestamps aren't sufficient to help the
> > client decide when to invalidate the cache. Even with NFSv4, a lot of
> > exported filesystems don't properly support a change attribute and are
> > subject to the same problems with timestamp granularity. Other
> > applications have similar issues with timestamps (e.g backup
> > applications).
> > 
> > If we were to always use fine-grained timestamps, that would improve the
> > situation, but that becomes rather expensive, as the underlying
> > filesystem would have to log a lot more metadata updates.
> > 
> > What we need is a way to only use fine-grained timestamps when they are
> > being actively queried. Use the (unused) top bit in inode->i_ctime_nsec
> > as a flag that indicates whether the current timestamps have been
> > queried via stat() or the like. When it's set, we allow the kernel to
> > use a fine-grained timestamp iff it's necessary to make the ctime show
> > a different value.
> > 
> > This solves the problem of being able to distinguish the timestamp
> > between updates, but introduces a new problem: it's now possible for a
> > file being changed to get a fine-grained timestamp. A file that is
> > altered just a bit later can then get a coarse-grained one that appears
> > older than the earlier fine-grained time. This violates timestamp
> > ordering guarantees.
> > 
> > To remedy this, keep a global monotonic atomic64_t value that acts as a
> > timestamp floor.  When we go to stamp a file, we first get the latter of
> > the current floor value and the current coarse-grained time. If the
> > inode ctime hasn't been queried then we just attempt to stamp it with
> > that value.
> > 
> > If it has been queried, then first see whether the current coarse time
> > is later than the existing ctime. If it is, then we accept that value.
> > If it isn't, then we get a fine-grained timestamp.
> > 
> > Filesystems can opt into this by setting the FS_MGTIME fstype flag.
> > Others should be unaffected (other than being subject to the same floor
> > value as multigrain filesystems).
> > 
> > Tested-by: Randy Dunlap <rdunlap@...radead.org> # documentation bits
> > Signed-off-by: Jeff Layton <jlayton@...nel.org>
> 
> Mostly looks good. Some smaller comments below.
> 
> > +/**
> > + * current_time - Return FS time (possibly fine-grained)
> > + * @inode: inode.
> > + *
> > + * Return the current time truncated to the time granularity supported by
> > + * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
> > + * as having been QUERIED, get a fine-grained timestamp, but don't update
> > + * the floor.
> > + *
> > + * For a multigrain inode, this is effectively an estimate of the timestamp
> > + * that a file would receive. An actual update must go through
> > + * inode_set_ctime_current().
> > + */
> > +struct timespec64 current_time(struct inode *inode)
> > +{
> > +	struct timespec64 now;
> > +	u32 cns;
> > +
> > +	ktime_get_coarse_real_ts64_mg(&now);
> > +
> > +	if (!is_mgtime(inode))
> > +		goto out;
> > +
> > +	/* If nothing has queried it, then coarse time is fine */
> > +	cns = smp_load_acquire(&inode->i_ctime_nsec);
> > +	if (cns & I_CTIME_QUERIED) {
> > +		/*
> > +		 * If there is no apparent change, then get a fine-grained
> > +		 * timestamp.
> > +		 */
> > +		if (now.tv_nsec == (cns & ~I_CTIME_QUERIED))
> > +			ktime_get_real_ts64(&now);
> > +	}
> > +out:
> > +	return timestamp_truncate(now, inode);
> > +}
> > +EXPORT_SYMBOL(current_time);
> > +
> >  static int inode_needs_update_time(struct inode *inode)
> >  {
> > +	struct timespec64 now, ts;
> >  	int sync_it = 0;
> > -	struct timespec64 now = current_time(inode);
> > -	struct timespec64 ts;
> >  
> >  	/* First try to exhaust all avenues to not sync */
> >  	if (IS_NOCMTIME(inode))
> >  		return 0;
> >  
> > +	now = current_time(inode);
> > +
> >  	ts = inode_get_mtime(inode);
> >  	if (!timespec64_equal(&ts, &now))
> > -		sync_it = S_MTIME;
> > +		sync_it |= S_MTIME;
> >  
> >  	ts = inode_get_ctime(inode);
> >  	if (!timespec64_equal(&ts, &now))
> > @@ -2598,6 +2637,15 @@ void inode_nohighmem(struct inode *inode)
> >  }
> >  EXPORT_SYMBOL(inode_nohighmem);
> >  
> > +struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts)
> > +{
> > +	set_normalized_timespec64(&ts, ts.tv_sec, ts.tv_nsec);
> > +	inode->i_ctime_sec = ts.tv_sec;
> > +	inode->i_ctime_nsec = ts.tv_nsec;
> > +	return ts;
> > +}
> > +EXPORT_SYMBOL(inode_set_ctime_to_ts);
> > +
> >  /**
> >   * timestamp_truncate - Truncate timespec to a granularity
> >   * @t: Timespec
> > @@ -2630,36 +2678,75 @@ struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode)
> >  EXPORT_SYMBOL(timestamp_truncate);
> >  
> >  /**
> > - * current_time - Return FS time
> > - * @inode: inode.
> > + * inode_set_ctime_current - set the ctime to current_time
> > + * @inode: inode
> >   *
> > - * Return the current time truncated to the time granularity supported by
> > - * the fs.
> > + * Set the inode's ctime to the current value for the inode. Returns the
> > + * current value that was assigned. If this is not a multigrain inode, then we
> > + * set it to the later of the coarse time and floor value.
> >   *
> > - * Note that inode and inode->sb cannot be NULL.
> > - * Otherwise, the function warns and returns time without truncation.
> > + * If it is multigrain, then we first see if the coarse-grained timestamp is
> > + * distinct from what we have. If so, then we'll just use that. If we have to
> > + * get a fine-grained timestamp, then do so, and try to swap it into the floor.
> > + * We accept the new floor value regardless of the outcome of the cmpxchg.
> > + * After that, we try to swap the new value into i_ctime_nsec. Again, we take
> > + * the resulting ctime, regardless of the outcome of the swap.
> 
> This comment seems outdated now. No floor in this function anymore...
> 

True. Will fix.

> > -struct timespec64 current_time(struct inode *inode)
> > +struct timespec64 inode_set_ctime_current(struct inode *inode)
> >  {
> >  	struct timespec64 now;
> > +	u32 cns, cur;
> ...
> 
> > diff --git a/fs/stat.c b/fs/stat.c
> > index 41e598376d7e..381926fb405f 100644
> > --- a/fs/stat.c
> > +++ b/fs/stat.c
> > @@ -26,6 +26,35 @@
> >  #include "internal.h"
> >  #include "mount.h"
> >  
> > +/**
> > + * fill_mg_cmtime - Fill in the mtime and ctime and flag ctime as QUERIED
> > + * @stat: where to store the resulting values
> > + * @request_mask: STATX_* values requested
> > + * @inode: inode from which to grab the c/mtime
> > + *
> > + * Given @inode, grab the ctime and mtime out if it and store the result
> 						 ^^ of
> 
> > + * in @stat. When fetching the value, flag it as QUERIED (if not already)
> > + * so the next write will record a distinct timestamp.
> > + */
> > +void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode)
> > +{
> 
> Given how things worked out in the end, it seems this function doesn't need
> to handle mtime at all and we can move mtime handling back to shared generic
> code?
> 

I don't think we can. The mtime is effectively derived from the ctime.

If I query only the mtime, I think it's reasonable to expect that it
will change if there is another write, even if I don't query the ctime.
We won't get that unless we can also set the flag in the ctime when
only the mtime is requested.

> > +	atomic_t *pcn = (atomic_t *)&inode->i_ctime_nsec;
> > +
> > +	/* If neither time was requested, then don't report them */
> > +	if (!(request_mask & (STATX_CTIME|STATX_MTIME))) {
> > +		stat->result_mask &= ~(STATX_CTIME|STATX_MTIME);
> > +		return;
> > +	}
> > +
> > +	stat->mtime = inode_get_mtime(inode);
> > +	stat->ctime.tv_sec = inode->i_ctime_sec;
> > +	stat->ctime.tv_nsec = (u32)atomic_read(pcn);
> > +	if (!(stat->ctime.tv_nsec & I_CTIME_QUERIED))
> > +		stat->ctime.tv_nsec = ((u32)atomic_fetch_or(I_CTIME_QUERIED, pcn));
> > +	stat->ctime.tv_nsec &= ~I_CTIME_QUERIED;
> > +}
> > +EXPORT_SYMBOL(fill_mg_cmtime);
> > +
> >  /**
> >   * generic_fillattr - Fill in the basic attributes from the inode struct
> >   * @idmap:		idmap of the mount the inode was found from
> > @@ -58,8 +87,14 @@ void generic_fillattr(struct mnt_idmap *idmap, u32 request_mask,
> >  	stat->rdev = inode->i_rdev;
> >  	stat->size = i_size_read(inode);
> >  	stat->atime = inode_get_atime(inode);
> > -	stat->mtime = inode_get_mtime(inode);
> > -	stat->ctime = inode_get_ctime(inode);
> > +
> > +	if (is_mgtime(inode)) {
> > +		fill_mg_cmtime(stat, request_mask, inode);
> > +	} else {
> > +		stat->ctime = inode_get_ctime(inode);
> > +		stat->mtime = inode_get_mtime(inode);
> > +	}
> > +
> >  	stat->blksize = i_blocksize(inode);
> >  	stat->blocks = inode->i_blocks;
> >  
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index e3c603d01337..23908bad166c 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1653,6 +1653,17 @@ static inline struct timespec64 inode_set_mtime(struct inode *inode,
> >  	return inode_set_mtime_to_ts(inode, ts);
> >  }
> >  
> > +/*
> > + * Multigrain timestamps
> > + *
> > + * Conditionally use fine-grained ctime and mtime timestamps when there
> > + * are users actively observing them via getattr. The primary use-case
> > + * for this is NFS clients that use the ctime to distinguish between
> > + * different states of the file, and that are often fooled by multiple
> > + * operations that occur in the same coarse-grained timer tick.
> 
> Again, mtime seems unaffected by mgtime changes now.
> 

I still think we need this.
 
> > + */
> > +#define I_CTIME_QUERIED		((u32)BIT(31))
> > +
> 
> 								Honza

-- 
Jeff Layton <jlayton@...nel.org>