Message-ID: <3ae88800184f03b152aba6e4a95ebf26e854dd63.camel@hammerspace.com>
Date: Wed, 1 Nov 2023 21:34:57 +0000
From: Trond Myklebust <trondmy@...merspace.com>
To: "torvalds@...ux-foundation.org" <torvalds@...ux-foundation.org>,
"jack@...e.cz" <jack@...e.cz>
CC: "clm@...com" <clm@...com>,
"josef@...icpanda.com" <josef@...icpanda.com>,
"jstultz@...gle.com" <jstultz@...gle.com>,
"djwong@...nel.org" <djwong@...nel.org>,
"brauner@...nel.org" <brauner@...nel.org>,
"chandan.babu@...cle.com" <chandan.babu@...cle.com>,
"hughd@...gle.com" <hughd@...gle.com>,
"linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>,
"david@...morbit.com" <david@...morbit.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"dsterba@...e.com" <dsterba@...e.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"jlayton@...nel.org" <jlayton@...nel.org>,
"tglx@...utronix.de" <tglx@...utronix.de>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"linux-nfs@...r.kernel.org" <linux-nfs@...r.kernel.org>,
"tytso@....edu" <tytso@....edu>,
"viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
"amir73il@...il.com" <amir73il@...il.com>,
"linux-btrfs@...r.kernel.org" <linux-btrfs@...r.kernel.org>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"adilger.kernel@...ger.ca" <adilger.kernel@...ger.ca>,
"kent.overstreet@...ux.dev" <kent.overstreet@...ux.dev>,
"sboyd@...nel.org" <sboyd@...nel.org>,
"dhowells@...hat.com" <dhowells@...hat.com>,
"jack@...e.de" <jack@...e.de>
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain
timestamp handing
On Wed, 2023-11-01 at 10:10 -1000, Linus Torvalds wrote:
> On Wed, 1 Nov 2023 at 00:16, Jan Kara <jack@...e.cz> wrote:
> >
> > OK, but is this compatible with the current XFS behavior? AFAICS currently
> > XFS sets sb->s_time_gran to 1 so timestamps currently stored on disk will
> > have some mostly random garbage in low bits of the ctime.
>
> I really *really* don't think we can use ctime as a "i_version"
> replacement. The whole fine-granularity patches were well-intentioned,
> but I do think they were broken.
>
> Note that we can't use ctime as a "i_version" replacement for other
> reasons too - you have filesystems like FAT - which people do want to
> export - that have a single-second (or is it 2s?) granularity in
> reality, even though they report a 1ns value in s_time_gran.
>
> But here's a suggestion that people may hate, but that might just work
> in practice:
>
> - get rid of i_version entirely
>
> - use the "known good" part of ctime as the upper bits of the change
> counter (and by "known good" I mean tv_sec - or possibly even "tv_sec
> / 2" if that dim FAT memory of mine is right)
>
> - make the rule be that ctime is *never* updated for atime updates
> (maybe that's already true, I didn't check - maybe it needs a new
> mount flag for nfsd)
>
> - have a per-inode in-memory and vfs-internal (entirely invisible to
> filesystems) "ctime modification counter" that is *NOT* a timestamp,
> and is *NOT* i_version
>
> - make the rule be that the "ctime modification counter" is always
> zero, *EXCEPT* if
> (a) I_VERSION_QUERIED is set
> AND
> (b) the ctime modification doesn't modify the "known good" part
> of ctime
>
> so how the "statx change cookie" ends up being "high bits tv_sec of
> ctime, low bits ctime modification cookie", and the end result of that
> is:
>
> - if all the reads happen after the last write (common case), then
> the low bits will be zero, because I_VERSION_QUERIED wasn't set when
> ctime was modified
>
> - if you do a write *after* a modification, the ctime cookie is
> guaranteed to change, because either the known good (sec/2sec) part of
> ctime is new, *or* the counter gets updated
>
> - if the nfs server reboots, the in-memory counter will be cleared
> again, and so the change cookie will cause client cache invalidations,
> but *only* for those "ctime changed in the same second _after_
> somebody did a read".
>
> - any long-time caches of files that don't get modified are all fine,
> because they will have those low bits zero and depend on just the
> stable part of ctime that works across filesystems. So there should be
> no nasty thundering herd issues on long-lived caches on lots of
> clients if the server reboots, or atime updates every 24 hours or
> anything like that.
>
> and note that *NONE* of this requires any filesystem involvement
> (except for the rule of "no atime changes ever impact ctime", which
> may or may not already be true).
>
> The filesystem does *not* know about that modification counter,
> there's no new on-disk stable information.
>
> It's entirely possible that I'm missing something obvious, but the
> above sounds to me like the only time you'd have stale invalidations
> is really the (unusual) case of having writes after cached reads, and
> then a reboot.
>
> We'd get rid of "inode_maybe_inc_iversion()" entirely, and instead
> replace it with logic in inode_set_ctime_current() that basically does
>
> - if the stable part of ctime changes, clear the new 32-bit counter
>
> - if I_VERSION_QUERIED isn't set, clear the new 32-bit counter
>
> - otherwise, increment the new 32-bit counter
>
> and then the STATX_CHANGE_COOKIE code basically just returns
>
> (stable part of ctime << 32) + new 32-bit counter
>
> (and again, the "stable part of ctime" is either just tv_sec, or it's
> "tv_sec >> 1" or whatever).
>
> The above does not expose *any* changes to timestamps to users, and
> should work across a wide variety of filesystems, without requiring
> any special code from the filesystem itself.
>
> And now please all jump on me and say "No, Linus, that won't work,
> because XYZ".
>
> Because it is *entirely* possible that I missed something truly
> fundamental, and the above is completely broken for some obvious
> reason that I just didn't think of.
>
My client writes to the file and immediately reads the ctime. A 3rd
party client then writes immediately after my ctime read, within the
same second.
A reboot occurs (maybe minutes later) and clears the in-memory counter,
then I re-read the ctime and get the same value as before the 3rd party
write.
Yes, most of the time that is better than the naked ctime, but not
across a reboot.
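
The sequence above can be traced with a small userspace model. This is
only a sketch under stated assumptions, not kernel code: struct
model_inode and every model_* function are hypothetical names, the
"stable part" of ctime is taken to be plain tv_sec, both writes are
assumed to land in the same second, and the queried flag (standing in
for I_VERSION_QUERIED) is assumed to be cleared only by a reboot.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the proposal; not the kernel's struct inode. */
struct model_inode {
    int64_t  ctime_sec;   /* "known good" part of ctime, stored on disk */
    uint32_t mod_counter; /* in-memory only, lost on reboot */
    bool     queried;     /* stands in for I_VERSION_QUERIED */
};

/* ctime update at time now_sec, following the three quoted rules. */
static void model_set_ctime(struct model_inode *in, int64_t now_sec)
{
    if (now_sec != in->ctime_sec) {
        /* stable part changes: clear the counter */
        in->ctime_sec = now_sec;
        in->mod_counter = 0;
    } else if (!in->queried) {
        /* nobody looked since the last change: clear the counter */
        in->mod_counter = 0;
    } else {
        /* same second, and somebody has already seen the cookie */
        in->mod_counter++;
    }
}

/* (stable part of ctime << 32) + counter; reading it marks the inode queried. */
static uint64_t model_change_cookie(struct model_inode *in)
{
    in->queried = true;
    return ((uint64_t)in->ctime_sec << 32) + in->mod_counter;
}

/* A reboot keeps the on-disk ctime but drops all in-memory state. */
static void model_reboot(struct model_inode *in)
{
    in->mod_counter = 0;
    in->queried = false;
}

/* Returns true if the write after the read is invisible after the reboot. */
static bool model_reboot_hides_write(void)
{
    struct model_inode in = { .ctime_sec = 0, .mod_counter = 0, .queried = false };

    model_set_ctime(&in, 100);                   /* my write */
    uint64_t before = model_change_cookie(&in);  /* my read */
    model_set_ctime(&in, 100);                   /* 3rd party write, same second */
    uint64_t live = model_change_cookie(&in);    /* differs while the server is up */
    model_reboot(&in);
    uint64_t after = model_change_cookie(&in);   /* my re-read after the reboot */

    return live != before && after == before;
}
```

In this model model_reboot_hides_write() returns true: the cookie
changes while the server stays up, but after the reboot it falls back
to the value handed out before the 3rd party write.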
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@...merspace.com