Message-ID: <YyQdmLpiAMvl5EkU@mit.edu>
Date: Fri, 16 Sep 2022 02:54:16 -0400
From: "Theodore Ts'o" <tytso@....edu>
To: NeilBrown <neilb@...e.de>
Cc: Jeff Layton <jlayton@...nel.org>,
Trond Myklebust <trondmy@...merspace.com>,
"bfields@...ldses.org" <bfields@...ldses.org>,
"zohar@...ux.ibm.com" <zohar@...ux.ibm.com>,
"djwong@...nel.org" <djwong@...nel.org>,
"brauner@...nel.org" <brauner@...nel.org>,
"linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>,
"linux-api@...r.kernel.org" <linux-api@...r.kernel.org>,
"david@...morbit.com" <david@...morbit.com>,
"fweimer@...hat.com" <fweimer@...hat.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"chuck.lever@...cle.com" <chuck.lever@...cle.com>,
"linux-man@...r.kernel.org" <linux-man@...r.kernel.org>,
"linux-nfs@...r.kernel.org" <linux-nfs@...r.kernel.org>,
"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
"jack@...e.cz" <jack@...e.cz>,
"viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
"xiubli@...hat.com" <xiubli@...hat.com>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"adilger.kernel@...ger.ca" <adilger.kernel@...ger.ca>,
"lczerner@...hat.com" <lczerner@...hat.com>,
"ceph-devel@...r.kernel.org" <ceph-devel@...r.kernel.org>,
"linux-btrfs@...r.kernel.org" <linux-btrfs@...r.kernel.org>
Subject: Re: [man-pages RFC PATCH v4] statx, inode: document the new
STATX_INO_VERSION field
On Fri, Sep 16, 2022 at 08:23:55AM +1000, NeilBrown wrote:
> > > If the answer is that 'all values change', then why store the crash
> > > counter in the inode at all? Why not just add it as an offset when
> > > you're generating the user-visible change attribute?
> > >
> > > i.e. statx.change_attr = inode->i_version + (crash counter * offset)

I had suggested just hashing the crash counter with the file system's
on-disk i_version number, which is essentially what you are suggesting.

> > Yes, if we plan to ensure that all the change attrs change after a
> > crash, we can do that.
> >
> > So what would make sense for an offset? Maybe 2**12? One would hope that
> > there wouldn't be more than 4k increments before one of them made it to
> > disk. OTOH, maybe that can happen with teeny-tiny writes.
>
> Leave it up to the filesystem to decide. The VFS and/or NFSD should
> not have any part in calculating the i_version. It should be entirely
> in the filesystem - though support code could be provided if common
> patterns exist across filesystems.

Oh, *heck* no. This parameter is for the NFS implementation to
decide, because it's NFS's caching algorithms which are at stake here.

As a file system maintainer, I had offered to make an on-disk
"crash counter" which would get updated when the journal had gotten
replayed, in addition to the on-disk i_version number. Both will be
available for the Linux implementation of NFSD to use, but it's up
to *you* to decide how you want to use them.

I was perfectly happy with hashing the crash counter and the i_version
because I had assumed that not *that* much stuff was going to be
cached, and so invalidating all of the caches in the unusual case
where there was a crash was acceptable. After all it's a !@#?!@
cache. Caches sometimes get invalidated. "That is the order of
things." (as Remata'Klan once said in "Rocks and Shoals")
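
For concreteness, here is a minimal sketch of the two ways of combining
the values that have come up in this thread: the additive offset (using
the 2**12 figure floated above) and a hash of the crash counter with
i_version. All names here are hypothetical, and the mixing function is
just the splitmix64 finalizer picked for illustration; none of this is
an existing kernel or NFSD interface.

```c
#include <stdint.h>

/* Hypothetical: the 2**12 offset floated earlier in the thread. */
#define CRASH_OFFSET_SHIFT 12

/* Offset variant: change_attr = i_version + crash_counter * 2**12. */
static uint64_t combine_offset(uint64_t i_version, uint64_t crash_counter)
{
    return i_version + (crash_counter << CRASH_OFFSET_SHIFT);
}

/* splitmix64 finalizer: an invertible 64-bit mixing step. */
static uint64_t mix64(uint64_t x)
{
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

/*
 * Hash variant: fold the crash counter into i_version.  Because mix64()
 * is invertible, a different crash counter always yields a different
 * result for the same i_version, so every cached value is invalidated
 * after a crash.  The result is no longer ordered like i_version.
 */
static uint64_t combine_hash(uint64_t i_version, uint64_t crash_counter)
{
    return mix64(i_version ^ mix64(crash_counter));
}
```

Either way, the on-disk i_version and the crash counter stay separate
inputs; only the consumer decides how to fold them together.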

But if people expect that multiple TBs of data are going to be stored;
that cache invalidation is unacceptable; and that an itsy-weeny chance
of false negative failures which might cause data corruption might be
an acceptable tradeoff, hey, that's for the system which is providing
caching semantics to determine.
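
As one illustration of where a false negative could come from, take the
additive-offset variant quoted above. This is only my reading of the
concern raised there about more than 4k increments happening before one
reaches disk, and the names and the 4096 constant are assumptions. If
that many increments are lost in a crash, the post-crash values can
climb back to a value a client already saw for different data.

```c
#include <stdint.h>

/* Hypothetical offset added per crash (the 2**12 from the thread). */
#define CRASH_OFFSET 4096ULL

/* Offset variant of the user-visible change attribute. */
static uint64_t visible_attr(uint64_t i_version, uint64_t crashes)
{
    return i_version + crashes * CRASH_OFFSET;
}

/*
 * lost: in-memory i_version increments that never reached disk before
 * the crash.  Returns how many post-crash changes it takes for the
 * visible attribute to equal the last value handed out before the
 * crash, or -1 if that value can no longer be reached.
 */
static int64_t changes_until_collision(uint64_t lost)
{
    if (lost < CRASH_OFFSET)
        return -1;
    return (int64_t)(lost - CRASH_OFFSET);
}
```

For example, with an on-disk i_version of 100 and 5000 lost increments,
a client saw 5100 before the crash; after the crash the attribute
restarts at 100 + 4096 and reaches 5100 again after 904 further changes.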

PLEASE don't put this tradeoff on the file system authors; I would
much prefer to leave this tradeoff in the hands of the system which is
trying to do the caching.

- Ted