linux-kernel - Re: [man-pages RFC PATCH v4] statx, inode: document the new STATX_INO

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <24005713ad25370d64ab5bd0db0b2e4fcb902c1c.camel@kernel.org>
Date:   Fri, 16 Sep 2022 11:11:34 -0400
From:   Jeff Layton <jlayton@...nel.org>
To:     Theodore Ts'o <tytso@....edu>, NeilBrown <neilb@...e.de>
Cc:     Trond Myklebust <trondmy@...merspace.com>,
        "bfields@...ldses.org" <bfields@...ldses.org>,
        "zohar@...ux.ibm.com" <zohar@...ux.ibm.com>,
        "djwong@...nel.org" <djwong@...nel.org>,
        "brauner@...nel.org" <brauner@...nel.org>,
        "linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>,
        "linux-api@...r.kernel.org" <linux-api@...r.kernel.org>,
        "david@...morbit.com" <david@...morbit.com>,
        "fweimer@...hat.com" <fweimer@...hat.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "chuck.lever@...cle.com" <chuck.lever@...cle.com>,
        "linux-man@...r.kernel.org" <linux-man@...r.kernel.org>,
        "linux-nfs@...r.kernel.org" <linux-nfs@...r.kernel.org>,
        "linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
        "jack@...e.cz" <jack@...e.cz>,
        "viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
        "xiubli@...hat.com" <xiubli@...hat.com>,
        "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
        "adilger.kernel@...ger.ca" <adilger.kernel@...ger.ca>,
        "lczerner@...hat.com" <lczerner@...hat.com>,
        "ceph-devel@...r.kernel.org" <ceph-devel@...r.kernel.org>,
        "linux-btrfs@...r.kernel.org" <linux-btrfs@...r.kernel.org>
Subject: Re: [man-pages RFC PATCH v4] statx, inode: document the new
 STATX_INO_VERSION field

On Fri, 2022-09-16 at 07:36 -0400, Jeff Layton wrote:
> On Fri, 2022-09-16 at 02:54 -0400, Theodore Ts'o wrote:
> > On Fri, Sep 16, 2022 at 08:23:55AM +1000, NeilBrown wrote:
> > > > > If the answer is that 'all values change', then why store the crash
> > > > > counter in the inode at all? Why not just add it as an offset when
> > > > > you're generating the user-visible change attribute?
> > > > > 
> > > > > i.e. statx.change_attr = inode->i_version + (crash counter * offset)
> > 
> > I had suggested just hashing the crash counter with the file system's
> > on-disk i_version number, which is essentially what you are suggested.
> > 
> > > > Yes, if we plan to ensure that all the change attrs change after a
> > > > crash, we can do that.
> > > > 
> > > > So what would make sense for an offset? Maybe 2**12? One would hope that
> > > > there wouldn't be more than 4k increments before one of them made it to
> > > > disk. OTOH, maybe that can happen with teeny-tiny writes.
> > > 
> > > Leave it up the to filesystem to decide.  The VFS and/or NFSD should
> > > have not have part in calculating the i_version.  It should be entirely
> > > in the filesystem - though support code could be provided if common
> > > patterns exist across filesystems.
> > 
> > Oh, *heck* no.  This parameter is for the NFS implementation to
> > decide, because it's NFS's caching algorithms which are at stake here.
> > 
> > As a the file system maintainer, I had offered to make an on-disk
> > "crash counter" which would get updated when the journal had gotten
> > replayed, in addition to the on-disk i_version number.  This will be
> > available for the Linux implementation of NFSD to use, but that's up
> > to *you* to decide how you want to use them.
> > 
> > I was perfectly happy with hashing the crash counter and the i_version
> > because I had assumed that not *that* much stuff was going to be
> > cached, and so invalidating all of the caches in the unusual case
> > where there was a crash was acceptable.  After all it's a !@#?!@
> > cache.  Caches sometimmes get invalidated.  "That is the order of
> > things." (as Ramata'Klan once said in "Rocks and Shoals")
> > 
> > But if people expect that multiple TB's of data is going to be stored;
> > that cache invalidation is unacceptable; and that a itsy-weeny chance
> > of false negative failures which might cause data corruption might be
> > acceptable tradeoff, hey, that's for the system which is providing
> > caching semantics to determine.
> > 
> > PLEASE don't put this tradeoff on the file system authors; I would
> > much prefer to leave this tradeoff in the hands of the system which is
> > trying to do the caching.
> > 
> 
> Yeah, if we were designing this from scratch, I might agree with leaving
> more up to the filesystem, but the existing users all have pretty much
> the same needs. I'm going to plan to try to keep most of this in the
> common infrastructure defined in iversion.h.
> 
> Ted, for the ext4 crash counter, what wordsize were you thinking? I
> doubt we'll be able to use much more than 32 bits so a larger integer is
> probably not worthwhile. There are several holes in struct super_block
> (at least on x86_64), so adding this field to the generic structure
> needn't grow it.

That said, now that I've taken a swipe at implementing this, I need more
information than just the crash counter. We need to multiply the crash
counter with a reasonable estimate of the maximum number of individual
writes that could occur between an i_version being incremented and that
value making it to the backing store.

IOW, given a write that bumps the i_version to X, how many more write
calls could race in before X makes it to the platter? I took a SWAG and
said 4k in an earlier email, but I don't really have a way to know, and
that could vary wildly with different filesystems and storage.

What I'd like to see is this in struct super_block:

	u32		s_version_offset;

...and then individual filesystems can calculate:

	crash_counter * max_number_of_writes

and put the correct value in there at mount time.

-- 
Jeff Layton <jlayton@...nel.org>