lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 22 Sep 2023 14:22:52 -0400
From:   Jeff Layton <jlayton@...nel.org>
To:     Kent Overstreet <kent.overstreet@...ux.dev>
Cc:     Alexander Viro <viro@...iv.linux.org.uk>,
        Christian Brauner <brauner@...nel.org>,
        Chuck Lever <chuck.lever@...cle.com>,
        Neil Brown <neilb@...e.de>,
        Olga Kornievskaia <kolga@...app.com>,
        Dai Ngo <Dai.Ngo@...cle.com>, Tom Talpey <tom@...pey.com>,
        Chandan Babu R <chandan.babu@...cle.com>,
        "Darrick J. Wong" <djwong@...nel.org>,
        Dave Chinner <david@...morbit.com>, Jan Kara <jack@...e.cz>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-nfs@...r.kernel.org, linux-xfs@...r.kernel.org
Subject: Re: [PATCH v8 1/5] fs: add infrastructure for multigrain timestamps

On Fri, 2023-09-22 at 13:31 -0400, Kent Overstreet wrote:
> On Fri, Sep 22, 2023 at 01:14:40PM -0400, Jeff Layton wrote:
> > The VFS always uses coarse-grained timestamps when updating the ctime
> > and mtime after a change. This has the benefit of allowing filesystems
> > to optimize away a lot metadata updates, down to around 1 per jiffy,
> > even when a file is under heavy writes.
> > 
> > Unfortunately, this has always been an issue when we're exporting via
> > NFS, which traditionally relied on timestamps to validate caches. A lot
> > of changes can happen in a jiffy, and that can lead to cache-coherency
> > issues between hosts.
> > 
> > NFSv4 added a dedicated change attribute that must change value after
> > any change to an inode. Some filesystems (btrfs, ext4 and tmpfs) utilize
> > the i_version field for this, but the NFSv4 spec allows a server to
> > generate this value from the inode's ctime.
> > 
> > What we need is a way to only use fine-grained timestamps when they are
> > being actively queried.
> > 
> > POSIX generally mandates that when the the mtime changes, the ctime must
> > also change. The kernel always stores normalized ctime values, so only
> > the first 30 bits of the tv_nsec field are ever used.
> > 
> > Use the 31st bit of the ctime tv_nsec field to indicate that something
> > has queried the inode for the mtime or ctime. When this flag is set,
> > on the next mtime or ctime update, the kernel will fetch a fine-grained
> > timestamp instead of the usual coarse-grained one.
> > 
> > Filesytems can opt into this behavior by setting the FS_MGTIME flag in
> > the fstype. Filesystems that don't set this flag will continue to use
> > coarse-grained timestamps.
> 
> Interesting...
> 
> So in bcachefs, for most inode fields the btree inode is the "master
> copy"; we do inode updates via btree transactions, and then on
> successful transaction commit we update the VFS inode to match.
> 
> (exceptions: i_size, i_blocks)
> 
> I'd been contemplating switching to that model for timestamp updates as
> well, since that would allow us to get rid of our
> super_operations.write_inode method - except we probably wouldn't want
> to do that since it would likely make timestamp updates too expensive.
> 
> And now with your scheme of stashing extra state in timespec, I'm glad
> we didn't.
> 
> Still, timestamp updates are a bit messier than I'd like, would be
> lovely to figure out a way to clean that up - right now we have an
> awkward mix of "sometimes timestamp updates happen in a btree
> transaction first, other times just the VFS inode is updated and marked
> dirty".
> 
> xfs doesn't have .write_inode, so it's probably time to study what it
> does...

A few months ago, we talked briefly and I asked about an i_version
counter for bcachefs. You were going to look into it, and I wasn't sure
if you had implemented one. If you haven't, then this may be a simpler
alternative.

For now, these aren't much good for anything other than faking up a
change attribute for NFSv4, but they should be fine for that and you
wouldn't need to grow your on-disk inode to accommodate them.

Cheers,
-- 
Jeff Layton <jlayton@...nel.org>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ