linux-ext4 - Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZTWfX3CqPy9yCddQ@dread.disaster.area>
Date:   Mon, 23 Oct 2023 09:17:03 +1100
From:   Dave Chinner <david@...morbit.com>
To:     Jeff Layton <jlayton@...nel.org>
Cc:     Kent Overstreet <kent.overstreet@...ux.dev>,
        Christian Brauner <brauner@...nel.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        John Stultz <jstultz@...gle.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Stephen Boyd <sboyd@...nel.org>,
        Chandan Babu R <chandan.babu@...cle.com>,
        "Darrick J. Wong" <djwong@...nel.org>,
        Theodore Ts'o <tytso@....edu>,
        Andreas Dilger <adilger.kernel@...ger.ca>,
        Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>,
        David Sterba <dsterba@...e.com>,
        Hugh Dickins <hughd@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Amir Goldstein <amir73il@...il.com>, Jan Kara <jack@...e.de>,
        David Howells <dhowells@...hat.com>,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-xfs@...r.kernel.org, linux-ext4@...r.kernel.org,
        linux-btrfs@...r.kernel.org, linux-mm@...ck.org,
        linux-nfs@...r.kernel.org
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain
 timestamp handing

On Fri, Oct 20, 2023 at 08:12:45AM -0400, Jeff Layton wrote:
> On Fri, 2023-10-20 at 09:02 +1100, Dave Chinner wrote:
> > On Thu, Oct 19, 2023 at 07:28:48AM -0400, Jeff Layton wrote:
> > > On Thu, 2023-10-19 at 11:29 +0200, Christian Brauner wrote:
> > > > > Back to your earlier point though:
> > > > > 
> > > > > Is a global offset really a non-starter? I can see about doing something
> > > > > per-superblock, but ktime_get_mg_coarse_ts64 should be roughly as cheap
> > > > > as ktime_get_coarse_ts64. I don't see the downside there for the non-
> > > > > multigrain filesystems to call that.
> > > > 
> > > > I have to say that this doesn't excite me. This whole thing feels a bit
> > > > hackish. I think that a change version is the way more sane way to go.
> > > > 
> > > 
> > > What is it about this set that feels so much more hackish to you? Most
> > > of this set is pretty similar to what we had to revert. Is it just the
> > > timekeeper changes? Why do you feel those are a problem?
> > > 
> > > > > 
> > > > > On another note: maybe I need to put this behind a Kconfig option
> > > > > initially too?
> > > > 
> > > > So can we for a second consider not introducing fine-grained timestamps
> > > > at all. We let NFSv3 live with the cache problem it's been living with
> > > > forever.
> > > > 
> > > > And for NFSv4 we actually do introduce a proper i_version for all
> > > > filesystems that matter to it.
> > > > 
> > > > What filesystems exactly don't expose a proper i_version and what does
> > > > prevent them from adding one or fixing it?
> > > 
> > > Certainly we can drop this series altogether if that's the consensus.
> > > 
> > > The main exportable filesystem that doesn't have a suitable change
> > > counter now is XFS. Fixing it will require an on-disk format change to
> > > accommodate a new version counter that doesn't increment on atime
> > > updates. This is something the XFS folks were specifically looking to
> > > avoid, but maybe that's the simpler option.
> > 
> > And now we have travelled the full circle.
> > 
> 
> LOL, yes!
> 
> > The problem NFS has with atime updates on XFS is a result of
> > the default behaviour of relatime - it *always* forces a persistent
> > atime update after mtime has changed. Hence a read-after-write
> > operation will trigger an atime update because atime is older than
> > mtime. This is what causes XFS to run a transaction (i.e. a
> > persistent atime update) and that bumps iversion.
> > 
> 
> Those particular atime updates are not a problem. If we're updating the
> mtime and ctime anyway, then bumping the change attribute is OK.
> 
> The problem is that relatime _also_ does an on-disk update once a day
> for just an atime update. On XFS, this means that the change attribute
> also gets bumped and the clients invalidate their caches all at once.
> 
> That doesn't sound like a big problem at first, but what if you're
> sharing a multi-gigabyte r/o file between multiple clients? This sort of
> thing is fairly common on render-farm workloads, and all of your clients
> will end up invalidating their caches once once a day if you're serving
> from XFS.

So we have noatime inode and mount options for such specialised
workloads that cannot tolerate cached ever being invalidated, yes?

> > lazytime does not behave this way - it delays all persistent
> > timestamp updates until the next persistent change or until the
> > lazytime aggregation period expires (24 hours). Hence with lazytime,
> > read-after-write operations do not trigger a persistent atime
> > update, and so XFS does not run a transaction to update atime. Hence
> > i_version does not get bumped, and NFS behaves as expected.
> > 
> 
> Similar problem here. Once a day, NFS clients will invalidate the cache
> on any static content served from XFS.

Lazytime has /proc/sys/vm/dirtytime_expire_seconds to change the
interval that triggers persistent time changes. That could easily be
configured to be longer than a day for workloads that care about
this sort of thing. Indeed, we could just set up timestamps that NFS
says "do not make persistent" to only be persisted when the inode is
removed from server memory rather than be timed out by background
writeback....

-----

All I'm suggesting is that rather than using mount options for
noatime-like behaviour for NFSD accesses, we actually have the nfsd
accesses say "we'd like pure atime updates without iversion, please".

Keep in mind that XFS does actually try to avoid bumping i_version
on pure timestamp updates - we carved that out a long time ago (see
the difference in XFS_ILOG_CORE vs XFS_ILOG_TIMESTAMP in
xfs_vn_update_time() and xfs_trans_log_inode()) so that we could
optimise fdatasync() to ignore timestamp updates that occur as a
result of pure data overwrites.

Hence XFS only bumps i_version for pure timestamp updates if the
iversion queried flag is set. IOWs, XFS it is actually doing exactly
what the VFS iversion implementation is telling it to do with
timestamp updates for non-core inode metadata updates.

That's the fundamental issue here: nfsd has set VFS state that tells
the filesystem to "bump iversion on next persistent inode change",
but the nfsd then runs operations that can change non-critical
persistent inode state in "query-only" operations. It then expects
filesystems to know that it should ignore the iversion queried state
within this context.  However, without external behavioural control
flags, filesystems cannot know that an isolated metadata update has
context specific iversion behavioural constraints.

Hence fixing this is purely a VFS/nfsd i_version implementation
problem - if the nfsd is running a querying operation, it should
tell the filesystem that it should ignore iversion query state. If
nothing the application level cache cares about is being changed
during the query operation, it should tell the filesystem to ignore
iversion query state because it is likely the nfsd query itself will
set it (or have already set it itself in the case of compound
operations).

This does not need XFS on-disk format changes to fix. This does not
need changes to timestamp infrastructure to fix. We just need the
nfsd application to tell us that we should ignore the vfs i_version
query state when we update non-core inode metadata within query
operation contexts.

-Dave.
-- 
Dave Chinner
david@...morbit.com