linux-ext4 - Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <01ecb2b7876a4562570658b92f4c12cbc7ad2518.camel@kernel.org>
Date:   Tue, 24 Oct 2023 10:24:24 -0400
From:   Jeff Layton <jlayton@...nel.org>
To:     Dave Chinner <david@...morbit.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Kent Overstreet <kent.overstreet@...ux.dev>,
        Christian Brauner <brauner@...nel.org>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        John Stultz <jstultz@...gle.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Stephen Boyd <sboyd@...nel.org>,
        Chandan Babu R <chandan.babu@...cle.com>,
        "Darrick J. Wong" <djwong@...nel.org>,
        Theodore Ts'o <tytso@....edu>,
        Andreas Dilger <adilger.kernel@...ger.ca>,
        Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>,
        David Sterba <dsterba@...e.com>,
        Hugh Dickins <hughd@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Amir Goldstein <amir73il@...il.com>, Jan Kara <jack@...e.de>,
        David Howells <dhowells@...hat.com>,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-xfs@...r.kernel.org, linux-ext4@...r.kernel.org,
        linux-btrfs@...r.kernel.org, linux-mm@...ck.org,
        linux-nfs@...r.kernel.org
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain
 timestamp handing

On Tue, 2023-10-24 at 14:40 +1100, Dave Chinner wrote:
> On Mon, Oct 23, 2023 at 02:18:12PM -1000, Linus Torvalds wrote:
> > On Mon, 23 Oct 2023 at 13:26, Dave Chinner <david@...morbit.com> wrote:
> > > 
> > > The problem is the first read request after a modification has been
> > > made. That is causing relatime to see mtime > atime and triggering
> > > an atime update. XFS sees this, does an atime update, and in
> > > committing that persistent inode metadata update, it calls
> > > inode_maybe_inc_iversion(force = false) to check if an iversion
> > > update is necessary. The VFS sees I_VERSION_QUERIED, and so it bumps
> > > i_version and tells XFS to persist it.
> > 
> > Could we perhaps just have a mode where we don't increment i_version
> > for just atime updates?
> > 
> > Maybe we don't even need a mode, and could just decide that atime
> > updates aren't i_version updates at all?
> 
> We do that already - in memory atime updates don't bump i_version at
> all. The issue is the rare persistent atime update requests that
> still happen - they are the ones that trigger an i_version bump on
> XFS, and one of the relatime heuristics tickle this specific issue.
> 
> If we push the problematic persistent atime updates to be in-memory
> updates only, then the whole problem with i_version goes away....
> 

POSIX (more or less) states that if you're updating the inode for any
reason other than an atime update, then you need to update the ctime.
The NFSv4 spec (more or less) defines the change attribute as something
that should show a change any time the ctime would change.

Now, from mount(8) manpage, in the section on lazytime:

           The on-disk timestamps are updated only when:

           •   the inode needs to be updated for some change unrelated to file timestamps

           •   the application employs fsync(2), syncfs(2), or sync(2)

           •   an undeleted inode is evicted from memory

           •   more than 24 hours have passed since the inode was written to disk.


The first is not a problem for NFS. If we're updating the ctime or mtime
or other attributes, then we _need_ to bump the change attribute
(assuming that something has queried it and can tell a difference).

The second is potentially an issue. If a file has an in-memory atime
update and them someone calls sync(2), XFS will end up bumping the
change attribute.

Ditto for the third. If someone does a getattr and brings an inode back
into core, then you're looking at a cache invalidation on the client.

The fourth is also a problem, once a day your clients will end up
invaliding their caches.

You might think "so what? Once a day is no big deal!", but there are
places that have built up cascading fscache setups across WANs to
distribute large, read-mostly files. This is quite problematic in these
sorts of setups as they end up seeing cache invalidations all over the
place.

noatime is a workaround, but it has its own problems and ugliness, and
it sucks that XFS doesn't "just work" when served by NFS.


> > Yes, yes, it's obviously technically a "inode modification", but does
> > anybody actually *want* atime updates with no actual other changes to
> > be version events?
> 
> Well, yes, there was. That's why we defined i_version in the on disk
> format this way well over a decade ago. It was part of some deep
> dark magical HSM beans that allowed the application to combine
> multiple scans for different inode metadata changes into a single
> pass. atime changes was one of the things it needed to know about
> for tiering and space scavenging purposes....
> 

As Amir points out, is this still behavior that you're required to
preserve? NFS serving is a bit more common than weird HSM stuff. 

Maybe newly created XFS filesystems could be made to update i_version in
a way that nfsd expects? We could add a mkfs.xfs option to allow for the
legacy behavior if required.

> > Or maybe i_version can update, but callers of getattr() could have two
> > bits for that STATX_CHANGE_COOKIE, one for "I care about atime" and
> > one for others, and we'd pass that down to inode_query_version, and
> > we'd have a I_VERSION_QUERIED and a I_VERSION_QUERIED_STRICT, and the
> > "I care about atime" case ould set the strict one.
> 
> This makes correct behaviour reliant on the applicaiton using the
> query mechanism correctly. I have my doubts that userspace
> developers will be able to understand the subtle difference between
> the two options and always choose correctly....
> 
> And then there's always the issue that we might end up with both
> flags set and we get conflicting bug reports about how atime is not
> behaving the way the applications want it to behave.
> 
> > Then inode_maybe_inc_iversion() could - for atome updates - skip the
> > version update *unless* it sees that I_VERSION_QUERIED_STRICT bit.
> > 
> > Does that sound sane to people?
> 
> I'd much prefer we just do the right thing transparently at the
> filesystem level; all we need is for the inode to be flagged that it
> should be doing in memory atime updates rather than persistent
> updates.
> 
> Perhaps the nfs server should just set a new S_LAZYTIME flag on
> inodes it accesses similar to how we can set S_NOATIME on inodes to
> elide atime updates altogether. Once set, the inode will behave that
> way until it is reclaimed from memory....
> 


Yeah, I think adding this sort of extra complexity on the query side is
probably not what's needed.

I'm also not crazy about trying to treat nfsd's accesses as special in
some way. nfsd is really not doing anything special at all, other than
querying for STATX_CHANGE_COOKIE. 

The problem is on the update side: nfsd expects the STATX_CHANGE_COOKIE
to conform to certain behavior, and XFS simply does not (specifically
around updates to the inode that only change the atime).

-- 
Jeff Layton <jlayton@...nel.org>