linux-ext4 - Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZUAwFkAizH1PrIZp@dread.disaster.area>
Date:   Tue, 31 Oct 2023 09:37:10 +1100
From:   Dave Chinner <david@...morbit.com>
To:     Jeff Layton <jlayton@...nel.org>
Cc:     Amir Goldstein <amir73il@...il.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Kent Overstreet <kent.overstreet@...ux.dev>,
        Christian Brauner <brauner@...nel.org>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        John Stultz <jstultz@...gle.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Stephen Boyd <sboyd@...nel.org>,
        Chandan Babu R <chandan.babu@...cle.com>,
        "Darrick J. Wong" <djwong@...nel.org>,
        Theodore Ts'o <tytso@....edu>,
        Andreas Dilger <adilger.kernel@...ger.ca>,
        Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>,
        David Sterba <dsterba@...e.com>,
        Hugh Dickins <hughd@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Jan Kara <jack@...e.de>, David Howells <dhowells@...hat.com>,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-xfs@...r.kernel.org, linux-ext4@...r.kernel.org,
        linux-btrfs@...r.kernel.org, linux-mm@...ck.org,
        linux-nfs@...r.kernel.org
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain
 timestamp handing

On Fri, Oct 27, 2023 at 06:35:58AM -0400, Jeff Layton wrote:
> On Thu, 2023-10-26 at 13:20 +1100, Dave Chinner wrote:
> > On Wed, Oct 25, 2023 at 08:25:35AM -0400, Jeff Layton wrote:
> > > On Wed, 2023-10-25 at 19:05 +1100, Dave Chinner wrote:
> > > > On Tue, Oct 24, 2023 at 02:40:06PM -0400, Jeff Layton wrote:
> > > In earlier discussions you alluded to some repair and/or analysis tools
> > > that depended on this counter.
> > 
> > Yes, and one of those "tools" is *me*.
> > 
> > I frequently look at the di_changecount when doing forensic and/or
> > failure analysis on filesystem corpses.  SOE analysis, relative
> > modification activity, etc all give insight into what happened to
> > the filesystem to get it into the state it is currently in, and
> > di_changecount provides information no other metadata in the inode
> > contains.
> > 
> > > I took a quick look in xfsprogs, but I
> > > didn't see anything there. Is there a library or something that these
> > > tools use to get at this value?
> > 
> > xfs_db is the tool I use for this, such as:
> > 
> > $ sudo xfs_db -c "sb 0" -c "a rootino" -c "p v3.change_count" /dev/mapper/fast
> > v3.change_count = 35
> > $
> > 
> > The root inode in this filesystem has a change count of 35. The root
> > inode has 32 dirents in it, which means that no entries have ever
> > been removed or renamed. This sort of insight into the past history
> > of inode metadata is largely impossible to get any other way, and
> > it's been the difference between understanding failure and having no
> > clue more than once.
> > 
> > Most block device parsing applications simply write their own
> > decoder that walks the on-disk format. That's pretty trivial to do,
> > developers can get all the information needed to do this from the
> > on-disk format specification documentation we keep on kernel.org...
> > 
> 
> Fair enough. I'm not here to tell you that you guys that you need to
> change how di_changecount works. If it's too valuable to keep it
> counting atime-only updates, then so be it.
> 
> If that's the case however, and given that the multigrain timestamp work
> is effectively dead, then I don't see an alternative to growing the on-
> disk inode. Do you?

Yes, I do see alternatives. That's what I've been trying
(unsuccessfully) to describe and get consensus on. I feel like I'm
being ignored and rail-roaded here, because nobody is even
acknowledging that I'm proposing alternatives and keeps insisting
that the only solution is a change of on-disk format.

So, I'll summarise the situation *yet again* in the hope that this
time I won't get people arguing about atime vs i-version and what
constitutes an on-disk format change because that goes nowhere and
does nothing to determine which solution might be acceptible.

The basic situation is this:

If XFS can ignore relatime or lazytime persistent updates for given
situations, then *we don't need to make periodic on-disk updates of
atime*. This makes the whole problem of "persistent atime update bumps
i_version" go away because then we *aren't making persistent atime
updates* except when some other persistent modification that bumps
[cm]time occurs.

But I don't want to do this unconditionally - for systems not
running anything that samples i_version we want relatime/lazytime
to behave as they are supposed to and do periodic persistent updates
as per normal. Principle of least surprise and all that jazz.

So we really need an indication for inodes that we should enable this
mode for the inode. I have asked if we can have per-operation
context flag to trigger this given the needs for io_uring to have
context flags for timestamp updates to be added. 

I have asked if we can have an inode flag set by the VFS or
application code for this. e.g. a flag set by nfsd whenever it accesses a
given inode.

I have asked if this inode flag can just be triggered if we ever see
I_VERSION_QUERIED set or statx is used to retrieve a change cookie,
and whether this is a reliable mechanism for setting such a flag.

I have suggested mechanisms for using masked off bits of timestamps
to encode sub-timestamp granularity change counts and keep them
invisible to userspace and then not using i_version at all for XFS.
This avoids all the problems that the multi-grain timestamp
infrastructure exposed due to variable granularity of user visible
timestamps and ordering across inodes with different granularity.
This is potentially a general solution, too.

So, yeah, there are *lots* of ways we can solve this problem without
needing to change on-disk formats.

-Dave.
-- 
Dave Chinner
david@...morbit.com