Message-ID: <fe13642a39160cc7dd15f9212e1afb69e955c0be.camel@kernel.org>
Date: Tue, 30 Aug 2022 07:20:50 -0400
From: Jeff Layton <jlayton@...nel.org>
To: Dave Chinner <david@...morbit.com>
Cc: Amir Goldstein <amir73il@...il.com>,
Trond Myklebust <trondmy@...merspace.com>,
"djwong@...nel.org" <djwong@...nel.org>,
"zohar@...ux.ibm.com" <zohar@...ux.ibm.com>,
"brauner@...nel.org" <brauner@...nel.org>,
"xiubli@...hat.com" <xiubli@...hat.com>,
"neilb@...e.de" <neilb@...e.de>,
"linux-api@...r.kernel.org" <linux-api@...r.kernel.org>,
"linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>,
"dwysocha@...hat.com" <dwysocha@...hat.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"chuck.lever@...cle.com" <chuck.lever@...cle.com>,
"linux-nfs@...r.kernel.org" <linux-nfs@...r.kernel.org>,
"tytso@....edu" <tytso@....edu>,
"viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
"jack@...e.cz" <jack@...e.cz>,
"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
"linux-btrfs@...r.kernel.org" <linux-btrfs@...r.kernel.org>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"lczerner@...hat.com" <lczerner@...hat.com>,
"adilger.kernel@...ger.ca" <adilger.kernel@...ger.ca>,
"ceph-devel@...r.kernel.org" <ceph-devel@...r.kernel.org>
Subject: Re: [PATCH v3 4/7] xfs: don't bump the i_version on an atime update
in xfs_vn_update_time
On Tue, 2022-08-30 at 10:08 +1000, Dave Chinner wrote:
> On Mon, Aug 29, 2022 at 06:33:48AM -0400, Jeff Layton wrote:
> > On Mon, 2022-08-29 at 15:48 +1000, Dave Chinner wrote:
> > > >
> > > > The race window ought to be relatively small, and this wouldn't result
> > > > in incorrect behavior that you'd notice (other than loss of
> > > > performance), but it's not ideal. We're doing more on-the-wire reads
> > > > than are necessary in this case.
> > > >
> > > > It would be nice to have it not do that. If we end up taking this patch
> > > > to make it elide the i_version bumps on atime updates, we may be able to
> > > > set the NOIVER flag in other cases as well, and avoid some of these
> > > > extra bumps.
> > >
> > > <sigh>
> > >
> > > Please don't make me repeat myself for the third time.
> > >
> > > Once we have decided on a solid, unchanging definition for the
> > > *statx user API variable*, we'll implement a new on-disk field that
> > > provides this information. We will document it in the on-disk
> > > specification as "this is how di_iversion behaves" so that it is
> > > clear to everyone parsing the on-disk format or writing their own
> > > XFS driver how to implement it and when to expect it to
> > > change.
> > >
> > > Then we can add filesystem and inode feature flags that say "inode
> > > has new iversion" and we use that to populate the kernel iversion
> > > instead of di_changecount. We keep di_changecount exactly the way it
> > > is now for the applications and use cases we already have for that
> > > specific behaviour. If the kernel and/or filesystem don't support
> > > the new di_iversion field, then we'll use di_changecount as it
> > > currently exists for the kernel iversion code.
> > >
> >
> > Aside from NFS and IMA, what applications are dependent on the current
> > definition and how do they rely on i_version today?
>
> I've answered this multiple times already: the di_changecount
> behaviour is defined in the on-disk specification and hence we
> *cannot change the behaviour* without changing the on-disk format
> specification.
>
> Apart from the forensics aspect of the change counter (nobody but
> us XFS developers seems to understand just how damn important this
> is), there are *many* third-party applications that parse the
> XFS on-disk format directly. This:
>
> https://codesearch.debian.net/search?q=XFS_SB_VERSION_DIRV2&literal=1
>
> Shows grub2, libparted, syslinux, partclone and fsarchiver as
> knowing about XFS on-disk superblock flags that tell them what
> format the directory structure is in. That alone is enough to
> indicate they parse on-disk inodes directly, and hence may expect
> di_changecount to have specific meaning and use it to detect
> unexpected changes to files/directories they care about.
>
> If I go looking for XFS_SB_MAGIC, I find things like libblkid,
> klibc, qemu, Xen, testdisk, gpart, and virtualbox all parse the
> on-disk superblocks directly from the block device, too. They also
> rely directly on XFS developers ensuring there are no silent
> incompatible changes to the on-disk format.
>
> I also know of many other utilities that people and companies have
> written that parse the on-disk format directly from userspace. These
> include low-level storage management tools, tools for copying and
> managing disk images (e.g. offline configuration for cluster
> deployments), data recovery tools that scrape all the data out of
> broken filesystems, etc.
>
> These applications rely on the guarantee we provide that the
> on-disk format will not silently change and that behaviour/structure
> can always easily be discovered by feature flags in the superblock
> and/or inodes.
>
> IOWs, just because there aren't obvious "traditional" applications
> on top of the kernel filesystem that consume the in-memory kernel
> iversion field, it does not mean that the defined behaviour of the
> on-disk di_changecount field is not used or relied on by other tools
> that work directly on the on-disk format.
>
> You might be right that NFS doesn't care about this, but the point
> remains that NFS does not control the XFS on-disk format, nor does
> the fact that what NFS wants from the change attribute has changed
> over time override the fact that maintaining XFS on-disk format
> compatibility is the responsibility of XFS developers. We're willing
> to change the on-disk format to support whatever the new definition
> of the statx change attribute ends up being, and that should be the
> end of the discussion.
>
Thanks for spelling this out in more detail.
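
If I'm following the plan correctly, both the kernel and any external
parser gate on feature bits before trusting the new field, while
di_changecount keeps its current behaviour everywhere else. Here's a
toy sketch of how I picture the selection logic (every name and flag
bit below is made up for illustration; none of this is a proposal for
the actual on-disk format):

#include <stdint.h>

/* All names here are illustrative placeholders, not the on-disk spec. */
#define XFS_FEAT_NEW_IVERSION    (1ULL << 0)   /* fs-wide feature bit */
#define XFS_DIFLAG2_NEW_IVERSION (1ULL << 0)   /* per-inode flag */

struct sketch_inode {
        uint64_t di_flags2;             /* per-inode flags */
        uint64_t di_changecount;        /* existing counter, unchanged */
        uint64_t di_new_iversion;       /* proposed statx change attribute */
};

/* Pick the value that backs the kernel's in-memory i_version. */
static uint64_t
change_attr(const struct sketch_inode *ino, uint64_t fs_features)
{
        if ((fs_features & XFS_FEAT_NEW_IVERSION) &&
            (ino->di_flags2 & XFS_DIFLAG2_NEW_IVERSION))
                return ino->di_new_iversion;

        /* Old kernel or old fs: keep today's di_changecount behaviour. */
        return ino->di_changecount;
}

That way nothing that parses today's format ever sees di_changecount
change meaning out from under it.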
> > > Keep in mind that we've been doing dynamic inode format updates in
> > > XFS for a couple of decades - users don't even have to be aware that
> > > they need to perform format upgrades because often they just happen
> > > whenever an inode is accessed. IOWs, just because we have to change
> > > the on-disk format to support this new iversion definition, it
> > > doesn't mean users have to reformat filesystems before the new
> > > feature can be used.
> > >
> > > Hence, over time, as distros update kernels, the XFS iversion
> > > behaviour will change automagically: inodes in existing filesystems
> > > will be updated as they are accessed to add and then use the new
> > > di_iversion field for the VFS change attribute instead of the
> > > di_changecount field...
> > >
> >
> > If you want to create a whole new on-disk field for this, then that's
> > your prerogative, but before you do that, I'd like to better understand
> > why and how the constraints on this field changed.
> >
> > The original log message from the commit that added a change counter
> > (below) stated that you were adding it for network filesystems like NFS.
> > When did this change and why?
>
> It never changed. I'll repeat what I've already explained twice
> before:
>
> https://lore.kernel.org/linux-xfs/20220818030048.GE3600936@dread.disaster.area/
> https://lore.kernel.org/linux-xfs/20220818033731.GF3600936@dread.disaster.area/
>
> tl;dr: NFS requirements were just one of *many* we had at the time
> for an atomic persistent change counter.
>
> The fact is that NFS users are just going to have to put up with
> random cache invalidations on XFS for a while longer. Nobody noticed
> or cared about this enough to raise it as an issue for the
> past decade, so waiting another few months for upstream XFS to
> change to a different on-disk format for the NFS/statx change
> attribute isn't a big deal.
>
Fair enough. I'll plan to drop this patch from the series for now, with
the expectation that you guys will add a new i_version counter that
better conforms to what NFS and IMA need.
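
For anyone following along: all NFS and IMA really need is an opaque
value they can compare for equality across fetches, one that is
guaranteed to differ whenever data or non-atime metadata has changed.
Every spurious bump turns into a needless cache invalidation. A
minimal sketch of the consumer side, just to show how little is asked
of the counter (names are illustrative):

#include <stdbool.h>
#include <stdint.h>

struct cached_object {
        uint64_t change_attr;   /* value seen at the last revalidation */
        /* ... cached pages, attributes, etc. ... */
};

/*
 * Compare-for-equality is the whole contract: if the counter moved,
 * toss the cache; if it didn't, keep it. A counter that bumps on
 * atime-only updates forces needless re-reads here.
 */
static bool
cache_still_valid(struct cached_object *obj, uint64_t server_change_attr)
{
        if (obj->change_attr == server_change_attr)
                return true;            /* unchanged: keep the cache */

        obj->change_attr = server_change_attr;
        return false;                   /* changed: drop and re-fetch */
}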
Thanks,
--
Jeff Layton <jlayton@...nel.org>