Message-ID: <20170405172625.GB28681@fieldses.org>
Date: Wed, 5 Apr 2017 13:26:25 -0400
From: "J. Bruce Fields" <bfields@...ldses.org>
To: NeilBrown <neil@...wn.name>
Cc: Jeff Layton <jlayton@...hat.com>, Jan Kara <jack@...e.cz>,
Christoph Hellwig <hch@...radead.org>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-nfs@...r.kernel.org, linux-ext4@...r.kernel.org,
linux-btrfs@...r.kernel.org, linux-xfs@...r.kernel.org
Subject: Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

On Wed, Apr 05, 2017 at 11:43:32AM +1000, NeilBrown wrote:
> On Tue, Apr 04 2017, J. Bruce Fields wrote:
>
> > On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
> >> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
> >> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> >> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> >> > > > Because if above is acceptable we could make reported i_version to be a sum
> >> > > > of "superblock crash counter" and "inode i_version". We increment
> >> > > > "superblock crash counter" whenever we detect unclean filesystem shutdown.
> >> > > > That way after a crash we are guaranteed each inode will report new
> >> > > > i_version (the sum would probably have to look like "superblock crash
> >> > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> >> > > > i_version numbers we gave away but did not write to disk but still...).
> >> > > > Thoughts?
> >> >
> >> > How hard is this for filesystems to support? Do they need an on-disk
> >> > format change to keep track of the crash counter? Maybe not, maybe the
> >> > high bits of the i_version counters are all they need.
> >> >
> >>
> >> Yeah, I imagine we'd need an on-disk change for this unless there's
> >> something already present that we could use in place of a crash counter.
> >
> > We could consider using the current time instead. So, put the current
> > time (or time of last boot, or this inode's ctime, or something) in the
> > high bits of the change attribute, and keep the low bits as a counter.
>
> This is a very different proposal.
> I don't think Jan was suggesting that the i_version be split into two
> bit fields, one the change-counter and one the crash-counter.
> Rather, the crash-counter was multiplied by a large-number and added to
> the change-counter with the expectation that while not every
> change-counter landed on disk, at least 1 in every large-number would.
> So after each crash we effectively add large-number to the
> change-counter, and can be sure that number hasn't been used already.

I was sort of ignoring the distinction between concatenate(A,B) and
A*m+B, but, sure, multiplying's probably better.

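
To make the distinction concrete, here is a minimal sketch in Python of the two ways of combining a crash counter A with a per-inode counter B. The function names are hypothetical, the multiplier 65536 is just the figure from Jan's example, and the 16-bit low field in the concatenation is an assumption chosen to match it. None of this is taken from the patch series.

```python
MASK64 = (1 << 64) - 1   # i_version / change attribute are 64 bits wide
LARGE_NUMBER = 65536     # multiplier from Jan's example (an assumption here)


def reported_i_version(crash_counter, inode_i_version, large_number=LARGE_NUMBER):
    """A*m + B: the inode counter may grow past m and carry upward."""
    return (crash_counter * large_number + inode_i_version) & MASK64


def concatenated(crash_counter, inode_i_version, low_bits=16):
    """concatenate(A, B): B is confined to the low bits and wraps when full."""
    low_mask = (1 << low_bits) - 1
    return ((crash_counter << low_bits) | (inode_i_version & low_mask)) & MASK64


# The counter reached 140 in memory, but only 100 made it to disk before a crash.
handed_out_before_crash = reported_i_version(0, 140)
first_value_after_crash = reported_i_version(1, 100)
assert first_value_after_crash > handed_out_before_crash

# Once the inode counter exceeds the low field, concatenation reuses old values,
# while the multiplied form keeps increasing.
assert concatenated(0, 70000) == concatenated(0, 70000 - 65536)
assert reported_i_version(0, 70000) > reported_i_version(0, 65535)
```

The multiplied form only stays ahead of values already handed out if fewer than large-number increments were lost in the crash, which is the expectation Neil describes above.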
> To store the crash-counter in each inode (which does appeal) you would
> need to be able to remove it before adding the new crash counter, and
> that requires bit-fields. Maybe there are enough bits.

i_version and the NFSv4 change attribute are 64 bits, which gives us a
fair amount of flexibility.

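
As a rough illustration of the bit-field variant, the sketch below packs a crash counter into the high bits of a 64-bit value and shows how it could be stripped and replaced. The 16/48 split and all names are assumptions made for the example; nothing in the thread fixes a layout.

```python
CRASH_BITS = 16                      # assumed width of the crash-counter field
COUNTER_BITS = 64 - CRASH_BITS       # remaining 48 bits for the change counter
CRASH_MASK = (1 << CRASH_BITS) - 1
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def pack(crash_counter, change_counter):
    """Place the crash counter in the high bits, the change counter in the low bits."""
    return ((crash_counter & CRASH_MASK) << COUNTER_BITS) | (change_counter & COUNTER_MASK)


def unpack(value):
    """Split a packed 64-bit value back into (crash_counter, change_counter)."""
    return (value >> COUNTER_BITS) & CRASH_MASK, value & COUNTER_MASK


def restamp(value, new_crash_counter):
    """Remove the old crash counter and store a new one, keeping the change counter."""
    _, change_counter = unpack(value)
    return pack(new_crash_counter, change_counter)


v = pack(3, 1000)
assert unpack(restamp(v, 4)) == (4, 1000)
```

With this split the change counter wraps after 2**48 increments and the crash counter after 2**16 crashes; a different split trades one limit against the other.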
> If you want to ensure read-only files can remain cached over a crash,
> then you would have to mark a file in some way on stable storage
> *before* allowing any change.
> e.g. you could use the lsb. Odd i_versions might have been changed
> recently and crash-count*large-number needs to be added.
> Even i_versions have not been changed recently and nothing need be
> added.
>
> If you want to change a file with an even i_version, you subtract
> crash-count*large-number
> from the i_version, then set lsb. This is written to stable storage before
> the change.
>
> If a file has not been changed for a while, you can add
> crash-count*large-number
> and clear lsb.
>
> The lsb of the i_version would be for internal use only. It would not
> be visible outside the filesystem.
>
> It feels a bit clunky, but I think it would work and is the best
> combination of Jan's idea and your requirement.
> The biggest cost would be switching to 'odd' before any changes, and the
> unknown is when does it make sense to switch to 'even'.

I'm not sure how to model the costs. Something like that might work.

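
For reference, the odd/even scheme quoted above could be sketched roughly as follows. This is only an in-memory model under stated assumptions: the stored value keeps the internal flag in bit 0 and the externally visible counter in the upper 63 bits, the multiplier is 65536 as in Jan's example, and all function names are hypothetical. It does not model the stable-storage write that has to happen before the first change.

```python
LARGE_NUMBER = 65536           # assumed multiplier, as in Jan's example
VALUE_MASK = (1 << 63) - 1     # 63 visible bits; bit 0 of the raw value is internal


def visible(raw, crash_count):
    """Value reported outside the filesystem (the lsb is never exposed)."""
    value = raw >> 1
    if raw & 1:  # odd: possibly changed recently, so add the crash offset
        value = (value + crash_count * LARGE_NUMBER) & VALUE_MASK
    return value


def make_odd(raw, crash_count):
    """Before the first change: subtract the offset and set the lsb."""
    if raw & 1:
        return raw
    value = ((raw >> 1) - crash_count * LARGE_NUMBER) & VALUE_MASK
    return (value << 1) | 1


def make_even(raw, crash_count):
    """After a quiet period: fold the offset back in and clear the lsb."""
    if not raw & 1:
        return raw
    value = ((raw >> 1) + crash_count * LARGE_NUMBER) & VALUE_MASK
    return value << 1


def bump(raw):
    """Record one change; only valid once the inode has been made odd."""
    if not raw & 1:
        raise ValueError("inode must be switched to odd before it is changed")
    return ((((raw >> 1) + 1) & VALUE_MASK) << 1) | 1


cold = 500 << 1                        # unchanged file, even
hot = bump(make_odd(500 << 1, 2))      # changed once while crash_count == 2
assert visible(hot, 2) == 501
# After a crash (crash_count becomes 3) only the odd inode's value jumps.
assert visible(cold, 3) == 500
assert visible(hot, 3) == 501 + LARGE_NUMBER
```

Switching between odd and even leaves the visible value unchanged, so clients only see a jump for inodes that were odd when a crash happened. When to call make_even is the open question raised above.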
--b.