Message-ID: <20091123164445.GB3292@fieldses.org>
Date: Mon, 23 Nov 2009 11:44:45 -0500
From: "J. Bruce Fields" <bfields@...ldses.org>
To: tytso@....edu
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: i_version, NFSv4 change attribute

On Mon, Nov 23, 2009 at 06:48:31AM -0500, tytso@....edu wrote:
> On Sun, Nov 22, 2009 at 05:20:47PM -0500, J. Bruce Fields wrote:
> > However, the new i_version support is available only when the filesystem
> > is mounted with the i_version mount option. And the change attribute is
> > required for completely correct NFSv4 operation, which we'd prefer to
> > offer by default!
> >
> > I recall having a conversation with Ted Ts'o about ways to do this
> > without affecting non-NFS-exported filesystems: maybe by providing some
> sort of persistent flag on the superblock which the nfsd code could turn
> > on automatically if desired?
> >
> > But first it would be useful to know whether there is in fact any
> > disadvantage to just having the i_version on all the time. Jean Noel
> > Cordenner did some tests a couple years ago:
> >
> > http://www.bullopensource.org/ext4/change_attribute/index.html
> >
> > and didn't find any significant difference. But I don't know if those
> > results were convincing (I don't understand what we're looking for
> > here); any suggestions for workloads that would exercise the worst
> > cases?
>
> Hmmm.... the workload that would probably take the biggest hit would be
> one where we have multiple processes/threads running on different CPUs
> that are modifying the inode in parallel. (i.e., the sort of workload
> that a database with multiple clients would naturally see).
Got it, thanks. Is there an existing easy-to-setup workload I could
start with, or would it be sufficient to try the simplest possible code
that met the above description? (E.g., fork a process for each cpu,
each just overwriting byte 0 as fast as possible, and count total writes
performed per second?)
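
For concreteness, something like this completely untested sketch is
what I had in mind (the ten-second run length and the single target
file are arbitrary):

#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define SECONDS 10

int main(int argc, char **argv)
{
        int ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        int pipefd[2];
        long total = 0;
        int i;

        if (argc < 2 || pipe(pipefd)) {
                fprintf(stderr, "usage: %s <file on fs under test>\n",
                        argv[0]);
                return 1;
        }
        for (i = 0; i < ncpus; i++) {
                if (fork() == 0) {
                        int fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
                        time_t end = time(NULL) + SECONDS;
                        long count = 0;

                        /* overwrite byte 0 as fast as possible */
                        while (fd >= 0 && time(NULL) < end) {
                                if (pwrite(fd, "x", 1, 0) != 1)
                                        break;
                                count++;
                        }
                        /* per-child counts are pipe-atomic (< PIPE_BUF) */
                        write(pipefd[1], &count, sizeof(count));
                        _exit(0);
                }
        }
        for (i = 0; i < ncpus; i++) {
                long count = 0;

                read(pipefd[0], &count, sizeof(count));
                total += count;
                wait(NULL);
        }
        printf("%ld writes in %d seconds (%ld/sec)\n",
               total, SECONDS, total / SECONDS);
        return 0;
}

Running that against the same file on a filesystem mounted with and
then without -o i_version and comparing the totals would at least
bound the worst case.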
> The test which Bull did above used a workload which was very heavy on
> creating and destroying files, and it was only a two processor system
> (not that it mattered; it looks like the fileop benchmark was
> single-threaded anyway). The test I would do is something like a 4 or
> 8 processor test, with lots of parallel I/O to the same file (at which
> point we would probably end up bottlenecking on inode->i_lock).
>
> It would seem to me that a simple way of fixing this would be to use
> atomic64 type for inode->i_version, so we don't have to take the
> spinlock and bounce cache lines each time i_version gets updated.
The current locking does seem like overkill.
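If I'm reading include/linux/fs.h right, inode_inc_iversion() today
takes i_lock around the bump; the atomic64 variant would be roughly
(sketch only, assuming i_version were redeclared as atomic64_t):

/* today */
static inline void inode_inc_iversion(struct inode *inode)
{
        spin_lock(&inode->i_lock);
        inode->i_version++;
        spin_unlock(&inode->i_lock);
}

/* with i_version as atomic64_t: no spinlock round trip */
static inline void inode_inc_iversion(struct inode *inode)
{
        atomic64_inc(&inode->i_version);
}

(Readers such as nfsd would then use atomic64_read() rather than
dereferencing the u64 directly.)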
> What we might decide, at the end of the day, is that for common fs
> workloads no one is going to notice, and for the parallel intensive
> workloads (i.e., databases), people will be tuning for this anyway, so
> we can make i_version the default, and noi_version an option people
> can use to turn off i_version if they are optimizing for the database
> workload.
>
> A similar tuning knob that I should add is one that allows us to set
> a custom value for sb->s_time_gran, so that we don't have to dirty the
> inode and engage the jbd2 machinery after *every* single write. Once
> I add that, or if you use i_version on a file system with a 128-byte
> inode so the mtime update granularity is a second, I suspect the cost
> of i_version will be especially magnified, and the database people
> will very much want to turn off i_version.
>
> And that brings up another potential compromise --- what if we only
> update i_version every n milliseconds? That way if the file is being
> modified to the tune of hundreds of thousands of updates a second,
> NFSv4 clients will see a change fairly quickly, within n milliseconds,
> but we won't be incrementing i_version at incredibly high rates. I
> suspect that would violate NFSv4 protocol specs somewhere, but would
> it cause seriously noticeable breakage that would be user visible? If
> not, maybe that's something we should allow the user to set, perhaps
> not as the default? Just a thought.
That knob would affect the probability of the breakage, but not
necessarily the seriousness. The race is:
  - clock tick
        write
        read and check i_version
        write
  - clock tick
If the second write doesn't modify the i_version, the client will never
see it. (Unless a later write bumps i_version again.)
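For concreteness, I'm imagining the coalescing as something like the
following (i_version_jiffies is an invented field, and this rounds "n
milliseconds" down to one clock tick); the second write in the same
tick leaves i_version alone, which is exactly the hole above:

static inline void inode_inc_iversion_coalesced(struct inode *inode)
{
        spin_lock(&inode->i_lock);
        if (inode->i_version_jiffies != jiffies) {
                inode->i_version++;
                inode->i_version_jiffies = jiffies;
        }
        spin_unlock(&inode->i_lock);
}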
If the side we want to optimize is the modifications, I wonder if we
could do all the i_version increments on *read* of i_version?:
  - writes (and other inode modifications) set an "i_version_dirty"
    flag.
  - reads of i_version clear the i_version_dirty flag, increment
    i_version, and return the result.
As long as the reader sees i_version_dirty set only after it sees the
write that caused it, I think it all works?
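In pseudo-kernel terms (all names here invented), the two sides would
be roughly:

/* write side: on any inode modification, just flag the change */
static inline void inode_mark_iversion_dirty(struct inode *inode)
{
        /* I_VERSION_DIRTY is an invented flag bit; note plain
         * set_bit() doesn't order against the data write itself,
         * so the ordering question above would still need care */
        set_bit(I_VERSION_DIRTY, &inode->i_version_flags);
}

/* read side, e.g. nfsd fetching the change attribute */
static inline u64 inode_read_iversion(struct inode *inode)
{
        u64 ret;

        spin_lock(&inode->i_lock);
        if (test_and_clear_bit(I_VERSION_DIRTY, &inode->i_version_flags))
                inode->i_version++;
        ret = inode->i_version;
        spin_unlock(&inode->i_lock);
        return ret;
}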
--b.