[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20060914134831.GE28663@openx1.frec.bull.fr>
Date: Thu, 14 Sep 2006 15:48:31 +0200
From: Alexandre Ratchov <alexandre.ratchov@...l.net>
To: Andreas Dilger <adilger@...sterfs.com>
Cc: Trond Myklebust <trond.myklebust@....uio.no>,
linux-ext4@...r.kernel.org, nfsv4@...ux-nfs.org
Subject: Re: rfc: [patch] change attribute for ext3
On Thu, Sep 14, 2006 at 03:23:18AM -0600, Andreas Dilger wrote:
> On Sep 13, 2006 20:30 +0200, Alexandre Ratchov wrote:
> > On Wed, Sep 13, 2006 at 02:11:11PM -0400, Trond Myklebust wrote:
> > > On Wed, 2006-09-13 at 18:42 +0200, Alexandre Ratchov wrote:
> > > > the change attribute is a simple counter that is reset to zero on
> > > > inode creation and that is incremented every time the inode data is
> > > > modified (similarly to the "ctime" time-stamp).
> > >
> > > I would really have preferred a full-blown 64-bit counter as per
> > > RFC3530, but I suppose we could always combine this change attribute
> > > with the high word from ctime in order to make up the NFSv4 change
> > > attribute. That should keep us safe until someone develops a ramdisk
> > > with < 1 nsecond access time.
> >
> > do you mean something like "(ctime.tv_sec << 32) | change_attribute"? this
> > would allow 2^32 inode changes per second.
>
> It might be preferrable, since we are depending on the ctime here anyways,
> is to combine this with the nsec-resolution ctime, and kill two birds with
> one field in the inode.
>
> The implementation would be to update the ctime+nsec field as normal, but
> in the unlikely case that both the second+nsec ctime is the same as before
> the nsec value would be incremented by 1. This could happen in case of
> low-resolution kernel timers, and would also handle the future case where
> the inode is modified more than once in the same nanosecond.
>
> The other benefit is that it allows comparisons between two different
> inodes to be more meaningful, instead of just using the seconds + random
> version number.
>
> It would be possible/desirable to make the nsec ctime field be part of the
> small inode (using the proposed reserved field) instead of the large inode,
> since that is a requirement for working with existing ext3 filesystems. The
> previous nsec timestamp patch would only need trivial modifications to make
> this work, just #define i_ctime_extra to be l_i_reserved1 I believe.
>
there is something i dislike with incrementing the nsec value. The ctime is
a global (as opposed to per-inode) time reference for the file-system. And
it is expected to be globally coherent; imagine the following situation:
Within the same time-slice (with time-stamp T0, in nanoseconds), we do the
following in this order:
change file1 -> ctime = T0
change file2 -> ctime = T0
change file2 -> ctime = T0 + 1
change file2 -> ctime = T0 + 2
change file1 -> ctime = T0 + 1
so it appears that file2 is strictly newer than file1, which is false. So
the assumption "if ctime(file1) < ctime(file2) then file2 is newer that
file1" is no longer true.
In order to fix this, we'll need to increment a global counter, not a
pre-inode counter. It's feasable.
cheers,
-- Alexandre
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists