Message-Id: <1161707186.20134.26.camel@kleikamp.austin.ibm.com>
Date: Tue, 24 Oct 2006 11:26:26 -0500
From: Dave Kleikamp <shaggy@...tin.ibm.com>
To: David Chinner <dgc@....com>
Cc: Jeff Garzik <jeff@...zik.org>, Alex Tomas <alex@...sterfs.com>,
Theodore Tso <tytso@....edu>, Jan Kara <jack@...e.cz>,
linux-fsdevel@...r.kernel.org, linux-ext4@...r.kernel.org
Subject: Re: [RFC] Ext3 online defrag
On Wed, 2006-10-25 at 02:01 +1000, David Chinner wrote:
> On Tue, Oct 24, 2006 at 09:51:41AM -0500, Dave Kleikamp wrote:
> > On Tue, 2006-10-24 at 23:59 +1000, David Chinner wrote:
> > > That's the wrong way to look at it. If you want the userspace
> > > process to specify a location, then you should preallocate it first
> > > before doing anything else. There is no need to clutter a simple
> > > data mover interface with all sorts of unnecessary error handling.
> >
> > You are implying that the 2-step interface, creating a new inode then
> > swapping the contents, is the only way to implement this.
>
> No, it's not the only way to implement it, but it seems the cleanest
> way to me when you have to consider crash recovery. With a temporary
> inode, you can create it, hold a reference and then unlink it so
> that any crash at that point will free the inode and any extents
> it has on it.
>
> The only way I can see anything different working is having the
> filesystem hold extents somewhere internally that provides us the
> same recovery guarantees while we copy the data and insert the new
> extents. This is obviously a filesystem specific solution and is
> more complex to implement than a swap extent transaction. It
> probably also needs on-disk format changes to support properly....
This is definitely filesystem-dependent. I would think allocating an
extent would be like any other allocation done by the filesystem, and
there are already recovery mechanisms for that.
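
Just to make the two-step flow concrete, here is a rough userspace
sketch.  The swap ioctl name below is made up, the preallocation and
copy steps are only stubbed in as comments, and error handling is
omitted; it's an illustration, not a proposal:

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

/* hypothetical per-fs ioctl that atomically swaps the extents of the
 * file behind the first fd with those of the donor fd passed as arg */
#define FS_IOC_SWAP_EXTENTS	_IOW('f', 99, int)

int defrag_file(const char *path)
{
	int src = open(path, O_RDWR);
	int tmp = open(".defrag_tmp", O_RDWR | O_CREAT | O_EXCL, 0600);

	/* unlink immediately: a crash at any later point just frees
	 * the temporary inode and whatever extents it holds */
	unlink(".defrag_tmp");

	/* 1. preallocate contiguous space on tmp (fs-specific)  */
	/* 2. copy the data from src to tmp, then fsync(tmp)     */
	/* 3. commit: atomically swap the extents of src and tmp */
	ioctl(src, FS_IOC_SWAP_EXTENTS, tmp);

	close(tmp);
	close(src);
	return 0;
}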
> > > Once you've separated the destination allocation from the data
> > > mover, the mover is basically a splice copy from source to
> > > destination, an fsync and then an atomic swap blocks/extents operation.
> > > Most of this code is generic, and a per-fs swap-extents vector
> > > could be easily provided for the one bit that is not....
> >
> > The benefit of having such a simple data mover is negated by moving the
> > complexity into the allocator.
>
> What complexity does it introduce that the allocator doesn't already
> have or needs to provide for the single call interface to work?
I don't see it as any more or less complex than a single interface.
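
For what it's worth, the per-fs piece of the data mover could be as
small as a single method.  Something like this (purely hypothetical,
not code in any tree today):

/* sketch of a per-fs hook the generic mover could call after the
 * splice copy and fsync; must swap the mapping atomically with
 * respect to crashes */
struct defrag_operations {
	int (*swap_extents)(struct inode *inode, struct inode *donor,
			    loff_t offset, loff_t len);
};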
> > A single interface that would move a part of a file at a time has the
> > advantage that a large file which is only fragmented in a few areas does
> > not need to be completely moved.
>
> And the two-step process can do exactly this as well - splice can
> work on any offset within the file...
I wasn't aware of that. That makes your proposal sound a lot better.
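
For anyone following along, an offset-based file-to-file copy with
splice has to bounce through a pipe.  A rough sketch, with error
handling trimmed down:

#define _GNU_SOURCE
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/* copy 'len' bytes from 'src' at 'src_off' to 'dst' at 'dst_off';
 * splice() needs a pipe in the middle for file-to-file copies */
static int splice_copy(int src, loff_t src_off,
		       int dst, loff_t dst_off, size_t len)
{
	int pfd[2];
	int ret = -1;

	if (pipe(pfd) < 0)
		return -1;

	while (len > 0) {
		/* pull a chunk of the source range into the pipe... */
		ssize_t in = splice(src, &src_off, pfd[1], NULL, len, 0);
		if (in <= 0)
			goto done;
		len -= in;
		/* ...and push it back out at the destination offset */
		while (in > 0) {
			ssize_t out = splice(pfd[0], NULL, dst,
					     &dst_off, in, 0);
			if (out <= 0)
				goto done;
			in -= out;
		}
	}
	ret = 0;
done:
	close(pfd[0]);
	close(pfd[1]);
	return ret;
}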
> > > The allocation interface, OTOH, is anything but simple and is really
> > > a filesystem specific interface. Seems logical to me to separate
> > > the two.
> >
> > So what then is the benefit of having a simple generic data mover if
> > every file system needs to implement its own interface to allocate a
> > copy of the data?
>
> I assume you meant "....allocate the space to store the copy of the data."
Yeah.
> The allocation interface needs to be able to be extended
> independently of the data mover interface. XFS already exposes
> allocation ioctls to userspace for preallocation and we've got plans
> to extend this further to allow userspace-controlled allocation for
> smart defrag tools for XFS. Tying allocation to the data mover
> just makes the interface less flexible and harder to do anything
> smart with....
Okay. It would be nice to standardize the interface so we don't have
every filesystem introducing new ioctls.
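
For reference, the existing XFS preallocation ioctl you mention looks
roughly like this from userspace (from memory; the struct and ioctl
definitions live in the xfsprogs headers, so treat this as a sketch):

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <xfs/xfs_fs.h>		/* from xfsprogs */

/* reserve (preallocate) 'len' bytes starting at offset 0 */
static int xfs_prealloc(int fd, off_t len)
{
	xfs_flock64_t fl = { 0 };

	fl.l_whence = SEEK_SET;
	fl.l_start  = 0;
	fl.l_len    = len;

	return ioctl(fd, XFS_IOC_RESVSP64, &fl);
}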
> Cheers,
>
> Dave.
--
David Kleikamp
IBM Linux Technology Center