[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <AC64AEB4-4090-4E73-A53C-2ACF94B49AD5@mac.com>
Date: Sun, 16 Sep 2007 03:07:11 -0400
From: Kyle Moffett <mrmacman_g4@....com>
To: Andreas Dilger <adilger@...sterfs.com>
Cc: Evgeniy Polyakov <johnpol@....mipt.ru>,
Jeff Garzik <jeff@...zik.org>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: Distributed storage. Move away from char device ioctls.
On Sep 15, 2007, at 13:24:46, Andreas Dilger wrote:
> On Sep 15, 2007 16:29 +0400, Evgeniy Polyakov wrote:
>> Yes, block device itself is not able to scale well, but it is the
>> place for redundancy, since filesystem will just fail if
>> underlying device does not work correctly and FS actually does not
>> know about where it should place redundancy bits - it might happen
>> to be the same broken disk, so I created a low-level device which
>> distribute requests itself.
>
> I actually think there is a place for this - and improvements are
> definitely welcome. Even Lustre needs block-device level
> redundancy currently, though we will be working to make Lustre-
> level redundancy available in the future (the problem is WAY harder
> than it seems at first glance, if you allow writeback caches at the
> clients and servers).
I really think that to get proper non-block-device-level filesystem
redundancy you need to base it on something similar to the GIT
model. Data replication is done in specific-sized chunks indexed by
SHA-1 sum and you actually have a sort of "merge algorithm" for when
local and remote changes differ. The OS would only implement a very
limited list of merge algorithms, IE one of:
(A) Don't merge, each client gets its own branch and merges are manual
(B) Most recent changed version is made the master every X-seconds/
open/close/write/other-event.
(C) The tree at X (usually a particular client/server) is always
used as the master when there are conflicts.
This lets you implement whatever replication policy you want: You
can require that some files are replicated (cached) on *EVERY*
system, you can require that other files are cached on at least X
systems. You can say "this needs to be replicated on at least X% of
the online systems, or at most Y". Moreover, the replication could
be done pretty easily from userspace via a couple syscalls. You also
automatically keep track of history with some default purge policy.
The main point is that for efficiency and speed things are *not*
always replicated; this also allows for offline operation. You would
of course have "userspace" merge drivers which notice that the tree
on your laptop is not a subset/superset of the tree on your desktop
and do various merges based on per-file metadata. My address-book,
for example, would have a custom little merge program which knows
about how to merge changes between two address book files, asking me
useful questions along the way. Since a lot of this merging is
mechanical, some of the code from GIT could easily be made into a
"merge library" which knows how to do such things.
Moreover, this would allow me to have a "shared" root filesystem on
my laptop and desktop. It would have 'sub-project'-type trees, so
that "/" would be an independent branch on each system. "/etc" would
be separate branches but manually merged git-style as I make
changes. "/home/*" folders would be auto-created as separate
subtrees so each user can version their own individually. Specific
subfolders (like address-book, email, etc) would be adjusted by the
GUI programs that manage them to be separate subtrees with manual-
merging controlled by that GUI program.
Backups/dumps/archival of such a system would be easy. You would
just need to clone the significant commits/trees/etc to a DVD and
replace the old SHA-1-indexed objects to tiny "object-deleted" stubs;
to rollback to an archived version you insert the DVD, "mount" it
into the existing kernel SHA-1 index, and then mount the appropriate
commit as a read-only volume somewhere to access. The same procedure
would also work for wide-area-network backups and such.
The effective result would be the ability to do things like the
following:
(A) Have my homedir synced between both systems mostly-
automatically as I make changes to different files on both systems
(B) Easily have 2 copies of all my files, so if one system's disk
goes kaput I can just re-clone from the other.
(C) Keep archived copies of the last 5 years worth of work,
including change history, on a stack of DVDs.
(D) Synchronize work between locations over a relatively slow
link without much work.
As long as files were indirectly indexed by sub-block SHA1 (with the
index depth based on the size of the file), and each individually-
SHA1-ed object could have references, you could trivially have a 4TB-
sized file where you modify 4 bytes at a thousand random locations
throughout the file and only have to update about 5MB worth of on-
disk data. The actual overhead for that kind of operation under any
existing filesystem would be 100% seek-dominated regardless whereas
with this mechanism you would not directly be overwriting data and so
you could append all the updates as a single 5MB chunk. Data reads
would be much more seek-y, but you could trivially have an on-line
defragmenter tool which notices fragmented commonly-accessed inode
objects and creates non-fragmented copies before deleting the old ones.
There's a lot of other technical details which would need resolution
in an actual implementation, but this is enough of a summary to give
you the gist of the concept. Most likely there will be some major
flaw which makes it impossible to produce reliably, but the concept
contains the things I would be interested in for a real "networked
filesystem".
Cheers,
Kyle Moffett
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists