linux-kernel - Re: Distributed storage. Move away from char device ioctls.

Date:	Sun, 16 Sep 2007 03:07:11 -0400
From:	Kyle Moffett <mrmacman_g4@....com>
To:	Andreas Dilger <adilger@...sterfs.com>
Cc:	Evgeniy Polyakov <johnpol@....mipt.ru>,
	Jeff Garzik <jeff@...zik.org>, netdev@...r.kernel.org,
	linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: Distributed storage. Move away from char device ioctls.

On Sep 15, 2007, at 13:24:46, Andreas Dilger wrote:
> On Sep 15, 2007  16:29 +0400, Evgeniy Polyakov wrote:
>> Yes, block device itself is not able to scale well, but it is the  
>> place for redundancy, since filesystem will just fail if  
>> underlying device does not work correctly and FS actually does not  
>> know about where it should place redundancy bits - it might happen  
>> to be the same broken disk, so I created a low-level device which  
>> distribute requests itself.
>
> I actually think there is a place for this - and improvements are  
> definitely welcome.  Even Lustre needs block-device level  
> redundancy currently, though we will be working to make Lustre- 
> level redundancy available in the future (the problem is WAY harder  
> than it seems at first glance, if you allow writeback caches at the  
> clients and servers).

I really think that to get proper non-block-device-level filesystem  
redundancy you need to base it on something similar to the GIT  
model.  Data replication is done in specific-sized chunks indexed by  
SHA-1 sum and you actually have a sort of "merge algorithm" for when  
local and remote changes differ.  The OS would only implement a very  
limited list of merge algorithms, i.e. one of:

(A)  Don't merge, each client gets its own branch and merges are manual
(B)  The most recently changed version is made the master every X seconds
or on every open/close/write/other event.
(C)  The tree at X (usually a particular client/server) is always  
used as the master when there are conflicts.
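
As a rough illustration of the chunk indexing and the three policies above, here is a minimal Python sketch.  It is not taken from GIT or from any existing kernel code: the 4096-byte chunk size, the names `MergePolicy`, `store_chunks` and `resolve`, and the use of a timestamp to decide "most recent" are all assumptions made for the example.

```python
import hashlib
from enum import Enum

CHUNK_SIZE = 4096  # assumed chunk size; the message does not specify one


class MergePolicy(Enum):
    MANUAL = "A"        # each client keeps its own branch, merges are manual
    MOST_RECENT = "B"   # the most recently changed version becomes the master
    FIXED_MASTER = "C"  # a designated tree always wins on conflict


def store_chunks(data: bytes, store: dict) -> list:
    """Split data into fixed-size chunks, index each by SHA-1, return the keys."""
    keys = []
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        key = hashlib.sha1(chunk).hexdigest()
        store[key] = chunk  # identical chunks collapse onto the same key
        keys.append(key)
    return keys


def resolve(policy, local, remote, master="remote"):
    """Pick a winner between two versions, each given as (keys, mtime).

    Returns the winning key list, or None when the policy requires a
    manual merge.
    """
    if local[0] == remote[0]:
        return local[0]  # nothing differs, nothing to merge
    if policy is MergePolicy.MANUAL:
        return None
    if policy is MergePolicy.MOST_RECENT:
        return local[0] if local[1] >= remote[1] else remote[0]
    return local[0] if master == "local" else remote[0]
```

Because chunks are keyed by their hash, two versions of a file that share a prefix also share the stored chunks for that prefix, which is what makes comparing and replicating versions cheap.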

This lets you implement whatever replication policy you want:  You  
can require that some files are replicated (cached) on *EVERY*  
system, you can require that other files are cached on at least X  
systems.  You can say "this needs to be replicated on at least X% of  
the online systems, or at most Y".  Moreover, the replication could  
be done pretty easily from userspace via a couple of syscalls.  You also
automatically keep track of history with some default purge policy.
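
To make the policy language concrete, here is a small Python sketch of how a userspace replicator might turn such a policy into a target copy count.  Everything in it is hypothetical: the `ReplicationPolicy` class, its field names, and the reading of "at least X%, or at most Y" as a percentage floor capped at Y copies are assumptions, not a proposed interface.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplicationPolicy:
    min_copies: int = 1                # "cached on at least X systems"
    min_fraction: float = 0.0          # "at least X% of the online systems"
    max_copies: Optional[int] = None   # "or at most Y"
    everywhere: bool = False           # "replicated on *EVERY* system"

    def target_copies(self, online: int) -> int:
        """Return how many of the `online` systems should hold a copy."""
        if self.everywhere:
            return online
        want = max(self.min_copies, math.ceil(self.min_fraction * online))
        if self.max_copies is not None:
            want = min(want, self.max_copies)
        return min(want, online)  # cannot place more copies than systems
```

For example, `ReplicationPolicy(min_fraction=0.5, max_copies=3).target_copies(10)` asks for half of ten systems but is capped at three copies.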

The main point is that for efficiency and speed things are *not*  
always replicated; this also allows for offline operation.  You would  
of course have "userspace" merge drivers which notice that the tree  
on your laptop is not a subset/superset of the tree on your desktop  
and do various merges based on per-file metadata.  My address-book,  
for example, would have a custom little merge program which knows  
about how to merge changes between two address book files, asking me  
useful questions along the way.  Since a lot of this merging is  
mechanical, some of the code from GIT could easily be made into a  
"merge library" which knows how to do such things.
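
A userspace merge driver of the address-book kind could be as simple as a three-way merge over entries.  The sketch below assumes an address book is just a mapping from name to contact details and that the common ancestor version is available; the function name and the data layout are made up for illustration.

```python
def merge_address_books(base, local, remote):
    """Three-way merge of name -> details mappings.

    Returns (merged, conflicts).  Entries changed differently on both
    sides are left out of `merged` and listed in `conflicts`, so that
    the caller can ask the user what to do with them.
    """
    merged, conflicts = {}, []
    for name in sorted(set(base) | set(local) | set(remote)):
        b, l, r = base.get(name), local.get(name), remote.get(name)
        if l == r:
            winner = l          # both sides agree (or both deleted it)
        elif l == b:
            winner = r          # only the remote side changed it
        elif r == b:
            winner = l          # only the local side changed it
        else:
            conflicts.append(name)
            continue
        if winner is not None:  # None means the entry was deleted
            merged[name] = winner
    return merged, conflicts
```

The mechanical cases (one side changed, the other did not) are resolved silently; only genuine conflicts reach the "asking me useful questions" stage.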

Moreover, this would allow me to have a "shared" root filesystem on  
my laptop and desktop.  It would have 'sub-project'-type trees, so  
that "/" would be an independent branch on each system. "/etc" would  
be separate branches but manually merged git-style as I make  
changes.  "/home/*" folders would be auto-created as separate  
subtrees so each user can version their own individually.  Specific  
subfolders (like address-book, email, etc) would be adjusted by the  
GUI programs that manage them to be separate subtrees with manual- 
merging controlled by that GUI program.

Backups/dumps/archival of such a system would be easy.  You would  
just need to clone the significant commits/trees/etc to a DVD and  
replace the old SHA-1-indexed objects with tiny "object-deleted" stubs;
to roll back to an archived version you insert the DVD, "mount" it
into the existing kernel SHA-1 index, and then mount the appropriate  
commit as a read-only volume somewhere to access.  The same procedure  
would also work for wide-area-network backups and such.
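
In Python terms, the archive-and-stub idea might look roughly like the following.  The object index is modelled as a plain dict keyed by SHA-1, and the `DELETED` marker, `archive_objects` and `read_object` are hypothetical names for this sketch only.

```python
DELETED = object()  # hypothetical "object-deleted" stub marker


def archive_objects(live: dict, keys, disc: dict) -> None:
    """Copy the named objects to the archive, then replace them with stubs."""
    for key in keys:
        if live[key] is not DELETED:
            disc[key] = live[key]
            live[key] = DELETED


def read_object(live: dict, key: str, mounted=()):
    """Look up an object, falling back to mounted archives for stubbed entries."""
    data = live[key]
    if data is not DELETED:
        return data
    for disc in mounted:
        if key in disc:
            return disc[key]
    raise KeyError(f"object {key} is archived and no mounted archive holds it")
```

Since objects are addressed by hash, "mounting" an archive only means adding another place to look keys up; nothing in the live index has to be rewritten.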

The effective result would be the ability to do things like the  
following:
   (A)  Have my homedir synced between both systems mostly- 
automatically as I make changes to different files on both systems
   (B)  Easily have 2 copies of all my files, so if one system's disk  
goes kaput I can just re-clone from the other.
   (C)  Keep archived copies of the last 5 years' worth of work,
including change history, on a stack of DVDs.
   (D)  Synchronize work between locations over a relatively slow  
link without much work.

As long as files were indirectly indexed by sub-block SHA-1 (with the
index depth based on the size of the file), and each individually
SHA-1-hashed object could reference other objects, you could trivially
have a 4TB file where you modify 4 bytes at a thousand random locations
throughout the file and only have to update about 5MB worth of on-disk
data.  That kind of operation would be 100% seek-dominated under any
existing filesystem, whereas with this mechanism you would not be
overwriting data in place, so you could append all the updates as a
single 5MB chunk.  Data reads would be much more seek-y, but you could
trivially have an on-line defragmenter tool which notices fragmented,
commonly-accessed inode objects and creates non-fragmented copies before
deleting the old ones.
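
The update cost can be estimated with a back-of-the-envelope calculation.  The sketch below assumes every object (data or index) has the same size, that an index object is simply a packed array of 20-byte SHA-1 references, and that each small write lands in a single data object; none of these layout details come from a real design.

```python
SHA1_BYTES = 20  # size of one SHA-1 reference


def dirty_bytes_upper_bound(file_size, block_size, n_writes):
    """Upper bound on bytes rewritten for n_writes small scattered writes.

    Each write dirties one data object plus every index object on the
    path up to the root; a level can never dirty more objects than it
    contains, so the upper levels are shared between writes.
    """
    fanout = block_size // SHA1_BYTES      # references per index object
    blocks = -(-file_size // block_size)   # objects at this level (ceiling)
    total = 0
    while True:
        total += min(n_writes, blocks) * block_size
        if blocks == 1:                    # reached the root
            break
        blocks = -(-blocks // fanout)      # objects one level up
    return total
```

With 1KB objects, `dirty_bytes_upper_bound(4 * 2**40, 1024, 1000)` comes to 4,760,576 bytes, in line with the "about 5MB" estimate; with 4KB objects the same call gives roughly 12.8MB, so the figure depends heavily on the object size chosen.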

There are a lot of other technical details which would need resolution
in an actual implementation, but this is enough of a summary to give  
you the gist of the concept.  Most likely there will be some major  
flaw which makes it impossible to produce reliably, but the concept  
contains the things I would be interested in for a real "networked  
filesystem".

Cheers,
Kyle Moffett