Message-ID: <20080613164110.GB26166@2ka.mipt.ru>
Date: Fri, 13 Jun 2008 20:41:12 +0400
From: Evgeniy Polyakov <johnpol@....mipt.ru>
To: linux-kernel@...r.kernel.org
Cc: netdev@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: [2/3] POHMELFS: Documentation.
Design notes, usage cases and protocol description.
Signed-off-by: Evgeniy Polyakov <johnpol@....mipt.ru>
diff --git a/Documentation/filesystems/pohmelfs/design_notes.txt b/Documentation/filesystems/pohmelfs/design_notes.txt
new file mode 100644
index 0000000..c9a9379
--- /dev/null
+++ b/Documentation/filesystems/pohmelfs/design_notes.txt
@@ -0,0 +1,61 @@
+POHMELFS: Parallel Optimized Host Message Exchange Layered File System.
+
+ Evgeniy Polyakov <johnpol@....mipt.ru>
+
+Homepage: http://tservice.net.ru/~s0mbre/old/?section=projects&item=pohmelfs
+
+It started as a network filesystem with coherent local data and metadata caches,
+but it is now evolving into a parallel distributed filesystem.
+
+Main features of this FS include:
+ * Local coherent cache for data and metadata (notes:
+ http://tservice.net.ru/~s0mbre/blog/devel/fs/2008_05_17.html)
+ * Completely asynchronous processing of all events, including object creation and data reading
+ and writing (hard link, symlink and rename operations are the only exceptions).
+ * Flexible object architecture optimized for network processing.
+ Ability to create long paths to objects and remove arbitrarily huge
+ directories in a single network command
+ (like removing the whole kernel tree via a single network command).
+ * Very high performance.
+ * Fast and scalable multithreaded userspace server. Being in userspace, it works
+ with any underlying filesystem and is still much faster than the async in-kernel NFS one.
+ * The client is able to switch between different servers (if one goes down, the client
+ automatically reconnects to the second and so on).
+ * Transaction support. Full failover for all operations.
+ Transactions are resent to different servers on timeout or error.
+ * Read requests (data read, directory listing, lookup requests) are balanced between multiple servers.
+ * Write requests are sent to multiple servers and completed only when all of them have sent an ack.
+ * Ability to add and/or remove servers from the working set at run-time from userspace (via netlink,
+ so the same command could be processed from the real network as well, but since the server does
+ not support it yet, I dropped the network part).
+
+
+POHMELFS is based on transactions, which are potentially long-standing objects that live
+in the client's memory. Each transaction contains all information needed to process a given command
+(or set of commands, which is frequently used during data writing: a single transaction can contain
+creation and data writing commands). A transaction is committed by all servers it was sent to,
+and if one of them fails, it will eventually be resent or dropped with an error,
+so, for example, reading will return an error if no servers are available.
+
+POHMELFS uses a novel asynchronous approach to data processing. Thanks to transactions, it is
+possible to detach the reply from the request, and if a command requires data to be received, the caller
+just sleeps waiting for it. Thus it is possible to issue multiple read commands to different
+servers, and async threads will pick up replies in parallel, find the appropriate transactions in the
+system and put the data where it belongs (like the page or inode cache).
+
+The main feature of POHMELFS is its writeback data and metadata cache.
+Only a few non-performance-critical operations use a write-through cache and are synchronous:
+hard and symbolic link creation and object rename. Creation and removal of objects,
+as well as writing, are asynchronous and are sent to the server during system writeback.
+When the server receives a request for a given object in the system (like data reading,
+file creation or anything else), it stores the appropriate client information in its own cache,
+so when a subsequent request comes from a different client, all previous ones can be notified
+(for example, when several clients read data from a file and then a new client writes there,
+the appropriate pages on the clients will be invalidated, so the subsequent write will force them
+to read the page from the server). Because of this feature POHMELFS is extremely fast in metadata
+intensive workloads, and can fully utilize bandwidth to the servers when doing bulk data transfers.
+
+The POHMELFS client operates on a working set of servers and is capable of balancing the read workload
+(i.e. operations without modification, like lookup or directory listing) between all servers it is connected to.
+The administrator can add or remove servers from the set at run-time via a special command (described
+in the Documentation/filesystems/pohmelfs/info.txt file). Writing is performed to all servers.
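Editorial note on design_notes.txt: the working-set behaviour described above (reads balanced between servers, writes completed only when every server has acked) can be pictured with the minimal C sketch below. It is only an illustration under stated assumptions: the notes do not say how reads are balanced, so round-robin is assumed here, and every name in the sketch (working_set, pick_read_server, write_trans and so on) is hypothetical rather than taken from the POHMELFS sources.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SERVERS 8

/* Hypothetical working set: only tracks which servers are alive. */
struct working_set {
	int alive[MAX_SERVERS];
	size_t nr_servers;
	size_t next_read;	/* round-robin cursor (assumed policy) */
};

/*
 * Pick the next alive server for a read-type request (data read,
 * directory listing, lookup). Returns -1 when no server is available,
 * which corresponds to the read failing with an error.
 */
int pick_read_server(struct working_set *ws)
{
	for (size_t i = 0; i < ws->nr_servers; i++) {
		size_t idx = (ws->next_read + i) % ws->nr_servers;

		if (ws->alive[idx]) {
			ws->next_read = (idx + 1) % ws->nr_servers;
			return (int)idx;
		}
	}
	return -1;
}

/* Hypothetical per-write state: one outstanding ack per target server. */
struct write_trans {
	unsigned int acks_pending;
};

/* A write goes to every alive server in the working set. */
void write_start(struct write_trans *t, const struct working_set *ws)
{
	t->acks_pending = 0;
	for (size_t i = 0; i < ws->nr_servers; i++)
		if (ws->alive[i])
			t->acks_pending++;
}

/* Record one ack; returns 1 once all servers have acked the write. */
int write_ack(struct write_trans *t)
{
	assert(t->acks_pending > 0);
	return --t->acks_pending == 0;
}
```

With three servers of which the second is down, successive reads in this sketch go to servers 0, 2, 0, and a write is reported complete only after two acks.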
diff --git a/Documentation/filesystems/pohmelfs/info.txt b/Documentation/filesystems/pohmelfs/info.txt
new file mode 100644
index 0000000..1b4a3e3
--- /dev/null
+++ b/Documentation/filesystems/pohmelfs/info.txt
@@ -0,0 +1,61 @@
+POHMELFS usage information.
+
+Mount options:
+idx=%u
+ Each mountpoint is associated with a special index via this option.
+ The administrator can add or remove servers from a given index, so that all mounts
+ attached to it are updated.
+ Default is 0.
+
+trans_scan_timeout=%u
+ This timeout, expressed in milliseconds, specifies how often to scan the transaction
+ trees looking for stale requests, which have to be resent or, if the number of
+ retries exceeds the specified limit, dropped with an error.
+ Default is 5 seconds.
+
+drop_scan_timeout=%u
+ Internal timeout, expressed in milliseconds, which specifies how frequently
+ inodes marked to be dropped are freed. It also specifies how frequently
+ the system checks whether servers have to be added to or removed from the current working set.
+ Default is 1 second.
+
+wait_on_page_timeout=%u
+ Number of milliseconds to wait for a reply from the remote server to a data reading command.
+ If this timeout is exceeded, reading returns an error.
+ Default is 5 seconds.
+
+trans_retries=%u
+ Number of times a transaction will be resent to a server that did not answer within the
+ last @trans_scan_timeout milliseconds. When the number of resends exceeds this limit,
+ the transaction is completed with an error.
+ Default is 5 resends.
+
+
+Usage examples.
+
+Add server server1.net:1025 to the working set with index $idx (or remove it if it is already there):
+$cfg -a server1.net -p 1025 -i $idx
+
+Mount the filesystem with the given index $idx at the /mnt mountpoint.
+The client will connect to all servers added to the working set via the previous command:
+mount -t pohmel -o idx=$idx q /mnt
+
+One can add or remove servers from the working set after mounting too.
+
+
+Server installation.
+
+Create a server which listens on port 1025 at address 0.0.0.0.
+The working root directory (note that the server chroots there, so you need appropriate permissions)
+is set to /mnt. The number of worker threads is set to 10.
+
+# ./fserver -a 0.0.0.0 -p 1025 -r /mnt -w 10
+
+ -r root - path to root directory. Default: /tmp.
+ -a addr - listen address. Default: 0.0.0.0.
+ -p port - listen port. Default: 1025.
+ -w workers - number of workers per connected client. Default: 1.
+ -h - this help.
+
+The number of worker threads specifies how many workers will be created for each client.
+Bulk single-client transfers are usually better handled with a smaller number (like 1-3).
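Editorial note on info.txt: the interplay of trans_scan_timeout and trans_retries described above can be sketched in C as follows. This is a hedged illustration, not the client's actual scan code: the types, the function name scan_one and the use of a millisecond timestamp are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* What the periodic scan decides for one pending transaction. */
enum scan_action { SCAN_KEEP, SCAN_RESEND, SCAN_DROP };

/* Hypothetical holder for the two mount options involved. */
struct mount_opts {
	unsigned int trans_scan_timeout;	/* milliseconds, default 5000 */
	unsigned int trans_retries;		/* default 5 */
};

/* Hypothetical per-transaction bookkeeping. */
struct pending_trans {
	uint64_t last_sent_ms;	/* when the transaction was last (re)sent */
	unsigned int resends;	/* how many times it has been resent so far */
};

/*
 * Decide what to do with one transaction during a scan:
 * - answered recently enough or not yet stale: keep waiting;
 * - stale and resend limit already reached: complete it with an error;
 * - stale otherwise: resend and restart its timer.
 */
enum scan_action scan_one(struct pending_trans *t,
			  const struct mount_opts *opts, uint64_t now_ms)
{
	if (now_ms - t->last_sent_ms < opts->trans_scan_timeout)
		return SCAN_KEEP;

	if (t->resends >= opts->trans_retries)
		return SCAN_DROP;

	t->resends++;
	t->last_sent_ms = now_ms;
	return SCAN_RESEND;
}
```

With the documented defaults, a transaction that never gets a reply would be resent five times in this sketch and dropped with an error on the following scan.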
diff --git a/Documentation/filesystems/pohmelfs/network_protocol.txt b/Documentation/filesystems/pohmelfs/network_protocol.txt
new file mode 100644
index 0000000..f14910a
--- /dev/null
+++ b/Documentation/filesystems/pohmelfs/network_protocol.txt
@@ -0,0 +1,202 @@
+POHMELFS network protocol.
+
+The basic structure used in network communication is the following command:
+
+struct netfs_cmd
+{
+	__u16 cmd; /* Command number */
+	__u16 ext; /* External flags */
+	__u32 size; /* Size of the attached data */
+	__u64 id; /* Object ID to operate on. Used for feedback. */
+	__u64 start; /* Start of the object. */
+	__u8 data[];
+};
+
+Commands can be embedded into a transaction command (which in turn has its own command),
+so one can extend the protocol as needed without breaking backward compatibility as long
+as old commands are supported. All string lengths include the trailing 0 byte.
+
+@cmd - command number, which specifies the command to be processed. The following
+ commands are currently used:
+
+ NETFS_READDIR = 1, /* Read directory for given inode number */
+ NETFS_READ_PAGE, /* Read data page from the server */
+ NETFS_WRITE_PAGE, /* Write data page to the server */
+ NETFS_CREATE, /* Create directory entry */
+ NETFS_REMOVE, /* Remove directory entry */
+ NETFS_LOOKUP, /* Lookup single object */
+ NETFS_LINK, /* Create a link */
+ NETFS_TRANS, /* Transaction */
+ NETFS_OPEN, /* Open intent */
+ NETFS_INODE_INFO, /* Metadata cache coherency synchronization message */
+ NETFS_JOIN_GROUP, /* Join metadata update group */
+ NETFS_LEAVE_GROUP, /* Leave metadata update group */
+ NETFS_PAGE_CACHE, /* Page cache invalidation message */
+ NETFS_READ_PAGES, /* Read multiple contiguous pages in one go */
+ NETFS_RENAME, /* Rename object */
+
+@ext - external flags. Used by different commands to specify some extra arguments
+ like partial size of the embedded objects or creation flags.
+
+@size - size of the attached data. For NETFS_READ_PAGE and NETFS_READ_PAGES no data is attached,
+ but the size of the requested data is stored here. It does not include the size of the command
+ header (struct netfs_cmd) itself.
+
+@id - id of the object this command operates on. Each command can use it for its own purpose.
+
+@start - start of the object this command operates on. Each command can use it for its own purpose.
+
+
+Command specifications.
+
+@NETFS_READDIR
+This command is used to sync the content of the remote directory to the client.
+
+@ext - length of the path to the object.
+@size - the same.
+@id - local inode number of the directory to read.
+@start - zero.
+
+
+@NETFS_READ_PAGE
+This command is used to read data from the remote server.
+Data size does not exceed the local page cache size.
+
+@id - inode number.
+@start - first byte offset.
+@size - number of bytes to read plus length of the path to the object.
+@ext - object path length.
+
+
+@NETFS_CREATE
+Used to create an object.
+It does not require that all directories above the object have
+already been created; it will create them automatically. Each object has an
+associated @netfs_path_entry data structure, which contains the creation
+mode (permissions and type) and the length of the name as well as the name itself.
+
+@start - 0
+@size - size of all the data structures needed to create the path
+@id - local inode number
+@ext - 0
+
+
+@NETFS_REMOVE
+Used to remove an object.
+
+@ext - length of the path to the object.
+@size - the same.
+@id - local inode number.
+@start - zero.
+
+
+@NETFS_LOOKUP
+Look up information about an object on the server.
+
+@ext - length of the path to the object.
+@size - the same.
+@id - local inode number of the directory to look the object up in.
+@start - local inode number of the object to look at.
+
+
+@NETFS_LINK
+Create a hard link or symlink.
+The command is sent as "object_path|target_path".
+
+@size - size of the above string.
+@id - parent local inode number.
+@start - 1 for symlink, 0 for hardlink.
+@ext - size of the "object_path" above.
+
+
+@NETFS_TRANS
+Transaction header.
+
+@size - incorporates the sizes of all embedded commands including their header sizes.
+@start - transaction generation number - unique id used to find the transaction.
+@ext - transaction flags. Unused at the moment.
+@id - 0.
+
+
+@NETFS_OPEN
+Open intent for the given transaction.
+
+@id - local inode number.
+@start - 0.
+@size - path length to the object.
+@ext - open flags (O_RDWR and so on).
+
+
+@NETFS_INODE_INFO
+Metadata update command.
+It is sent to servers when attributes of the object are changed, and is received
+when data or metadata have been updated. It operates on the following structure:
+
+struct netfs_inode_info
+{
+	unsigned int mode;
+	unsigned int nlink;
+	unsigned int uid;
+	unsigned int gid;
+	unsigned int blocksize;
+	unsigned int padding;
+	__u64 ino;
+	__u64 blocks;
+	__u64 rdev;
+	__u64 size;
+	__u64 version;
+};
+
+It effectively mirrors the data returned by stat(2).
+
+
+@ext - path length to the object.
+@size - the same plus the size of the netfs_inode_info structure.
+@id - local inode number.
+@start - 0.
+
+
+@NETFS_JOIN_GROUP/NETFS_LEAVE_GROUP
+Metadata cache coherency synchronization messages.
+@NETFS_JOIN_GROUP is broadcast when a new inode is created (either for a new object
+or for an object read from the server), so that the server knows the new inode number of the
+object on the appropriate client. @NETFS_LEAVE_GROUP is sent when the local
+inode is destroyed, so that the client can no longer be interested in
+data or metadata updates for the given inode.
+
+@ext - path length to the object.
+@size - the same.
+@id - local inode number for the given object.
+@start - 0.
+
+
+@NETFS_PAGE_CACHE
+This command is only received by clients. It contains information about
+the page to be marked as not up-to-date. The server fills this command with data
+provided by the @NETFS_JOIN_GROUP command above.
+
+@id - client's inode number.
+@start - last byte of the page to be invalidated. If it is not equal to
+ the current inode size, the inode will be truncated with vmtruncate().
+@size - 0
+@ext - 0
+
+
+@NETFS_READ_PAGES
+Used to read multiple contiguous pages in one go.
+
+@start - first byte of the contiguous region to read.
+@size - consists of two fields: the lower 8 bits represent the page cache shift
+ used by the client, the other 3 bytes hold the number of pages.
+@id - local inode number.
+@ext - path length to the object.
+
+
+@NETFS_RENAME
+Used to rename an object.
+The attached data is formed into the following string: "old_path|new_path".
+
+@id - local inode number.
+@start - parent inode number.
+@size - length of the above string.
+@ext - length of the old path part.
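Editorial note on network_protocol.txt: to make the field conventions above concrete, here is a small userspace C sketch that fills a NETFS_READDIR header and packs the two-field @size used by NETFS_READ_PAGES. It is a sketch under assumptions, not code from the patch: the stdint types stand in for the kernel's __u16/__u32/__u64, the helper names are hypothetical, and the wire byte order is left out because the document does not specify it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of struct netfs_cmd as laid out in the document. */
struct netfs_cmd {
	uint16_t cmd;	/* Command number */
	uint16_t ext;	/* External flags */
	uint32_t size;	/* Size of the attached data */
	uint64_t id;	/* Object ID to operate on */
	uint64_t start;	/* Start of the object */
	uint8_t data[];
};

/* Only the two commands used below; values follow the documented list. */
enum {
	NETFS_READDIR = 1,
	NETFS_READ_PAGES = 14,
};

/*
 * @size for NETFS_READ_PAGES: lower 8 bits carry the client's page
 * cache shift, the remaining 3 bytes carry the number of pages.
 */
uint32_t read_pages_size(uint8_t page_shift, uint32_t nr_pages)
{
	assert(nr_pages < (1u << 24));	/* must fit into 3 bytes */
	return (nr_pages << 8) | page_shift;
}

/* Inverse of read_pages_size(), as a server might decode it. */
void read_pages_unpack(uint32_t size, uint8_t *page_shift, uint32_t *nr_pages)
{
	*page_shift = size & 0xff;
	*nr_pages = size >> 8;
}

/*
 * Fill a NETFS_READDIR header: @ext and @size both hold the path
 * length, which includes the trailing 0 byte; @start is zero.
 * The path itself would follow the header as attached data.
 */
void fill_readdir(struct netfs_cmd *cmd, const char *path, uint64_t ino)
{
	size_t len = strlen(path) + 1;

	assert(len <= UINT16_MAX);	/* @ext is only 16 bits wide */
	cmd->cmd = NETFS_READDIR;
	cmd->ext = (uint16_t)len;
	cmd->size = (uint32_t)len;
	cmd->id = ino;
	cmd->start = 0;
}
```

For example, a request for 3 pages from a client with 4096-byte pages (shift 12) would carry @size 0x30c in this sketch, and a READDIR for "/a/b" would carry @ext = @size = 5.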
--
Evgeniy Polyakov