linux-kernel - [RFC] Tux3 for review

Date:	Fri, 16 May 2014 17:50:59 -0700
From:	Daniel Phillips <d.phillips@...tner.samsung.com>
To:	linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
	tux3@...3.org
CC:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: [RFC] Tux3 for review

We would like to offer Tux3 for review for mainline merge. We have 
prepared a new repository suitable for pulling:

https://git.kernel.org/cgit/linux/kernel/git/daniel/linux-tux3.git/

Tux3 kernel module files are here:

https://git.kernel.org/cgit/linux/kernel/git/daniel/linux-tux3.git/tree/fs/tux3

Tux3 userspace tools and tests are here:

https://git.kernel.org/cgit/linux/kernel/git/daniel/linux-tux3.git/tree/fs/tux3/user?h=user

Repository

We are moving our development to the kernel.org tree from our standalone 
Github repository. Our history was imported from the standalone 
repository using git am. Our kernel.org tree is the usual fork of Linus 
mainline, with Tux3 kernel files on the master branch and userspace 
files in fs/tux3/user on the user branch. We maintain the user files in 
our kernel tree because Tux3 has a tighter coupling than usual between 
userspace and kernel.

Most of our kernel code also runs in userspace, for testing or as a fuse 
filesystem or as part of our userspace support. We also need to keep our 
master branch clean of userspace files. These conflicting requirements 
create challenges for our workflow. We can't just merge from user to 
master because that would pull in userspace files to kernel, and we 
can't merge from master to user because that would pull the entire 
kernel history into our branch. The best idea we have come up with is to 
cherry-pick changes from user to master and master to user. This creates 
merge noise in our user history and requires care to avoid combining 
kernel and userspace changes in the same commit. At least, this is 
better than having two completely separate repositories. Probably. We 
would appreciate any comment on how this workflow could be improved.

For the time being, the subtree at fs/tux3 can also be used standalone. 
Run make in fs/tux3 to build a kernel module for the running kernel 
version. Run make in fs/tux3/user to build userspace commands including 
"tux3 mkfs". Run "make tests" in fs/tux3/user to run our unit tests. 
This capability might be useful for people interested in experimenting 
with Tux3 in user space, and is handy for a quick build of the user 
support without needing to pull the whole repository.

The tux3 command built in fs/tux3/user provides our support tools 
including "tux3 mkfs" and "tux3 fsck". For now, we do not build a 
standalone mkfs.tux3 and consider that a feature, not a bug, because it 
sends the message that Tux3 is for developers right now.

API changes

Tux3 does not implement any custom or extended interfaces.

Core changes

Tux3 builds without any core changes; however, we do some unnatural 
things to enable that. We would like to have some core changes to clean 
this up. One is a correctness issue for mmap and three others are to 
clean up ugly workarounds. Without any core changes, mmap will be 
disabled because there is a potential for stale cache pages with 
combined file and mmap IO. I will describe them here and provide patches 
if asked:

1. mmap

Our "page fork" technique does copy-on-write on cache pages in order to 
enforce strict delta ordering, which prevents changing pages already 
under IO as a side effect. For mmap, we do the page fork in 
->page_mkwrite, which needs to be able to change the target page. 
Without this ability, we fault twice for each page_mkwrite, and we 
cannot close all races. We also have an ugly hack to export a 
page_cow_file symbol to our module without patching core.

2. Free a forked page

A forked page that goes out of scope after IO must be freed. We 
currently do that in an ugly way by polling for refcount to go to zero.

3. Cgroup interaction

We need some unexported functions to support cgroup.

4. Inode flushing

To enforce strong ordering, we flush inodes in a certain order that core 
knows nothing about. Allowing core to flush our inodes using its current 
algorithm would cause corruption. We would like a new fs-specific hook 
to call our own flushing algorithm. Without that, we replicate part of 
the core flushing code to call the tux3 flusher. Code for this is in 
commit_flusher.c and commit_flusher_hack.c. Alternatively we can try to 
improve the core flusher to meet our needs, or do both: develop a 
generic, improved flusher within Tux3 using the hook, test it a lot, 
then propose it for core. We would be more than happy to join in the 
active effort to improve the core flusher.
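
To make the page fork idea from item 1 concrete, here is a minimal 
userspace sketch. It is not Tux3 code: struct toy_page, toy_page_fork 
and the under_io flag are hypothetical names, and a plain pointer slot 
stands in for the page cache. It only shows the copy-on-write step that 
keeps a page stable while an older delta is still writing it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_BYTES 4096

/* Hypothetical stand-in for a cache page; not a Tux3 or kernel type. */
struct toy_page {
    unsigned delta;              /* delta that last dirtied this page */
    int under_io;                /* nonzero while an older delta writes it out */
    unsigned char data[TOY_PAGE_BYTES];
};

/*
 * Return the page that a writer in delta `now` may modify.
 * If the page in *slot is under IO, copy it and point the slot at the
 * copy, so the original stays unchanged until its IO completes.
 * Returns NULL if the copy cannot be allocated.
 */
struct toy_page *toy_page_fork(struct toy_page **slot, unsigned now)
{
    assert(slot != NULL && *slot != NULL);

    struct toy_page *old = *slot;
    if (!old->under_io) {
        old->delta = now;        /* no IO in flight: write in place */
        return old;
    }

    struct toy_page *copy = malloc(sizeof *copy);
    if (copy == NULL)
        return NULL;
    memcpy(copy->data, old->data, TOY_PAGE_BYTES);
    copy->delta = now;
    copy->under_io = 0;
    *slot = copy;                /* later lookups find the writable copy */
    return copy;
}
```

In this toy model the caller still owns the original page and must free 
it once its IO finishes, which corresponds to the problem described in 
item 2.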

Style

We are not perfectly checkpatch clean. We run checkpatch like this:

    scripts/checkpatch.pl -f fs/tux3/*.[ch] --ignore \
        PRINTF_L,C99_COMMENTS,SPLIT_STRING,SUSPECT_CODE_INDENT,LONG_LINE -q

With that, checkpatch still has a few complaints, but not too
many. Our rationale for suppressing some checkpatch complaints:

    PRINTF_L: printk supports it. It is shorter and nicer to our eyes.
    Checkpatch complains that it is not standard C, but it is not clear
    why that matters for kernel code. If anybody cares strongly, we will
    change %L to %ll.

    C99_COMMENTS: We use them sparingly as a shorthand for "FIXME: <line
    where fix is obviously needed>". Will go away as fixes arrive.

    SPLIT_STRING: We split some strings to fit in 80 columns. If anybody
    hates that, we will change them back to long lines.

    SUSPECT_CODE_INDENT: False positives.

    LONG_LINE: There are a few long lines, where readability would be
    worse with splitting. We take our guidance from Linus:

        http://yarchive.net/comp/linux/coding_style.html

    If we made some line unreadable that way, please let us know and we
    will fix it.

Other issues

Declarations after Statements. We have some declarations after 
statements, mostly in the userspace code but also some in the kernel 
code. We have -Wno-declaration-after-statement in tux3/Makefile to build 
without warnings. We think that tasteful use of this C99 extension 
improves our code readability and maintainability. We would prefer to 
keep these if nobody objects.
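
As an illustration of the style in question, here is a small 
standalone example. It is not taken from the Tux3 source, and sum_even 
is a hypothetical helper. Each variable is declared at its first use, 
after earlier statements, rather than at the top of the block.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the even values in an array; hypothetical helper for illustration. */
long sum_even(const int *vals, size_t count)
{
    assert(vals != NULL || count == 0);

    long total = 0;              /* declared after the assert statement */
    for (size_t i = 0; i < count; i++) {
        if (vals[i] % 2 != 0)
            continue;
        int even = vals[i];      /* declared after a statement in this block */
        total += even;
    }
    return total;
}
```

Without -Wno-declaration-after-statement, a build that enables 
-Wdeclaration-after-statement would warn on both declarations.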

Source includes. We include C files in a few places instead of linking 
them, typically because it is easier to maintain that way. This 
technique is already used in various places in kernel. Can be changed if 
necessary.

Fitness for use

Tux3 is not fit for use as of today and will eat your data. The most 
glaring deficiency is that Tux3 goes BUG on ENOSPC. Some expected 
interfaces are missing, like direct IO, xattrs and atime. Some 
performance patches are out of tree, to be merged later. This includes 
directory indexing, so directories over a few thousand files will slow 
to a crawl. Tux3 survives our stress testing, but that does not mean it 
will survive your stress testing.

Purpose

We think that Tux3 fills a niche in the Linux ecology where a light, 
tight, modern filesystem belongs. We offer a fresh approach to some 
ancient problems. Tux3's best trick is strong consistency without the 
overhead that you might expect. Our obsession with minimal resource 
consumption, including disk space, CPU overhead and cache memory makes 
Tux3 promising for personal and embedded use. Tux3's feature set is not 
enterprise grade by any stretch of the imagination, but we hope to 
accrete some big system features over time. Any of several existing 
Linux filesystems already do a nice job of servicing that space, so we 
do not need to rush that. Tux3's special mission is to focus on basic 
functionality that is really robust, fast and simple.

Quick tour

Tux3 has thirty-three C source files and thirteen header files, 
comprising about 18 thousand lines. Some files are the familiar ones 
from Ext2: balloc.c, dir.c, inode.c, namei.c, super.c and xattr.c.

Our btree code is a generic OOP-like btree class implemented in btree.c. 
Subclasses for different btree types are provided by specialized leaf 
methods in dleaf.c and ileaf.c, for file data btrees and our inode table 
tree, respectively. We reuse the ileaf.c methods in orphan.c to store 
orphaned inodes.
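
The general pattern of a generic tree with per-type leaf methods can be 
sketched as follows. This is only an illustration of method-table 
dispatch in C, not the interface in btree.c: all toy_* names are 
hypothetical, and the tree is collapsed to a single leaf so that the 
dispatch stays visible.

```c
#include <assert.h>
#include <stddef.h>

/* Per-leaf-type methods; hypothetical, not the Tux3 leaf interface. */
struct toy_leaf_ops {
    int (*lookup)(const void *leaf, unsigned long key, unsigned long *val);
};

/* Generic tree handle: shared code sees only the method table. */
struct toy_btree {
    const struct toy_leaf_ops *ops;
    const void *root_leaf;       /* a real tree has interior nodes above this */
};

/* Generic entry point used for every leaf type; 0 on hit, -1 on miss. */
int toy_btree_lookup(const struct toy_btree *tree, unsigned long key,
                     unsigned long *val)
{
    assert(tree && tree->ops && tree->ops->lookup && val);
    return tree->ops->lookup(tree->root_leaf, key, val);
}

/* "Subclass" 1: leaf holding explicit (key, value) pairs. */
struct toy_pair { unsigned long key, val; };
struct toy_pair_leaf { size_t count; struct toy_pair pairs[8]; };

static int pair_leaf_lookup(const void *leaf, unsigned long key,
                            unsigned long *val)
{
    const struct toy_pair_leaf *pl = leaf;
    for (size_t i = 0; i < pl->count; i++) {
        if (pl->pairs[i].key == key) {
            *val = pl->pairs[i].val;
            return 0;
        }
    }
    return -1;
}

/* "Subclass" 2: leaf holding a dense array indexed from a base key. */
struct toy_dense_leaf { unsigned long base; size_t count; unsigned long vals[8]; };

static int dense_leaf_lookup(const void *leaf, unsigned long key,
                             unsigned long *val)
{
    const struct toy_dense_leaf *dl = leaf;
    if (key < dl->base || key - dl->base >= dl->count)
        return -1;
    *val = dl->vals[key - dl->base];
    return 0;
}

const struct toy_leaf_ops toy_pair_ops = { .lookup = pair_leaf_lookup };
const struct toy_leaf_ops toy_dense_ops = { .lookup = dense_leaf_lookup };
```

A caller picks the method table when it sets up the tree handle, and 
the generic lookup never needs to know which leaf format it is reading.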

The main workhorse of Tux3 is filemap.c, which maps between logical and 
physical file extents for read and write. This is analogous to 
ext2_get_block but more complex because of extents and btrees. This 
spreads out over several subfiles for modularity: filemap_blocklib.c, 
filemap_hole.c, filemap_mmap.c.
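
For readers unfamiliar with extent mapping, the core lookup can be 
sketched as below. This is a simplified illustration, not code from 
filemap.c: struct toy_extent and toy_map_block are hypothetical, and a 
sorted array stands in for the file data btree.

```c
#include <assert.h>
#include <stddef.h>

/* One run of `count` logical blocks mapped to consecutive physical blocks. */
struct toy_extent {
    unsigned long logical;       /* first logical block of the run */
    unsigned long physical;      /* first physical block of the run */
    unsigned long count;         /* number of blocks in the run */
};

/*
 * Map a logical block through extents sorted by `logical`, with no overlaps.
 * Returns 1 and sets *physical if the block is mapped, 0 if it is in a hole.
 */
int toy_map_block(const struct toy_extent *ext, size_t n,
                  unsigned long logical, unsigned long *physical)
{
    assert(physical != NULL && (ext != NULL || n == 0));

    /* Binary search for the number of extents starting at or before `logical`. */
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ext[mid].logical <= logical)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return 0;                /* before the first extent: hole */

    const struct toy_extent *e = &ext[lo - 1];
    unsigned long offset = logical - e->logical;
    if (offset >= e->count)
        return 0;                /* past the end of the nearest extent: hole */
    *physical = e->physical + offset;
    return 1;
}
```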

Our delta commit model is implemented in commit.c and its subfiles 
commit_flusher.c and commit_flusher_hack.c. This is supported by log.c 
and replay.c, to emit log records and replay them on mount. Flushing out 
dirty cache is a major Tux3 obsession, implemented in writeback.c and 
its subfiles writeback_iattrfork.c, writeback_inodedelete.c and 
writeback_xattrfork.c.

We use buffers as handles for cache blocks, and have some unique 
requirements there, so we have buffer.c with subfiles buffer_fork.c, 
buffer_writeback.c, and buffer_writebacklib.c. These implement our block 
fork concept. A "bufvec" batching technique translates buffers to bios 
for fast IO.
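
The details of bufvec are in the source, not in this mail. As a rough 
illustration of why batching helps, the sketch below assumes that 
batching means coalescing physically contiguous blocks into a single IO 
request. struct toy_run and toy_batch_runs are hypothetical names, and 
each resulting run stands in for one bio.

```c
#include <assert.h>
#include <stddef.h>

/* One contiguous run of blocks, standing in for a single IO request. */
struct toy_run {
    unsigned long start;         /* first block of the run */
    unsigned long count;         /* number of consecutive blocks */
};

/*
 * Coalesce block numbers, given in ascending order, into contiguous runs.
 * `runs` must have room for n entries (worst case: no two blocks adjacent).
 * Returns the number of runs written.
 */
size_t toy_batch_runs(const unsigned long *blocks, size_t n,
                      struct toy_run *runs)
{
    assert(n == 0 || (blocks != NULL && runs != NULL));

    size_t nruns = 0;
    for (size_t i = 0; i < n; i++) {
        /* Extend the current run if this block directly follows it. */
        if (nruns > 0 &&
            runs[nruns - 1].start + runs[nruns - 1].count == blocks[i]) {
            runs[nruns - 1].count++;
            continue;
        }
        runs[nruns].start = blocks[i];
        runs[nruns].count = 1;
        nruns++;
    }
    return nruns;
}
```

With six dirty blocks 5, 6, 7, 10, 11 and 20, this yields three runs 
instead of six separate requests.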

Digression: there might be something generically useful in our buffer 
code; however, in the long run we would rather replace buffer_head 
entirely than try to fix it. Probably, we can save significant CPU and 
memory using a framework that specifically provides cache block handles 
and not other traditional buffer_head IO functionality. So buffer_head 
eradication is in our future work queue and our factoring here reflects 
that.

Our scheme for variable sized inodes with optional attributes is 
implemented in iattr.c. Block allocation is lightly factored into policy 
and mechanism, with the policy bits hived off into policy.c. 
Inode_defer.c is a subfile of inode.c and decouples frontend file 
creation code from backend inode table updating. In inode_vfslib.c we 
duplicate some core kernel code, which will go away if we can export the 
proper core functionality as described earlier. Our ugly hack to export 
page_cow_file is in mmap_builtin_hack.c. In utility.c we have a few 
functions that could possibly become generic.

We encapsulate some of our internal APIs in header files, so we have 
quite a few of those. We also have kcompat.h to support building our 
module over a range of kernel versions. This will go away but is not 
gone yet. In link.h we have a singly linked list implementation somewhat 
resembling the list.h API. We could possibly replace that by llist.h or 
something like it. It is less than a hundred lines so it might be wiser 
to just leave it.
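
For a sense of scale, an intrusive singly linked list in the spirit of 
list.h needs very little code. The sketch below is not the content of 
link.h: struct toy_link, toy_link_push, toy_link_pop and toy_link_entry 
are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical intrusive link, embedded in the object being listed. */
struct toy_link {
    struct toy_link *next;
};

/* Recover the enclosing object from a pointer to its embedded link. */
#define toy_link_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Push a node onto the front of the list headed by *head. */
void toy_link_push(struct toy_link **head, struct toy_link *node)
{
    assert(head != NULL && node != NULL);
    node->next = *head;
    *head = node;
}

/* Pop the front node, or return NULL if the list is empty. */
struct toy_link *toy_link_pop(struct toy_link **head)
{
    assert(head != NULL);

    struct toy_link *node = *head;
    if (node != NULL) {
        *head = node->next;
        node->next = NULL;
    }
    return node;
}

/* Example payload carrying an embedded link. */
struct toy_item {
    int value;
    struct toy_link link;
};
```

Because the link lives inside the payload, pushing and popping never 
allocate memory, which is the same property that makes list.h 
convenient.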

Regards,

Daniel
