Message-ID: <150174649708.104003.4595004262958377346.stgit@hn>
Date: Thu, 03 Aug 2017 00:48:17 -0700
From: Steven Swanson <swanson@....ucsd.edu>
To: linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-nvdimm@...ts.01.org
Cc: Steven Swanson <steven.swanson@...il.com>, dan.j.williams@...el.com
Subject: [RFC 01/16] NOVA: Documentation
A brief overview is in README.md.
Implementation and usage details are in Documentation/filesystems/nova.txt.
These two papers provide a detailed, high-level description of NOVA's design goals and approach:
NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories (http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf)
Hardening the NOVA File System (http://cseweb.ucsd.edu/~swanson/papers/TechReport2017HardenedNOVA.pdf)
Signed-off-by: Steven Swanson <swanson@...ucsd.edu>
---
Documentation/filesystems/00-INDEX | 2
Documentation/filesystems/nova.txt | 771 ++++++++++++++++++++++++++++++++++++
MAINTAINERS | 8
README.md | 173 ++++++++
4 files changed, 954 insertions(+)
create mode 100644 Documentation/filesystems/nova.txt
create mode 100644 README.md
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index b7bd6c9009cc..dc5c72273957 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -95,6 +95,8 @@ nfs/
- nfs-related documentation.
nilfs2.txt
- info and mount options for the NILFS2 filesystem.
+nova.txt
+ - info on the NOVA filesystem.
ntfs.txt
- info and mount options for the NTFS filesystem (Windows NT).
ocfs2.txt
diff --git a/Documentation/filesystems/nova.txt b/Documentation/filesystems/nova.txt
new file mode 100644
index 000000000000..af90da1c3fb1
--- /dev/null
+++ b/Documentation/filesystems/nova.txt
@@ -0,0 +1,771 @@
+The NOVA Filesystem
+===================
+
+NOVA is a DAX file system designed to maximize performance on hybrid DRAM and
+non-volatile main memory (NVMM) systems while providing strong consistency
+guarantees. NOVA adapts conventional log-structured file system techniques to
+exploit the fast random access that NVMs provide. In particular, it maintains
+separate logs for each inode to improve concurrency, and stores file data
+outside the log to minimize log size and reduce garbage collection costs. NOVA's
+logs provide metadata, data, and mmap atomicity and focus on simplicity and
+reliability, keeping complex metadata structures in DRAM to accelerate lookup
+operations.
+
+The main NOVA features include:
+
+ * POSIX semantics
+ * Direct access (DAX) to byte-addressable NVMM without page caching
+ * Per-CPU NVMM pool to maximize concurrency
+ * Strong consistency guarantees with 8-byte atomic stores
+ * Full filesystem snapshot with DAX-mmap support
+ * Checksums on metadata and file data (crc32c)
+ * Full metadata replication and RAID-5 parity per file page
+ * Online filesystem integrity check and corruption recovery
+
+Filesystem Design
+=================
+NOVA divides NVMM into five regions. NOVA's 512 B superblock contains global
+file system information and the recovery inode. The recovery inode represents a
+special file that stores recovery information (e.g., the list of unallocated
+NVMM pages). NOVA divides its inode tables into per-CPU stripes. It also
+provides per-CPU journals for complex file operations that involve multiple
+inodes. The rest of the available NVMM stores logs and file data.
+
+NOVA is log-structured and stores a separate log for each inode to maximize
+concurrency and provide atomicity for operations that affect a single file. The
+logs only store metadata and comprise a linked list of 4 KB pages. Log entries
+are small – between 32 and 64 bytes. Logs are generally non-contiguous, and log
+pages may reside anywhere in NVMM.
+
+NOVA keeps read-only copies of most file metadata in DRAM during normal
+operations, eliminating the need to access metadata in NVMM during reads.
+
+NOVA uses copy-on-write to provide atomic updates for file data and appends
+metadata about the write to the log. For operations that affect multiple inodes
+NOVA uses lightweight, fixed-length journals – one per core.
+
+NOVA divides the allocatable NVMM into multiple regions, one region per CPU
+core. A per-core allocator manages each of the regions, minimizing contention
+during memory allocation.
+
+After a system crash, NOVA must scan all the logs to rebuild the memory
+allocator state. Since there are many logs, NOVA aggressively parallelizes the
+scan.
+
+Using NOVA
+==========
+
+NOVA runs on a pmem non-volatile memory region. You can create one of these
+regions with the `memmap` kernel command line option. For instance, adding
+`memmap=16G!8G` to the kernel boot parameters reserves 16 GB of memory starting
+at physical address 8 GB, and the kernel creates a `pmem0` block device under
+the `/dev` directory.
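
As a rough illustration of the arithmetic behind `memmap=16G!8G`, the sketch below decodes a `<size>!<start>` value into a physical address range. The helper names `parse_size` and `parse_memmap` are hypothetical, and the sketch assumes only the K, M, and G suffixes; it is not how the kernel itself parses the option.

```python
# Hypothetical helpers: decode a memmap=<size>!<start> value into a byte range.
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Convert a string such as '16G' into a byte count."""
    suffix = text[-1].upper()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)


def parse_memmap(value):
    """Return (start, end) byte addresses for a 'size!start' string."""
    size, start = (parse_size(part) for part in value.split("!"))
    return start, start + size


start, end = parse_memmap("16G!8G")
print(hex(start), hex(end))
```

For `16G!8G` the reserved range runs from 8 GB up to, but not including, 24 GB.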
+
+After the OS has booted, you can initialize a NOVA instance with the following commands:
+
+
+# modprobe nova
+# mount -t NOVA -o init /dev/pmem0 /mnt/ramdisk
+
+
+The above commands create a NOVA instance on `/dev/pmem0` and mount it on
+`/mnt/ramdisk`.
+
+NOVA supports several module command line options:
+
+ * metadata_csum: Enable metadata replication and checksums (default: 0)
+
+ * data_csum: Compute checksums on file data (default: 0)
+
+ * data_parity: Compute parity for file data. (default: 0)
+
+ * inplace_data_updates: Update data in place rather than with COW (default: 0)
+
+ * wprotect: Make PMEM unwritable and then use CR0.WP to enable writes as
+ needed (default: 0). With wprotect=1 you must also load the nd_pmem module
+ in read-only mode (e.g., modprobe nd_pmem readonly=1).
+
+For instance, to enable all of NOVA's data protection features:
+
+# modprobe nova metadata_csum=1\
+ data_csum=1\
+ data_parity=1\
+ wprotect=1
+
+Currently, remounting file systems with different combinations of options may
+not work.
+
+To recover an existing NOVA instance, mount NOVA without the init option, for example:
+
+# mount -t NOVA /dev/pmem0 /mnt/ramdisk
+
+Taking Snapshots
+----------------
+
+To create a snapshot:
+
+# echo 1 > /proc/fs/NOVA/<device>/create_snapshot
+
+To list the current snapshots:
+
+# cat /proc/fs/NOVA/<device>/snapshots
+
+To mount a snapshot, mount NOVA and specify the snapshot index, for example:
+
+# mount -t NOVA -o snapshot=<index> /dev/pmem0 /mnt/ramdisk
+
+Users should not write to the file system after mounting a snapshot.
+
+Source File Structure
+=====================
+
+ * nova_def.h/nova.h
+ Defines NOVA macros and key inline functions.
+
+ * balloc.{h,c}
+ NOVA's block allocator implementation.
+
+ * bbuild.c
+ Implements recovery routines to restore the in-use inode list, the NVMM
+ allocator information, and the snapshot table.
+
+ * checksum.c
+ Contains checksum-related functions to compute and verify checksums on NOVA
+ data structures and file pages, and also performs recovery actions when
+ corruptions are detected.
+
+ * dax.c
+ Implements DAX read/write functions to access file data. NOVA uses
+ copy-on-write to modify file pages by default, unless inplace data update is
+ enabled at mount-time. There are also functions to update and verify the
+ file data integrity information.
+
+ * dir.c
+ Contains functions to create, update, and remove NOVA dentries.
+
+ * file.c
+ Implements file-related operations such as open, fallocate, llseek, fsync,
+ and flush.
+
+ * gc.c
+ NOVA's garbage collection functions.
+
+ * inode.{h,c}
+ Creates, reads, and frees NOVA inode tables and inodes.
+
+ * ioctl.c
+ Implements some ioctl commands to call NOVA's internal functions.
+
+ * journal.{h,c}
+ For operations that affect multiple inodes NOVA uses lightweight,
+ fixed-length journals – one per core. This file contains functions to
+ create and manage the lite journals.
+
+ * log.{h,c}
+ Functions to manipulate NOVA inode logs, including log page allocation, log
+ entry creation, commit, modification, and deletion.
+
+ * mprotect.{h,c}
+ Implements inline functions to enable/disable writing to different NOVA
+ data structures.
+
+ * namei.c
+ Functions to create/remove files, directories, and links. It also looks for
+ the NOVA inode number for a given path name.
+
+ * parity.c
+ Functions to compute file page parity bits. Each file page is striped into
+ equally sized segments (or strips), and one parity strip is calculated using
+ the RAID-5 method. A function to restore a broken data strip is also
+ implemented in this file.
+
+ * perf.{h,c}
+ Function performance measurements. Defines function IDs and call
+ prototypes, and measures the performance of primitive functions, including
+ memory copy functions for DRAM and NVMM, checksum functions, and XOR parity
+ functions.
+
+ * rebuild.c
+ When mounting NOVA after a crash, rebuilds NOVA inodes from its logs. There
+ are also functions to re-calculate checksums and parity bits for file pages
+ that were mmapped during the crash.
+
+ * snapshot.{h,c}
+ Code and data structures for taking snapshots.
+
+ * stats.h
+ Defines data structures and macros used to gather NOVA usage
+ statistics.
+
+ * stats.c
+ Implements routines to gather and print NOVA usage statistics.
+
+ * super.{h,c}
+ Defines the super block structures and the NOVA file system layout, and
+ implements the entry points for mounting and unmounting NOVA and for
+ initializing or recovering the super block and other global file system
+ information.
+
+ * symlink.c
+ Implements functions to create and read symbolic links in the filesystem.
+
+ * sysfs.c
+ Implements sysfs entries to take user inputs for taking snapshots, printing
+ NOVA statistics, and measuring function performance.
+
+
+FS Layout
+=========
+
+A Nova file system resides in a single PMEM device. Nova divides the device
+into 4KB blocks.
+
+ block
++-----------------------------------------------------+
+| 0 | primary super block (struct nova_super_block) |
++-----------------------------------------------------+
+| 1 | Reserved inodes |
++-----------------------------------------------------+
+| 2 | reserved |
++-----------------------------------------------------+
+| 3 | Journal pointers |
++-----------------------------------------------------+
+| 4-5 | Inode pointer tables |
++-----------------------------------------------------+
+| 6 | reserved |
++-----------------------------------------------------+
+| 7 | reserved |
++-----------------------------------------------------+
+| ... | data pages |
++-----------------------------------------------------+
+| n-2 | replica reserved Inodes |
++-----------------------------------------------------+
+| n-1 | replica super block |
++-----------------------------------------------------+
+
+
+
+Superblock and Associated Structures
+====================================
+
+The beginning of the PMEM device holds the super block and its associated
+tables. These include reserved inodes, a table of pointers to the journals
+Nova uses for complex operations, and pointers to inode tables. Nova
+maintains replicas of the super block and reserved inodes in the last two
+blocks of the PMEM area.
+
+
+Block Allocator/Free Lists
+==========================
+
+Nova uses per-CPU allocators to manage free PMEM blocks. On initialization,
+NOVA divides the range of blocks in the PMEM device among the CPUs, and each
+CPU's blocks are managed solely by that CPU. We call these ranges "allocation
+regions".
+
+Some of the blocks in an allocation region have fixed roles. Here's the
+layout:
+
++-------------------------------+
+| data checksum blocks |
++-------------------------------+
+| data parity blocks |
++-------------------------------+
+| |
+| Allocatable blocks |
+| |
++-------------------------------+
+| replica data parity blocks |
++-------------------------------+
+| replica data checksum blocks |
++-------------------------------+
+
+The first and last allocation regions also contain the super block, inode
+tables, etc. and their replicas, respectively.
+
+Each allocator maintains a red-black tree of unallocated ranges (struct
+nova_range_node).
+
+Allocation Functions
+--------------------
+
+Nova allocates PMEM blocks using two mechanisms:
+
+1. Static allocation as defined in super.h
+
+2. Allocation for log and data pages via nova_new_log_blocks() and
+nova_new_data_blocks().
+
+Both of these functions allow the caller to control whether the allocator
+prefers higher addresses for allocation or lower addresses. We use this to
+encourage metadata structures and their replicas to be far from one another.
+
+PMEM Address Translation
+------------------------
+
+In Nova's persistent data structures, memory locations are given as offsets
+from the beginning of the PMEM region. nova_get_block() translates offsets to
+PMEM addresses. nova_get_addr_off() performs the reverse translation.
+
+
+Inodes
+======
+
+Nova maintains per-CPU inode tables, and inode numbers are striped across the
+tables (i.e., inos 0, n, 2n,... on cpu 0; inos 1, n + 1, 2n + 1, ... on cpu 1).
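
The striping rule can be sketched in a few lines. This is only a model of the arithmetic described above: `inode_location` and the notion of a per-table `slot` are hypothetical names, not identifiers taken from the NOVA source.

```python
def inode_location(ino, num_cpus):
    """Map an inode number to (cpu, slot) under round-robin striping.

    cpu  - which per-CPU inode table holds the inode
    slot - the inode's position within that table (hypothetical name)
    """
    return ino % num_cpus, ino // num_cpus


# With n = 4 CPUs, inos 0, 4, 8 fall on cpu 0 and inos 1, 5, 9 on cpu 1.
for ino in (0, 4, 8, 1, 5, 9):
    cpu, slot = inode_location(ino, 4)
    print(f"ino {ino}: cpu {cpu}, slot {slot}")
```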
+
+The inodes themselves live in a set of linked lists (one per CPU) of 2MB
+blocks. The last 8 bytes of each block point to the next block. Pointers to
+the heads of these lists live in PMEM block INODE_TABLE0_START and are
+replicated in PMEM block INODE_TABLE1_START. Additional space for inodes is
+allocated on demand.
+
+To allocate inodes, Nova maintains a per-CPU "inuse_list" in DRAM, an RB tree
+that holds ranges of unallocated inode numbers.
+
+Logs
+====
+
+Nova maintains a log for each inode that records updates to the inode's
+metadata and holds pointers to the file data. Nova makes updates to file data
+and metadata atomic by atomically appending log entries to the log.
+
+Each inode contains pointers to the head and tail of the inode's log. When the
+log grows past the end of the last page, Nova allocates additional space. For
+short logs (less than 1MB), it doubles the length. For longer logs, it adds a
+fixed amount of additional space (1MB).
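
A minimal model of this growth policy, assuming 4 KB log pages so that 1MB corresponds to 256 pages. The names `THRESHOLD_PAGES`, `pages_to_add`, and `grow` are illustrative and do not come from the NOVA source.

```python
PAGE_SIZE = 4096                            # log pages are 4 KB
THRESHOLD_PAGES = (1 << 20) // PAGE_SIZE    # 1 MB expressed in pages (256)


def pages_to_add(current_pages):
    """Number of pages to append when a log of current_pages is full."""
    if current_pages < THRESHOLD_PAGES:
        # Short log: double its length (an empty log gets one page).
        return max(current_pages, 1)
    # Long log: grow by a fixed 1 MB.
    return THRESHOLD_PAGES


def grow(current_pages, steps):
    """Return the log length after each of `steps` extensions."""
    sizes = [current_pages]
    for _ in range(steps):
        current_pages += pages_to_add(current_pages)
        sizes.append(current_pages)
    return sizes


print(grow(64, 4))
```

Starting from 64 pages, the log doubles until it reaches 256 pages (1MB) and then grows by 256 pages per extension.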
+
+Log space is reclaimed during garbage collection.
+
+Log Entries
+-----------
+
+There are eight kinds of log entry, documented in log.h. The log entries have
+several fields in common:
+
+ 1. 'epoch_id' gives the epoch during which the log entry was created.
+ Creating a snapshot increments the epoch_id for the file system.
+
+ 2. 'trans_id' is a filesystem-wide, monotonically increasing number assigned
+ to each log entry. It provides an ordering over all FS operations.
+
+ 3. 'invalid' is true if the effects of this entry are dead and the log
+ entry can be garbage collected.
+
+ 4. 'csum' is a CRC32 checksum for the entry.
+
+Log structure
+-------------
+
+The logs comprise a linked list of PMEM blocks. The tail of each block
+contains some metadata about the block and pointers to the next block and
+the block's replica (struct nova_inode_page_tail).
+
++----------------+
+| log entry |
++----------------+
+| log entry |
++----------------+
+| ... |
++----------------+
+| tail |
+| metadata |
+| -> next block |
++----------------+
+
+
+Journals
+========
+
+Nova uses a lightweight journaling mechanism to provide atomicity for
+operations that modify more than one inode. The journals provide logging
+for two operations:
+
+1. Single word updates (JOURNAL_ENTRY)
+2. Copying inodes (JOURNAL_INODE)
+
+The journals are undo logs: Nova creates the journal entries for an operation,
+and if the operation does not complete due to a system failure, the recovery
+process rolls back the changes using the journal entries.
+
+To commit, Nova drops the log.
+
+Nova maintains one journal per CPU. The head and tail pointers for each
+journal live in a reserved page near the beginning of the file system.
+
+During recovery, Nova scans the journals and undoes the operations described by
+each entry.
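
The undo-journal behavior described in this section can be modeled as follows. This is a toy sketch, not NOVA's implementation: a Python dict stands in for PMEM words, and the class and method names are hypothetical.

```python
class UndoJournal:
    """Toy undo journal over a dict that stands in for PMEM words."""

    def __init__(self, memory):
        self.memory = memory
        self.entries = []            # (address, old_value) pairs

    def update(self, address, new_value):
        # Record the old value first, then modify the word in place.
        self.entries.append((address, self.memory[address]))
        self.memory[address] = new_value

    def commit(self):
        # Committing an undo journal simply discards its entries.
        self.entries.clear()

    def recover(self):
        # Undo in reverse order so the oldest recorded value is restored last.
        for address, old_value in reversed(self.entries):
            self.memory[address] = old_value
        self.entries.clear()


pmem = {"dir_tail": 10, "file_links": 1}
journal = UndoJournal(pmem)
journal.update("dir_tail", 11)
journal.update("file_links", 2)
journal.recover()                    # simulate a crash before commit
print(pmem)
```

Because the journal holds old values, an operation that never reaches `commit()` leaves no trace after `recover()`, while a committed operation is untouched by a later recovery pass.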
+
+
+File and Directory Access
+=========================
+
+To access file data via read(), Nova maintains a radix tree in DRAM for each
+inode (nova_inode_info_header.tree) that maps file offsets to write log
+entries. For directories, the same tree maps a hash of filenames to their
+corresponding dentry.
+
+In both cases, Nova populates the tree when the file or directory is opened
+by scanning its log.
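
A sketch of how such an index could be rebuilt by replaying a file's log, under stated assumptions: a dict keyed by page offset stands in for the radix tree, and the entry fields `pgoff` and `num_pages` are hypothetical names chosen for illustration.

```python
def build_index(log):
    """Replay write entries in log order; later entries override earlier ones."""
    index = {}
    for entry in log:
        first = entry["pgoff"]
        for page in range(first, first + entry["num_pages"]):
            index[page] = entry
    return index


log = [
    {"id": "A", "pgoff": 0, "num_pages": 4},
    {"id": "B", "pgoff": 2, "num_pages": 1},   # overwrites page 2
]
index = build_index(log)
print([index[page]["id"] for page in range(4)])
```

After the replay, a read of page 2 resolves to entry B, while pages 0, 1, and 3 still resolve to entry A.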
+
+MMap and DAX
+============
+
+NOVA leverages the kernel's DAX mechanisms for mmap and file data access. Nova
+maintains a red-black tree in DRAM (nova_inode_info_header.vma_tree) to track
+which portions of a file have been mapped.
+
+Garbage Collection
+==================
+
+Nova recovers log space with a two-phase garbage collection system. When a log
+reaches the end of its allocated pages, Nova allocates more space. Then, the
+fast GC algorithm scans the log to remove pages that have no valid entries.
+Then, it estimates how many pages the log's valid entries would fill. If this
+is less than half the number of pages in the log, the second GC phase copies
+the valid entries to new pages.
+
+For example (V=valid; I=invalid):
+
++---+ +---+ +---+
+| I | | I | | V |
++---+ +---+ Thorough +---+
+| V | | V | GC | V |
++---+ +---+ =====> +---+
+| I | | I | | V |
++---+ +---+ +---+
+| V | | V | | V |
++---+ +---+ +---+
+ | |
+ V V
++---+ +---+
+| I | | V |
++---+ +---+
+| I | fast GC | I |
++---+ ====> +---+
+| I | | I |
++---+ +---+
+| I | | V |
++---+ +---+
+ |
+ V
++---+
+| V |
++---+
+| I |
++---+
+| I |
++---+
+| V |
++---+
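
The two GC phases can be sketched as below. This is a simplified model, not the code in gc.c: each page is a list of booleans (True for a valid entry), the function names are hypothetical, and the sketch assumes the "half the pages" comparison is made against the page count that remains after fast GC.

```python
import math


def fast_gc(pages):
    """Drop pages that contain no valid entries."""
    return [page for page in pages if any(page)]


def thorough_gc(pages, entries_per_page):
    """Copy the valid entries into freshly packed pages."""
    valid = sum(sum(page) for page in pages)
    packed = []
    for start in range(0, valid, entries_per_page):
        packed.append([True] * min(entries_per_page, valid - start))
    return packed


def collect(pages, entries_per_page):
    """Run fast GC, then thorough GC if the log is less than half full."""
    pages = fast_gc(pages)
    valid = sum(sum(page) for page in pages)
    needed = math.ceil(valid / entries_per_page)
    if needed < len(pages) / 2:
        pages = thorough_gc(pages, entries_per_page)
    return pages


T, F = True, False
sparse_log = [[T, F, F, F], [F, F, F, F], [F, T, F, F], [F, F, F, T]]
print(collect(sparse_log, 4))
```

In this example fast GC removes the second page, and the three surviving valid entries would fill one page out of the remaining three, so thorough GC packs them into a single page.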
+
+
+Replication and Checksums
+=========================
+
+Nova protects data and metadata from corruption due to media errors and
+"scribbles" -- software errors in the kernel that may overwrite Nova data.
+
+Replication
+-----------
+
+Nova replicates all PMEM metadata structures (with a few exceptions, which are
+work in progress). For each structure, there is a primary and an "alternate"
+(denoted as "alter" in the code). To ensure that Nova can recover a consistent
+copy of the data in case of a failure, Nova first updates the primary and
+issues a persist barrier to ensure that the data is written to NVMM. Then it
+does the same for the alternate.
+
+Detection
+---------
+
+Nova uses two techniques to detect data corruption. For media errors, Nova
+should always use memcpy_from_pmem() to read data from PMEM, usually by
+copying the PMEM data structure into DRAM.
+
+To detect software-caused corruption, Nova uses CRC32 checksums. All the PMEM
+data structures in Nova include a csum field for this purpose. Nova also
+computes CRC32 checksums for each 512-byte slice of each data page.
+
+The checksums are stored in dedicated pages in each CPU's allocation region.
+
+ replica
+ parity parity
+ page page
+ +---+---+---+---+---+---+---+---+ +---+ +---+
+data page 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | | 0 | | 0 |
+ +---+---+---+---+---+---+---+---+ +---+ +---+
+data page 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | | 1 | | 1 |
+ +---+---+---+---+---+---+---+---+ +---+ +---+
+data page 2 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | | 0 | | 0 |
+ +---+---+---+---+---+---+---+---+ +---+ +---+
+data page 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | | 0 | | 0 |
+ +---+---+---+---+---+---+---+---+ +---+ +---+
+ ... ... ... ...
+
+Recovery
+--------
+
+Nova uses replication to support recovery of metadata structures and
+RAID4-style parity to recover corrupted data.
+
+If Nova detects corruption of a metadata structure, it restores the structure
+using the replica.
+
+If it detects a corrupt slice of a data page, it uses RAID4-style recovery to
+restore it. The CRC32 checksums for the page slices are replicated.
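
The parity-based recovery can be illustrated with plain XOR. The sketch assumes a 4 KB page split into 512-byte strips (matching the checksum slice size mentioned above); the strip size used for parity and all function names here are assumptions made for illustration.

```python
STRIP_SIZE = 512   # assumed strip size, matching the 512-byte checksum slices


def split_strips(page):
    """Cut a page into consecutive STRIP_SIZE-byte strips."""
    return [page[i:i + STRIP_SIZE] for i in range(0, len(page), STRIP_SIZE)]


def xor_strips(strips):
    """XOR a list of equally sized strips together."""
    out = bytearray(STRIP_SIZE)
    for strip in strips:
        for i, byte in enumerate(strip):
            out[i] ^= byte
    return bytes(out)


def compute_parity(page):
    """Parity strip for a page: the XOR of all its data strips."""
    return xor_strips(split_strips(page))


def recover_strip(page, parity, bad_index):
    """Rebuild strip bad_index from the parity strip and the other strips."""
    strips = split_strips(page)
    others = [s for i, s in enumerate(strips) if i != bad_index]
    return xor_strips(others + [parity])


page = bytes((i * 7 + 3) % 256 for i in range(4096))
parity = compute_parity(page)
damaged = bytearray(page)
damaged[1024:1536] = bytes(STRIP_SIZE)        # scribble over strip 2
restored = recover_strip(bytes(damaged), parity, 2)
print(restored == page[1024:1536])
```

The checksums identify which strip is corrupt; the parity strip alone can then rebuild that one strip, but not two damaged strips in the same page.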
+
+Cautious allocation
+-------------------
+
+To maximize its resilience to software scribbles, Nova allocates metadata
+structures and their replicas far from one another. It tries to allocate the
+primary copy at a low address and the replica at a high address within the PMEM
+region.
+
+Write Protection
+----------------
+
+Finally, Nova can prevent unintended writes to PMEM by mapping the entire
+PMEM device as read-only and then disabling _all_ write protection by clearing
+the WP bit in the CR0 control register when Nova needs to perform a write. The
+wprotect mount-time option controls this behavior.
+
+To map the PMEM device as read-only, we have added a readonly module command
+line option to nd_pmem. There is probably a better approach to achieving this
+goal.
+
+Unsafe modes
+============
+
+Nova supports modes that disable some of the protections it provides to improve
+performance.
+
+File data
+---------
+
+Nova can disable parity and/or checksums on file data (options 'data_parity=0'
+and 'data_csum=0'). Without parity, Nova can detect but not recover from
+data corruption. Without checksums, Nova will still detect and recover from
+media errors, but not scribbles.
+
+Nova also supports in-place file updates (option: inplace_data_updates=1).
+This breaks atomicity for writes, but improves performance, especially for
+sub-page writes, since these require a full page COW in the default mode.
+
+Metadata
+--------
+
+Nova can disable metadata checksums and replication (option 'metadata_csum=0').
+
+
+Snapshots
+=========
+
+Nova supports snapshots to facilitate backups.
+
+Taking a snapshot
+-----------------
+
+Each Nova file system has a current epoch_id in the super block and each log
+entry has the epoch_id attached to it at creation. When the user creates a
+snapshot, Nova increments the epoch_id for the file system and the old epoch_id
+identifies the moment the snapshot was taken.
+
+Nova records the epoch_id and a timestamp in a new log entry (struct
+snapshot_info_log_entry) and appends it to the log of the reserved snapshot
+inode (NOVA_SNAPSHOT_INODE) in the superblock.
+
+Nova also maintains a radix tree (nova_sb_info.snapshot_info_tree) of struct
+snapshot_info in DRAM indexed by epoch_id.
+
+Nova also marks all mmap'd pages as read-only and uses COW to preserve file
+contents after the snapshot.
+
+Tracking Live Data
+------------------
+
+Supporting snapshots requires Nova to preserve file contents from previous
+snapshots while also being able to recover the space a snapshot occupied after
+its deletion.
+
+Preserving file contents requires a small change to how Nova implements write
+operations. To perform a write, Nova appends a write log entry to the file's
+log. The log entry includes pointers to newly-allocated and populated NVMM
+pages that hold the written data. If the write overwrites existing data, Nova
+locates the previous write log entry for that portion of the file, and performs
+an "epoch check" that compares the old log entry's epoch_id to the file
+system's current epoch_id. If the comparison matches, the old write log entry
+and the file data blocks it points to no longer belong to any snapshot, and
+Nova reclaims the data blocks.
+
+If the epoch_ids do not match, then the data in the old log entry belongs to
+an earlier snapshot and Nova leaves the log entry in place.
+
+Determining when to reclaim data belonging to deleted snapshots requires
+additional bookkeeping. For each snapshot, Nova maintains a "snapshot log"
+that records the inodes and blocks that belong to that snapshot, but are not
+part of the current file system image.
+
+Nova populates the snapshot log during the epoch check: If the epoch_ids for
+the new and old log entries do not match, it appends a log entry (either struct
+snapshot_inode_entry or struct snapshot_file_write_entry) to the snapshot log
+that the old log entry belongs to. The log entry contains a pointer to the old
+log entry, and the filesystem's current epoch_id as the delete_epoch_id.
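
The epoch check itself reduces to a single comparison. The sketch below models only that decision; locating the snapshot log that the old entry belongs to is left to the caller. The function name and the `blocks` field are hypothetical, while `epoch_id` and `delete_epoch_id` follow the text above.

```python
def epoch_check(old_entry, current_epoch):
    """Decide what to do with the write entry that a new write supersedes.

    Returns ("reclaim", blocks) when no snapshot can reference the old data,
    or ("preserve", snapshot_log_entry) when an earlier snapshot still needs it.
    """
    if old_entry["epoch_id"] == current_epoch:
        return "reclaim", old_entry["blocks"]
    snapshot_log_entry = {
        "old_entry": old_entry,                # pointer to the old log entry
        "delete_epoch_id": current_epoch,      # epoch in which it was superseded
    }
    return "preserve", snapshot_log_entry


old = {"epoch_id": 3, "blocks": [100, 101]}
print(epoch_check(old, current_epoch=3)[0])    # written in the current epoch
print(epoch_check(old, current_epoch=5)[0])    # written before a snapshot
```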
+
+To delete a snapshot, Nova removes the snapshot from the list of live snapshots
+and appends its log to the following snapshot's log. Then, a background thread
+traverses the combined log and reclaims dead inode/data based on the delete
+epoch_id: If the delete epoch_id for an entry in the log is less than or equal
+to the snapshot's epoch_id, it means the log entry and/or the associated data
+blocks are now dead.
+
+Snapshots and DAX
+-----------------
+
+Taking consistent snapshots while applications are modifying files using
+DAX-style mmap requires NOVA to reckon with the order in which stores to NVMM
+become persistent (i.e., reach physical NVMM so they will survive a system
+failure). These applications rely on the processor's "memory persistence
+model" [http://dl.acm.org/citation.cfm?id=2665671.2665712] to make guarantees
+about when and in what order stores become persistent. These guarantees allow
+applications to restore their data to a consistent state during recovery
+from a system failure.
+
+From the application's perspective, reading a snapshot is equivalent to
+recovering from a system failure. In both cases, the contents of the
+memory-mapped file reflect its state at a moment when application operations
+might be in-flight and when the application had no chance to shut down cleanly.
+
+A naive approach to checkpointing mmap()'d files in NOVA would simply mark each
+of the read/write mapped pages as read-only and then do copy-on-write when a
+store occurs to preserve the old pages as part of the snapshot.
+
+However, this approach can leave the snapshot in an inconsistent state:
+Setting the page to read-only captures its contents for the
+snapshot, and the kernel requires NOVA to set the pages as read-only
+one at a time. So, if the order in which NOVA marks pages as read-only
+is incompatible with the ordering that the application requires, the snapshot will
+contain an inconsistent version of the file.
+
+To resolve this problem, when NOVA starts marking pages as read-only, it blocks
+page faults to the read-only mmap()'d pages until it has marked all the pages
+read-only and finished taking the snapshot.
+
+More detail is available in the technical report referenced at the top of this
+document.
+
+We have implemented this functionality in NOVA by adding the 'original_write'
+flag to struct vm_area_struct that tracks whether the vm_area_struct is created
+with write permission, but has been marked read-only in the course of taking a
+snapshot. We have also added a 'dax_cow' operation to struct
+vm_operations_struct that the page fault handler runs when applications write
+to a page with original_write = 1. NOVA's dax_cow operation
+(nova_restore_page_write()) performs the COW, maps the page to a new physical
+page and allows writing.
+
+Saving Snapshot State
+---------------------
+
+During a clean shutdown, Nova stores the snapshot information to PMEM.
+
+Nova reserves an inode for storing snapshot information. The log for the inode
+contains an entry for each snapshot (struct snapshot_info_log_entry). On
+shutdown, Nova allocates one page (struct snapshot_nvmm_page) to store an array
+of struct snapshot_nvmm_list.
+
+Each of these lists (one per CPU) contains head and tail pointers to a linked
+list of blocks (just like an inode log). The lists contain a struct
+snapshot_file_write_entry or struct snapshot_inode_entry for each operation
+that modified file data or an inode.
+
+Superblock
++--------------------+
+| ... |
++--------------------+
+| Reserved Inodes |
++---+----------------+
+| | ... |
++---+----------------+
+| 7 | Snapshot Inode |
+| | head |
++---+----------------+
+ /
+ /
+ /
++---------+---------+---------+
+| Snap | Snap | Snap |
+| epoch=1 | epoch=4 | epoch=11|
+| | | |
+|nvmm_page|nvmm_page|nvmm_page|
++---------+---------+---------+
+ |
+ |
++----------+ +--------+--------+
+| cpu 0 | | snap | snap |
+| head |-->| inode | write |
+| | | entry | entry |
+| | +--------+--------+
++----------+ +--------+--------+
+| cpu 1 | | snap | snap |
+| head |-->| write | write |
+| | | entry | entry |
+| | +--------+--------+
++----------+
+| ... |
++----------+ +--------+
+| cpu 128 | | snap |
+| head |-->| inode |
+| | | entry |
+| | +--------+
++----------+
+
+
+Umount and Recovery
+===================
+
+Clean umount/mount
+------------------
+
+On a clean unmount, Nova saves the contents of many of its DRAM data structures
+to PMEM to accelerate the next mount:
+
+1. Nova stores the allocator state for each of the per-cpu allocators to the
+ log of a reserved inode (NOVA_BLOCK_NODE_INO).
+
+2. Nova stores the per-CPU lists of available inodes (the inuse_list) to the
+ NOVA_BLOCK_INODELIST1_INO reserved inode.
+
+3. Nova stores the snapshot state to PMEM as described above.
+
+After a clean unmount, the following mount restores these data and then
+invalidates them.
+
+Recovery after failures
+------------------------
+
+In case of an unclean unmount (e.g., system crash), Nova must rebuild these
+DRAM structures by scanning the inode logs. Nova log scanning is fast because
+per-CPU inode tables and per-inode logs allow for parallel recovery.
+
+The number of live log entries in an inode log is roughly the number of extents
+in the file. As a result, Nova only needs to scan a small fraction of the NVMM
+during recovery.
+
+The Nova failure recovery consists of two steps:
+
+First, Nova checks its lightweight journals and rolls back any uncommitted
+transactions to restore the file system to a consistent state.
+
+Second, Nova starts a recovery thread on each CPU and scans the inode tables in
+parallel, performing log scanning for every valid inode in the inode table.
+Nova uses different recovery mechanisms for directory inodes and file inodes:
+For a directory inode, Nova scans the log's linked list to enumerate the pages
+it occupies, but it does not inspect the log's contents. For a file inode,
+Nova reads the write entries in the log to enumerate the data pages.
+
+During the recovery scan Nova builds a bitmap of occupied pages, and rebuilds
+the allocator based on the result. After this process completes, the file
+system is ready to accept new requests.
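
The last step, turning the bitmap of occupied pages back into allocator state, can be sketched as follows. This is an illustrative model only: the function name is hypothetical, and a Python list of inclusive ranges stands in for the allocator's red-black tree of free ranges.

```python
def free_ranges(total_blocks, occupied):
    """Return inclusive (first, last) ranges of blocks that are not occupied."""
    bitmap = [False] * total_blocks
    for block in occupied:
        bitmap[block] = True           # mark every block found during the scan

    ranges, start = [], None
    for block, used in enumerate(bitmap):
        if not used and start is None:
            start = block              # a free run begins here
        elif used and start is not None:
            ranges.append((start, block - 1))
            start = None
    if start is not None:              # close a free run that reaches the end
        ranges.append((start, total_blocks - 1))
    return ranges


print(free_ranges(10, [0, 1, 4, 9]))
```

With blocks 0, 1, 4, and 9 occupied out of 10, the rebuilt free list holds the ranges 2-3 and 5-8.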
+
+During the same scan, it rebuilds the snapshot information and the list of
+available inodes.
+
diff --git a/MAINTAINERS b/MAINTAINERS
index 767e9d202adf..cfcee556acc6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9108,6 +9108,14 @@ F: drivers/power/supply/bq27xxx_battery_i2c.c
F: drivers/power/supply/isp1704_charger.c
F: drivers/power/supply/rx51_battery.c
+NOVA FILE SYSTEM
+M: Andiry Xu <jix024@...ucsd.edu>
+M: Steven Swanson <swanson@...ucsd.edu>
+L: linux-fsdevel@...r.kernel.org
+L: linux-nvdimm@...ts.01.org
+F: Documentation/filesystems/nova.txt
+F: fs/nova/
+
NTB DRIVER CORE
M: Jon Mason <jdmason@...zu.us>
M: Dave Jiang <dave.jiang@...el.com>
diff --git a/README.md b/README.md
new file mode 100644
index 000000000000..4f778e99a79e
--- /dev/null
+++ b/README.md
@@ -0,0 +1,173 @@
+# NOVA: NOn-Volatile memory Accelerated log-structured file system
+
+NOVA's goal is to provide a high-performance, full-featured, production-ready
+file system tailored for byte-addressable non-volatile memories (e.g., NVDIMMs
+and Intel's soon-to-be-released 3D XPoint DIMMs). It combines design elements
+from many other file systems to provide a combination of high performance,
+strong consistency guarantees, and comprehensive data protection. NOVA supports
+DAX-style mmap, and making DAX perform well is a first-order priority in NOVA's
+design. NOVA was developed by the [Non-Volatile Systems Laboratory][NVSL] in
+the [Computer Science and Engineering Department][CSE] at the [University of
+California, San Diego][UCSD].
+
+
+NOVA is primarily a log-structured file system, but rather than maintain a
+single global log for the entire file system, it maintains separate logs for
+each file (inode). NOVA breaks the logs into 4KB pages, which need not be
+contiguous in memory. The logs only contain metadata.
+
+File data pages reside outside the log, and log entries for write operations
+point to data pages they modify. File modification uses copy-on-write (COW) to
+provide atomic file updates.
+
+For file operations that involve multiple inodes, NOVA uses small, fixed-sized
+redo logs to atomically append log entries to the logs of the inodes involved.
+
+This structure keeps logs small and makes garbage collection very fast. It also
+enables enormous parallelism during recovery from an unclean unmount, since
+threads can scan logs in parallel.
+
+NOVA replicates and checksums all metadata structures and protects file data
+with RAID-4-style parity. It supports checkpoints to facilitate backups.
+
+A more thorough discussion of NOVA's design is available in these two papers:
+
+**NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories**
+[PDF](http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf)<br>
+*Jian Xu and Steven Swanson*<br>
+Published in [FAST 2016][FAST2016]
+
+**Hardening the NOVA File System**
+[PDF](http://cseweb.ucsd.edu/~swanson/papers/TechReport2017HardenedNOVA.pdf)<br>
+*Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Andy Rudoff, Steven Swanson*<br>
+UCSD-CSE Techreport CS2017-1018
+
+Read on for further details about NOVA's overall design and its current status.
+
+### Compatibility with Other File Systems
+
+NOVA aims to be compatible with other Linux file systems. To help verify that it achieves this, we run several test suites against NOVA each night.
+
+* The latest version of XFSTests. ([Current failures](https://github.com/NVSL/linux-nova/issues?q=is%3Aopen+is%3Aissue+label%3AXFSTests))
+* The [Linux Test Project](https://linux-test-project.github.io/) file system tests.
+* The [fstest POSIX test suite][POSIXtest].
+
+Currently, nearly all of these tests pass for the `master` branch, and we have
+run complex programs on NOVA. There are, of course, many bugs left to fix.
+
+NOVA uses the standard PMEM kernel interfaces for accessing and managing
+persistent memory.
+
+### Atomicity
+
+By default, NOVA makes all metadata and file data operations atomic.
+
+Strong atomicity guarantees make it easier to build reliable applications on
+NOVA, and NOVA can provide these guarantees without sacrificing much performance
+because NVDIMMs support very fast random access.
+
+NOVA also supports "unsafe data" and "unsafe metadata" modes that
+improve performance in some cases and allow for non-atomic updates of file
+data and metadata, respectively.
+
+### Data Protection
+
+NOVA aims to protect data against both misdirected writes in the kernel (which
+can easily "scribble" over the contents of an NVDIMM) and media errors.
+
+NOVA protects all of its metadata structures with a combination of
+replication and checksums. It protects file data using RAID-4-style parity.
+
+NOVA can detect data corruption by verifying checksums on each access and by
+catching and handling machine check exceptions (MCEs) that arise when the
+system's memory controller detects an uncorrectable media error.
+
+We use a fault injection tool that allows testing of these recovery mechanisms.
+
+To facilitate backups, NOVA can take snapshots of the current filesystem state
+that can be mounted read-only while the current file system is mounted
+read-write.
+
+The tech report listed above describes the design of NOVA's data protection system in detail.
+
+### DAX Support
+
+Supporting DAX efficiently is a core feature of NOVA, and one of the challenges
+in designing NOVA is reconciling DAX support, which aims to avoid file system
+intervention when file data changes, with other features that require such
+intervention.
+
+NOVA's philosophy with respect to DAX is that when a program uses DAX mmap to
+modify a file, the program must take full responsibility for that data and
+NOVA must ensure that the memory will behave as expected. At other times, the
+file system provides protection. This approach has several implications:
+
+1. Implementing `msync()` in user space works fine.
+
+2. While a file is mmap'd, it is not protected by NOVA's RAID-style parity
+mechanism, because protecting it would be too expensive. When the file is
+unmapped and/or during file system recovery, protection is restored.
+
+3. The snapshot mechanism must be careful about the order in which it adds
+pages to the file's snapshot image.
+
+### Performance
+
+The research paper and technical report referenced above compare NOVA's
+performance to other file systems. In almost all cases, NOVA outperforms other
+DAX-enabled file systems. A notable exception is sub-page updates which incur
+COW overheads for the entire page.
+
+The technical report also illustrates the trade-offs between our protection
+mechanisms and performance.
+
+## Gaps, Missing Features, and Development Status
+
+Although NOVA is a fully-functional file system, there is still much work left
+to be done. In particular, (at least) the following items are currently missing:
+
+1. There is no mkfs or fsck utility (`mount` takes `-o init` to create a NOVA file system).
+2. NOVA doesn't scrub data to prevent corruption from accumulating in infrequently accessed data.
+3. NOVA doesn't read bad block information on mount and attempt recovery of the affected data.
+4. NOVA only works on x86-64 kernels.
+5. NOVA does not currently support extended attributes or ACLs.
+6. NOVA does not currently prevent writes to mounted snapshots.
+7. Using `write()` to modify pages that are mmap'd is not supported.
+8. NOVA doesn't provide quota support.
+9. Moving NOVA file systems between machines with different numbers of CPUs does not work.
+10. Remounting a NOVA file system with different mount options may fail.
+
+None of these are fundamental limitations of NOVA's design. Additional bugs
+and issues are listed [here](https://github.com/NVSL/linux-nova/issues).
+
+NOVA is complete and robust enough to run a range of complex applications, but
+it is not yet ready for production use. Our current focus is on adding a few
+missing features listed above and finding/fixing bugs.
+
+## Building and Using NOVA
+
+This repo contains a version of the Linux kernel with NOVA included. You should
+be able to build and install it just as you would the mainline Linux source.
+
+### Building NOVA
+
+To build NOVA, build the kernel with PMEM (`CONFIG_BLK_DEV_PMEM`), DAX (`CONFIG_FS_DAX`) and NOVA (`CONFIG_NOVA_FS`) support. Install as usual.
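+
+As a rough sketch, the corresponding kernel `.config` entries might look like
+the fragment below. Building each option in (`=y`) rather than as a module is
+an assumption made for illustration; consult the Kconfig help text for the
+choices actually available.
+
+```
+CONFIG_BLK_DEV_PMEM=y
+CONFIG_FS_DAX=y
+CONFIG_NOVA_FS=y
+```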
+
+## Hacking and Contributing
+
+The NOVA source code is almost completely contained in the `fs/nova` directory.
+The exceptions are some small changes in the kernel's memory management system
+to support checkpointing.
+
+`Documentation/filesystems/nova.txt` describes the internals of NOVA in more detail.
+
+If you find bugs, please [report them](https://github.com/NVSL/linux-nova/issues).
+
+If you have other questions or suggestions you can contact the NOVA developers at [cse-nova-hackers@....ucsd.edu](mailto:cse-nova-hackers@....ucsd.edu).
+
+
+[NVSL]: http://nvsl.ucsd.edu/ "http://nvsl.ucsd.edu"
+[POSIXtest]: http://www.tuxera.com/community/posix-test-suite/
+[FAST2016]: https://www.usenix.org/conference/fast16/technical-sessions
+[CSE]: http://cs.ucsd.edu
+[UCSD]: http://www.ucsd.edu
\ No newline at end of file