linux-kernel - Re: [PATCH v7 1/2] fs: New zonefs file system

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c2aad064-a8a8-e1fa-28ca-b18c5be1f0b7@suse.de>
Date:   Wed, 15 Jan 2020 10:36:05 +0100
From:   Hannes Reinecke <hare@...e.de>
To:     Damien Le Moal <damien.lemoal@....com>,
        linux-fsdevel@...r.kernel.org, linux-xfs@...r.kernel.org,
        linux-kernel@...r.kernel.org,
        Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Johannes Thumshirn <jth@...nel.org>,
        Naohiro Aota <naohiro.aota@....com>,
        "Darrick J . Wong" <darrick.wong@...cle.com>
Subject: Re: [PATCH v7 1/2] fs: New zonefs file system

On 1/15/20 7:28 AM, Damien Le Moal wrote:
> zonefs is a very simple file system exposing each zone of a zoned block
> device as a file. Unlike a regular file system with zoned block device
> support (e.g. f2fs), zonefs does not hide the sequential write
> constraint of zoned block devices to the user. Files representing
> sequential write zones of the device must be written sequentially
> starting from the end of the file (append only writes).
> 
> As such, zonefs is in essence closer to a raw block device access
> interface than to a full featured POSIX file system. The goal of zonefs
> is to simplify the implementation of zoned block device support in
> applications by replacing raw block device file accesses with a richer
> file API, avoiding relying on direct block device file ioctls which may
> be more obscure to developers. One example of this approach is the
> implementation of LSM (log-structured merge) tree structures (such as
> used in RocksDB and LevelDB) on zoned block devices by allowing SSTables
> to be stored in a zone file similarly to a regular file system rather
> than as a range of sectors of a zoned device. The introduction of the
> higher level construct "one file is one zone" can help reducing the
> amount of changes needed in the application as well as introducing
> support for different application programming languages.
> 
> Zonefs on-disk metadata is reduced to an immutable super block to
> persistently store a magic number and optional feature flags and
> values. On mount, zonefs uses blkdev_report_zones() to obtain the device
> zone configuration and populates the mount point with a static file tree
> solely based on this information. E.g. file sizes come from the device
> zone type and write pointer offset managed by the device itself.
> 
> The zone files created on mount have the following characteristics.
> 1) Files representing zones of the same type are grouped together
>    under a common sub-directory:
>      * For conventional zones, the sub-directory "cnv" is used.
>      * For sequential write zones, the sub-directory "seq" is used.
>   These two directories are the only directories that exist in zonefs.
>   Users cannot create other directories and cannot rename nor delete
>   the "cnv" and "seq" sub-directories.
> 2) The name of zone files is the number of the file within the zone
>    type sub-directory, in order of increasing zone start sector.
> 3) The size of conventional zone files is fixed to the device zone size.
>    Conventional zone files cannot be truncated.
> 4) The size of sequential zone files represent the file's zone write
>    pointer position relative to the zone start sector. Truncating these
>    files is allowed only down to 0, in which case, the zone is reset to
>    rewind the zone write pointer position to the start of the zone, or
>    up to the zone size, in which case the file's zone is transitioned
>    to the FULL state (finish zone operation).
> 5) All read and write operations to files are not allowed beyond the
>    file zone size. Any access exceeding the zone size is failed with
>    the -EFBIG error.
> 6) Creating, deleting, renaming or modifying any attribute of files and
>    sub-directories is not allowed.
> 7) There are no restrictions on the type of read and write operations
>    that can be issued to conventional zone files. Buffered, direct and
>    mmap read & write operations are accepted. For sequential zone files,
>    there are no restrictions on read operations, but all write
>    operations must be direct IO append writes. mmap write of sequential
>    files is not allowed.
> 
> Several optional features of zonefs can be enabled at format time.
> * Conventional zone aggregation: ranges of contiguous conventional
>   zones can be aggregated into a single larger file instead of the
>   default one file per zone.
> * File ownership: The owner UID and GID of zone files is by default 0
>   (root) but can be changed to any valid UID/GID.
> * File access permissions: the default 640 access permissions can be
>   changed.
> 
> The mkzonefs tool is used to format zoned block devices for use with
> zonefs. This tool is available on Github at:
> 
> git@...hub.com:damien-lemoal/zonefs-tools.git.
> 
> zonefs-tools also includes a test suite which can be run against any
> zoned block device, including null_blk block device created with zoned
> mode.
> 
> Example: the following formats a 15TB host-managed SMR HDD with 256 MB
> zones with the conventional zones aggregation feature enabled.
> 
> $ sudo mkzonefs -o aggr_cnv /dev/sdX
> $ sudo mount -t zonefs /dev/sdX /mnt
> $ ls -l /mnt/
> total 0
> dr-xr-xr-x 2 root root     1 Nov 25 13:23 cnv
> dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq
> 
> The size of the zone files sub-directories indicate the number of files
> existing for each type of zones. In this example, there is only one
> conventional zone file (all conventional zones are aggregated under a
> single file).
> 
> $ ls -l /mnt/cnv
> total 137101312
> -rw-r----- 1 root root 140391743488 Nov 25 13:23 0
> 
> This aggregated conventional zone file can be used as a regular file.
> 
> $ sudo mkfs.ext4 /mnt/cnv/0
> $ sudo mount -o loop /mnt/cnv/0 /data
> 
> The "seq" sub-directory grouping files for sequential write zones has
> in this example 55356 zones.
> 
> $ ls -lv /mnt/seq
> total 14511243264
> -rw-r----- 1 root root 0 Nov 25 13:23 0
> -rw-r----- 1 root root 0 Nov 25 13:23 1
> -rw-r----- 1 root root 0 Nov 25 13:23 2
> ...
> -rw-r----- 1 root root 0 Nov 25 13:23 55354
> -rw-r----- 1 root root 0 Nov 25 13:23 55355
> 
> For sequential write zone files, the file size changes as data is
> appended at the end of the file, similarly to any regular file system.
> 
> $ dd if=/dev/zero of=/mnt/seq/0 bs=4K count=1 conv=notrunc oflag=direct
> 1+0 records in
> 1+0 records out
> 4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000452219 s, 9.1 MB/s
> 
> $ ls -l /mnt/seq/0
> -rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0
> 
> The written file can be truncated to the zone size, preventing any
> further write operation.
> 
> $ truncate -s 268435456 /mnt/seq/0
> $ ls -l /mnt/seq/0
> -rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0
> 
> Truncation to 0 size allows freeing the file zone storage space and
> restart append-writes to the file.
> 
> $ truncate -s 0 /mnt/seq/0
> $ ls -l /mnt/seq/0
> -rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0
> 
> Since files are statically mapped to zones on the disk, the number of
> blocks of a file as reported by stat() and fstat() indicates the size
> of the file zone.
> 
> $ stat /mnt/seq/0
>   File: /mnt/seq/0
>   Size: 0       Blocks: 524288     IO Block: 4096   regular empty file
> Device: 870h/2160d      Inode: 50431       Links: 1
> Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/  root)
> Access: 2019-11-25 13:23:57.048971997 +0900
> Modify: 2019-11-25 13:52:25.553805765 +0900
> Change: 2019-11-25 13:52:25.553805765 +0900
>  Birth: -
> 
> The number of blocks of the file ("Blocks") in units of 512B blocks
> gives the maximum file size of 524288 * 512 B = 256 MB, corresponding
> to the device zone size in this example. Of note is that the "IO block"
> field always indicates the minimum IO size for writes and corresponds
> to the device physical sector size.
> 
> This code contains contributions from:
> * Johannes Thumshirn <jthumshirn@...e.de>,
> * Darrick J. Wong <darrick.wong@...cle.com>,
> * Christoph Hellwig <hch@....de>,
> * Chaitanya Kulkarni <chaitanya.kulkarni@....com> and
> * Ting Yao <tingyao@...t.edu.cn>.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@....com>
> Reviewed-by: Darrick J. Wong <darrick.wong@...cle.com>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@....com>
> ---
>  MAINTAINERS                |    9 +
>  fs/Kconfig                 |    1 +
>  fs/Makefile                |    1 +
>  fs/zonefs/Kconfig          |    9 +
>  fs/zonefs/Makefile         |    4 +
>  fs/zonefs/super.c          | 1177 ++++++++++++++++++++++++++++++++++++
>  fs/zonefs/zonefs.h         |  175 ++++++
>  include/uapi/linux/magic.h |    1 +
>  8 files changed, 1377 insertions(+)
>  create mode 100644 fs/zonefs/Kconfig
>  create mode 100644 fs/zonefs/Makefile
>  create mode 100644 fs/zonefs/super.c
>  create mode 100644 fs/zonefs/zonefs.h
> 

Reviewed-by: Hannes Reinecke <hare@...e.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      Teamlead Storage & Networking
hare@...e.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer