Message-Id: <20181210171318.16998-1-vgoyal@redhat.com>
Date: Mon, 10 Dec 2018 12:12:26 -0500
From: Vivek Goyal <vgoyal@...hat.com>
To: linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
kvm@...r.kernel.org
Cc: vgoyal@...hat.com, miklos@...redi.hu, stefanha@...hat.com,
dgilbert@...hat.com, sweil@...hat.com, swhiteho@...hat.com
Subject: [PATCH 00/52] [RFC] virtio-fs: shared file system for virtual machines
Hi,
Here are RFC patches for virtio-fs. Looking for feedback on this approach.
These patches should apply on top of 4.20-rc5. We have also put code for
various components here.
https://gitlab.com/virtio-fs
Problem Description
===================
We want to be able to take a directory tree on the host and share it with
guest[s]. Our goal is to be able to do it in a fast, consistent and secure
manner. Our primary use case is Kata Containers, but it should be usable in
other scenarios as well.
Containers may rely on local file system semantics for shared volumes, i.e.
read-write mounts that multiple containers access simultaneously. File
system changes must be visible to other containers with the same consistency
expected of a local file system, including mmap MAP_SHARED.
Existing Solutions
==================
We looked at existing solutions. virtio-9p already provides basic shared
file system functionality, although it does not offer local file system
semantics, causing some workloads and test suites to fail. In addition,
virtio-9p performance has been an issue for Kata Containers and we believe
this cannot be alleviated without major changes that do not fit into the
9P protocol.
Design Overview
===============
With the goal of designing something with better performance and local file
system semantics, several ideas were explored:
- Use the fuse protocol (instead of 9p) for communication between guest
  and host. The guest kernel acts as the fuse client and a fuse server
  runs on the host to serve requests. Benchmark results (see below) are
  encouraging and show this approach performs well (2x to 8x improvement
  depending on the test being run).
- For data access inside the guest, mmap portions of the file into the
  QEMU address space and let the guest access this memory using DAX. That
  way the guest page cache is bypassed and there is only one copy of the
  data (on the host). This also enables mmap(MAP_SHARED) between guests.
- For metadata coherency, there is a shared memory region containing a
  version number associated with each piece of metadata. Any guest that
  changes metadata updates the version number, and other guests refresh
  their metadata on the next access. This is still experimental and the
  implementation is not complete.
How virtio-fs differs from existing approaches
==============================================
The unique idea behind virtio-fs is to take advantage of the co-location
of the virtual machine and hypervisor to avoid communication (vmexits).
DAX allows file contents to be accessed without communication with the
hypervisor. The shared memory region for metadata avoids communication in
the common case where metadata is unchanged.
By replacing expensive communication with cheaper shared memory accesses,
we expect to achieve better performance than approaches based on network
file system protocols. It also makes it easier to achieve local file
system semantics (coherency).
These techniques are not applicable to network file system protocols since
the communications channel is bypassed by taking advantage of shared memory
on a local machine. This is why we decided to build virtio-fs rather than
focus on 9P or NFS.
HOWTO
======
We have put instructions on how to use it here.
https://virtio-fs.gitlab.io/
Caching Modes
=============
Like virtio-9p, different caching modes are supported which determine the
coherency level as well. The "cache=FOO" and "writeback" options control the
level of coherence between the guest and host filesystems. The "shared" option
only has an effect on coherence between virtio-fs filesystem instances
running inside different guests.
- cache=none
  Metadata, data and pathname lookup are not cached in the guest. They are
  always fetched from the host and any changes are immediately pushed to
  the host.
- cache=always
  Metadata, data and pathname lookup are cached in the guest and never
  expire.
- cache=auto
  Metadata and pathname lookup cache entries expire after a configured
  amount of time (default is 1 second). Data is cached while the file is
  open (close-to-open consistency).
- writeback/no_writeback
  These options control the writeback strategy. If writeback is disabled,
  normal writes are immediately synchronized with the host fs. If writeback
  is enabled, writes may be cached in the guest until the file is closed or
  an fsync(2) is performed. These options have no effect on mmap-ed writes
  or writes going through the DAX mechanism.
- shared/no_shared
  These options control the use of the shared version table. If shared mode
  is enabled, metadata and pathname lookups are cached in the guest but are
  refreshed when another virtio-fs instance makes changes.
DAX
===
- DAX can be turned on/off when mounting virtio-fs inside the guest.
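As a rough illustration (the exact option names here are assumptions; see the HOWTO link above for the authoritative invocation), toggling DAX from inside the guest might look like:

```sh
# Mount with DAX: file data is accessed through the host-mapped cache
# window, bypassing the guest page cache.
mount -t virtio_fs myfs /mnt/virtio_fs -o dax

# Mount without DAX: data goes through regular fuse read/write requests.
mount -t virtio_fs myfs /mnt/virtio_fs
```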
WHAT WORKS
==========
- As of now, cache options none, auto and always are working. The shared
  option is still being worked on.
- DAX on/off works, but it does not seem to be as fast as we were
  expecting. We still need to look into optimization opportunities.
TODO
====
- Complete the "cache=shared" implementation.
- Look into improving performance for DAX. It seems slow.
- Lots of bug fixing, cleanup and performance improvement.
RESULTS
=======
- pjdfstests are passing. Have tried cache=none/auto/always and dax on/off.
  https://github.com/pjd/pjdfstest
  (One symlink test fails and that seems to be due to xfs on the host. Yet
  to look into it.)
- We have run some basic tests and compared with virtio-9p, and virtio-fs
  seems to be faster. I ran the "smallfile" utility and a simple fio job to
  test mmap performance.
Test Setup
-----------
- A Fedora 28 host with 32G RAM, 2 sockets (6 cores per socket, 2
  threads per core)
- Using a PCIe SSD on the host as backing store.
- Created a VM with 16 VCPUS, 6GB memory and a 2GB cache window (for DAX
  mmap).
fio mmap
--------
Wrote a simple fio job to run mmap READ. Ran the test on 1 file and 4
files with different caching modes. File size is 4G. Dropped caches in the
guest before each run; the cache on the host was untouched, so the data on
the host should have been cached. These results are the average of 3 runs.
cache mode               1-file (1 thread)   4-files (4 threads)
virtio-9p mmap           28 MB/s             140 MB/s
virtio-fs none + dax     126 MB/s            501 MB/s
virtio-9p loose          31 MB/s             135 MB/s
virtio-fs always         235 MB/s            858 MB/s
virtio-fs always + dax   121 MB/s            487 MB/s
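For reference, the job was roughly of the following shape (a reconstruction, not the exact job file used; parameter values beyond those stated above are assumptions):

```ini
; mmap READ job: 4G file(s), one thread per file
[global]
ioengine=mmap
rw=read
bs=4k
size=4g
directory=/mnt/virtio_fs

[mmap-read]
numjobs=4   ; 1 for the single-file run
```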
smallfile
---------
https://github.com/distributed-system-analysis/smallfile
I ran a bunch of operations like create, ls-l, read, append, rename and
delete-renamed, measured performance over 3 runs and took the average.
Dropped caches before each operation started running. Effectively used
the following command for each operation:
# python smallfile_cli.py --operation create --threads 8 --file-size 1024 --files 2048 --top <test-dir>
cache mode               operation        files/sec
virtio-9p none           create           194
virtio-fs none           create           714
virtio-9p mmap           create           201
virtio-fs none + dax     create           759
virtio-9p loose          create           16
virtio-fs always         create           685
virtio-fs always + dax   create           735
virtio-9p none           ls-l             2038
virtio-fs none           ls-l             4615
virtio-9p mmap           ls-l             2087
virtio-fs none + dax     ls-l             4616
virtio-9p loose          ls-l             1619
virtio-fs always         ls-l             13571
virtio-fs always + dax   ls-l             12626
virtio-9p none           read             199
virtio-fs none           read             1405
virtio-9p mmap           read             203
virtio-fs none + dax     read             1345
virtio-9p loose          read             207
virtio-fs always         read             1436
virtio-fs always + dax   read             1368
virtio-9p none           append           197
virtio-fs none           append           717
virtio-9p mmap           append           200
virtio-fs none + dax     append           645
virtio-9p loose          append           16
virtio-fs always         append           651
virtio-fs always + dax   append           704
virtio-9p none           rename           2442
virtio-fs none           rename           5797
virtio-9p mmap           rename           2518
virtio-fs none + dax     rename           6386
virtio-9p loose          rename           4178
virtio-fs always         rename           15834
virtio-fs always + dax   rename           15529
Thanks
Vivek
Dr. David Alan Gilbert (5):
virtio-fs: Add VIRTIO_PCI_CAP_SHARED_MEMORY_CFG and utility to find
them
virito-fs: Make dax optional
virtio: Free fuse devices on umount
virtio-fs: Retrieve shm capabilities for version table
virtio-fs: Map using the values from the capabilities
Miklos Szeredi (8):
fuse: simplify fuse_fill_super_common() calling
fuse: delete dentry if timeout is zero
fuse: multiplex cached/direct_io/dax file operations
virtio-fs: pass version table pointer to fuse
fuse: don't crash if version table is NULL
fuse: add shared version support (virtio-fs only)
fuse: shared version cleanups
fuse: fix fuse_permission() for the default_permissions case
Stefan Hajnoczi (17):
fuse: add skeleton virtio_fs.ko module
fuse: add probe/remove virtio driver
fuse: rely on mutex_unlock() barrier instead of fput()
fuse: extract fuse_fill_super_common()
virtio_fs: get mount working
fuse: export fuse_end_request()
fuse: export fuse_len_args()
fuse: add fuse_iqueue_ops callbacks
fuse: process requests queues
fuse: export fuse_get_unique()
fuse: implement FUSE_FORGET for virtio-fs
virtio_fs: Set up dax_device
dax: remove block device dependencies
fuse: add fuse_conn->dax_dev field
fuse: map virtio_fs DAX window BAR
fuse: Implement basic DAX read/write support commands
fuse: add DAX mmap support
Vivek Goyal (22):
virtio-fs: Retrieve shm capabilities for cache
virtio-fs: Map cache using the values from the capabilities
Limit number of pages returned by direct_access()
fuse: Introduce fuse_dax_mapping
Create a list of free memory ranges
fuse: Introduce setupmapping/removemapping commands
Introduce interval tree basic data structures
fuse: Maintain a list of busy elements
Do fallocate() to grow file before mapping for file growing writes
dax: Pass dax_dev to dax_writeback_mapping_range()
fuse: Define dax address space operations
fuse, dax: Take ->i_mmap_sem lock during dax page fault
fuse: Add logic to free up a memory range
fuse: Add logic to do direct reclaim of memory
fuse: Kick worker when free memory drops below 20% of total ranges
Dispatch FORGET requests later instead of dropping them
Release file in process context
fuse: Do not block on inode lock while freeing memory range
fuse: Reschedule dax free work if too many EAGAIN attempts
fuse: Wait for memory ranges to become free
fuse: Take inode lock for dax inode truncation
fuse: Clear setuid bit even in direct I/O path
drivers/dax/super.c | 3 +-
fs/dax.c | 23 +-
fs/ext4/inode.c | 2 +-
fs/fuse/Kconfig | 11 +
fs/fuse/Makefile | 1 +
fs/fuse/cuse.c | 3 +-
fs/fuse/dev.c | 80 ++-
fs/fuse/dir.c | 282 +++++++--
fs/fuse/file.c | 1012 +++++++++++++++++++++++++++--
fs/fuse/fuse_i.h | 234 ++++++-
fs/fuse/inode.c | 278 ++++++--
fs/fuse/readdir.c | 12 +-
fs/fuse/virtio_fs.c | 1336 +++++++++++++++++++++++++++++++++++++++
fs/splice.c | 3 +-
fs/xfs/xfs_aops.c | 2 +-
include/linux/dax.h | 6 +-
include/linux/fs.h | 2 +
include/uapi/linux/fuse.h | 39 ++
include/uapi/linux/virtio_fs.h | 46 ++
include/uapi/linux/virtio_ids.h | 1 +
include/uapi/linux/virtio_pci.h | 10 +
21 files changed, 3151 insertions(+), 235 deletions(-)
create mode 100644 fs/fuse/virtio_fs.c
create mode 100644 include/uapi/linux/virtio_fs.h
--
2.13.6