Message-ID: <20240805093245.889357-1-jgowans@amazon.com>
Date: Mon, 5 Aug 2024 11:32:35 +0200
From: James Gowans <jgowans@...zon.com>
To: <linux-kernel@...r.kernel.org>
CC: James Gowans <jgowans@...zon.com>, Sean Christopherson
	<seanjc@...gle.com>, Paolo Bonzini <pbonzini@...hat.com>, Alexander Viro
	<viro@...iv.linux.org.uk>, Steve Sistare <steven.sistare@...cle.com>,
	Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>, "Anthony
 Yznaga" <anthony.yznaga@...cle.com>, Mike Rapoport <rppt@...nel.org>, "Andrew
 Morton" <akpm@...ux-foundation.org>, <linux-mm@...ck.org>, Jason Gunthorpe
	<jgg@...pe.ca>, <linux-fsdevel@...r.kernel.org>, Usama Arif
	<usama.arif@...edance.com>, <kvm@...r.kernel.org>, Alexander Graf
	<graf@...zon.com>, David Woodhouse <dwmw@...zon.co.uk>, Paul Durrant
	<pdurrant@...zon.co.uk>, Nicolas Saenz Julienne <nsaenz@...zon.es>
Subject: [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem

This patch series implements a new in-memory filesystem designed
specifically for live update. Live update is a mechanism to support
updating a hypervisor in a way that has limited impact on running
virtual machines. This is done by pausing/serialising running VMs,
kexec-ing into a new kernel, starting new VMM processes and then
deserialising/resuming the VMs so that they continue running from where
they were. To support this, guest memory needs to be preserved.

Guestmemfs implements preservation across kexec by carving out a large
contiguous block of host system RAM early in boot which is then used as
the data for the guestmemfs files. As well as preserving that large
block of data memory across kexec, the filesystem metadata is preserved
via the Kexec HandOver (KHO) framework (still under review):
https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com/
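
As a rough illustration only (the parameter name, function names and
exact allocation call below are my assumptions, not necessarily what the
patches do), the carve-out could look something like this early-boot
memblock reservation:

  #include <linux/init.h>
  #include <linux/kernel.h>
  #include <linux/memblock.h>
  #include <linux/sizes.h>

  static phys_addr_t gmfs_base;
  static phys_addr_t gmfs_size;

  /* Hypothetical cmdline parameter, e.g. guestmemfs=512G */
  static int __init parse_guestmemfs_size(char *p)
  {
          gmfs_size = memparse(p, &p);
          return 0;
  }
  early_param("guestmemfs", parse_guestmemfs_size);

  void __init guestmemfs_reserve(void)
  {
          if (!gmfs_size)
                  return;

          /* One large, contiguous, PMD-aligned block backing all file data,
           * grabbed before the buddy allocator takes ownership of the RAM. */
          gmfs_base = memblock_phys_alloc(gmfs_size, SZ_2M);
          if (!gmfs_base)
                  pr_warn("guestmemfs: failed to reserve %pa byte region\n",
                          &gmfs_size);
  }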

Filesystem metadata is structured to make preservation across kexec
easy: inodes are one large contiguous array, and each inode has a
"mappings" block which defines which block from the filesystem data
memory corresponds to which offset in the file.
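
A minimal sketch of that layout (all names and sizes here are
hypothetical, for illustration only, not the patch's actual structures):

  #include <stdint.h>

  #define GMFS_NAME_LEN        255
  #define GMFS_BLOCKS_PER_FILE 512   /* PMD-sized data blocks per file */

  /* One entry in the single contiguous, persisted inode array. */
  struct gmfs_persisted_inode {
          char     name[GMFS_NAME_LEN + 1];
          uint64_t size;              /* file size in bytes */
          uint64_t mappings_block;    /* index of this inode's mappings block */
          uint32_t flags;
  };

  /*
   * The "mappings" block: for each PMD-sized page of a file, the index of
   * the data block (within the reserved region) backing that offset.
   */
  struct gmfs_mappings {
          uint64_t nr_blocks;
          uint64_t block[GMFS_BLOCKS_PER_FILE]; /* block[i] backs offset i * PMD_SIZE */
  };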

There are additional constraints/requirements which guestmemfs aims to
meet:

1. Secret hiding: all filesystem data is removed from the kernel direct
map and is thus immune to speculative access. read()/write() are not
supported; the only way to get at the data is via mmap() (see the
userspace sketch after this list).

2. Struct page overhead elimination: the memory is not managed by the
buddy allocator and hence has no struct pages.

3. PMD and PUD level allocations for TLB performance: guestmemfs
allocates PMD-sized pages to back files which improves TLB perf (caveat
below!). PUD size allocations are a next step.

4. Device assignment: being able to use guestmemfs memory for
VFIO/iommufd mappings, and allow those mappings to survive and continue
to be used across kexec.
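
The userspace view of constraint 1 above, as a hedged sketch (the mount
point, file name and size are made up; the ftruncate() call relies on
the truncation support added later in this series):

  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          size_t len = 2UL << 20;    /* one PMD-sized (2 MiB) chunk */
          int fd = open("/mnt/guestmemfs/vm0-ram", O_CREAT | O_RDWR, 0600);

          if (fd < 0 || ftruncate(fd, len) < 0) {
                  perror("open/ftruncate");
                  return EXIT_FAILURE;
          }

          /* read()/write() are unsupported; data is only reachable via mmap(). */
          void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (mem == MAP_FAILED) {
                  perror("mmap");
                  return EXIT_FAILURE;
          }

          memset(mem, 0, len);       /* guest RAM is now directly addressable */
          munmap(mem, len);
          close(fd);
          return EXIT_SUCCESS;
  }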


Next steps
==========

The idea is that this patch series implements a minimal filesystem to
provide the foundation for in-memory files which persist across kexec.
Once this foundation is in place it will be extended:

1. Improve the filesystem to be more comprehensive - currently it's just
functional enough to demonstrate the main objective of reserved memory
and persistence via KHO.

2. Build support for iommufd IOAS and HWPT persistence, and integrate
that with guestmemfs. The idea is that if VMs have DMA devices assigned
to them, DMA should continue running across kexec. A future patch series
will add support for this in iommufd and connect iommufd to guestmemfs
so that guestmemfs files can remain mapped into the IOMMU during kexec.

3. Support a guest_memfd interface to files so that they can be used for
confidential computing without needing to mmap into userspace.

4. Gigantic PUD level mappings for even better TLB perf.

Caveats
=======

There are a few issues with the current implementation which should be
solved either in this patch series or soon in follow-on work:

1. Although PMD-size allocations are done, PTE-level page tables are
still created. This is because guestmemfs uses remap_pfn_range() to set
up userspace pgtables, and currently remap_pfn_range() only creates
PTE-level mappings (see the sketch after this list). I suggest enhancing
remap_pfn_range() to support creating higher-level mappings where
possible, by adding pmd_special and pud_special flags.

2. NUMA support is currently non-existent. To make this more generally
useful it's necessary to have NUMA awareness. One thought on how to do
this is to allow specifying multiple allocations with NUMA affinity on
the kernel cmdline and to have multiple mount points, one per NUMA node.
Currently, for simplicity, only a single contiguous filesystem data
allocation and a single mount point are supported.

3. MCEs are currently not handled; functionality needs to be added to
track block ownership so that MCEs can be delivered to the correct
owner.

4. Reviews from filesystem experts are needed to check whether the
necessary callbacks, refcounting, locking, etc., are done correctly.
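
To make caveat 1 concrete, the mmap path could look roughly like the
sketch below. This is not the patch's actual code: gmfs_file_phys() is a
hypothetical helper, and the structure is only meant to show where
remap_pfn_range() (and its PTE-only behaviour) enters the picture:

  #include <linux/fs.h>
  #include <linux/mm.h>

  /* Hypothetical helper returning the physical base of a file's data. */
  phys_addr_t gmfs_file_phys(struct inode *inode);

  static int gmfs_mmap(struct file *file, struct vm_area_struct *vma)
  {
          unsigned long size = vma->vm_end - vma->vm_start;
          unsigned long pfn = PHYS_PFN(gmfs_file_phys(file_inode(file)));

          /*
           * remap_pfn_range() installs PTE-level entries even though the
           * backing allocation is PMD-sized, hence the TLB caveat above.
           */
          return remap_pfn_range(vma, vma->vm_start, pfn, size,
                                 vma->vm_page_prot);
  }

  static const struct file_operations gmfs_file_ops = {
          .mmap = gmfs_mmap,
          /* no .read/.write: data is reachable only via mmap */
  };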

Open questions
==============

It is not clear whether or how guestmemfs should use DAX as a source of
memory. Given guestmemfs's in-memory design, using DAX as a memory
source does not seem necessary, but I am keen for guidance/input on
whether DAX should be used here.

The filesystem data memory is removed from the direct map for secret
hiding, but it is still necessary to mmap it to make it accessible to
KVM. To improve secret hiding even further, a guest_memfd-style
interface could be used to remove the need to mmap. That introduces a
new problem: the memory would be completely inaccessible to KVM for
things like MMIO instruction emulation. How can this be handled?

Related Work
============

There are similarities to a few attempts at solving aspects of this
problem previously.

The original was probably PKRAM from Oracle: a tmpfs-style filesystem
with persistence:
https://lore.kernel.org/kexec/1682554137-13938-1-git-send-email-anthony.yznaga@oracle.com/
Guestmemfs will additionally provide secret hiding, PMD/PUD allocations
and a path to DMA persistence and NUMA support.

Dmemfs from Tencent aimed to remove the need for struct page overhead:
https://lore.kernel.org/kvm/cover.1602093760.git.yuleixzhang@tencent.com/
Guestmemfs provides this benefit too, along with persistence across
kexec and secret hiding. 

Pkernfs attempted to solve guest memory persistence and IOMMU
persistence all in one:
https://lore.kernel.org/all/20240205120203.60312-1-jgowans@amazon.com/
Guestmemfs is a re-work of that to only persist guest RAM in the
filesystem, and to use KHO for filesystem metadata. IOMMU persistence
will be implemented independently with persistent iommufd domains via
KHO.

Testing
=======

The testing for this can be seen in the Documentation file in this patch
series. Essentially it uses a guestmemfs file as a QEMU VM's RAM, does a
kexec, restores the QEMU VM and confirms that the VM picks up from where
it left off.

James Gowans (10):
  guestmemfs: Introduce filesystem skeleton
  guestmemfs: add inode store, files and dirs
  guestmemfs: add persistent data block allocator
  guestmemfs: support file truncation
  guestmemfs: add file mmap callback
  kexec/kho: Add addr flag to not initialise memory
  guestmemfs: Persist filesystem metadata via KHO
  guestmemfs: Block modifications when serialised
  guestmemfs: Add documentation and usage instructions
  MAINTAINERS: Add maintainers for guestmemfs

 Documentation/filesystems/guestmemfs.rst |  87 +++++++
 MAINTAINERS                              |   8 +
 arch/x86/mm/init_64.c                    |   2 +
 fs/Kconfig                               |   1 +
 fs/Makefile                              |   1 +
 fs/guestmemfs/Kconfig                    |  11 +
 fs/guestmemfs/Makefile                   |   8 +
 fs/guestmemfs/allocator.c                |  40 +++
 fs/guestmemfs/dir.c                      |  43 ++++
 fs/guestmemfs/file.c                     | 106 ++++++++
 fs/guestmemfs/guestmemfs.c               | 160 ++++++++++++
 fs/guestmemfs/guestmemfs.h               |  60 +++++
 fs/guestmemfs/inode.c                    | 189 ++++++++++++++
 fs/guestmemfs/serialise.c                | 302 +++++++++++++++++++++++
 include/linux/guestmemfs.h               |  16 ++
 include/uapi/linux/kexec.h               |   6 +
 kernel/kexec_kho_in.c                    |  12 +-
 kernel/kexec_kho_out.c                   |   4 +
 18 files changed, 1055 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/filesystems/guestmemfs.rst
 create mode 100644 fs/guestmemfs/Kconfig
 create mode 100644 fs/guestmemfs/Makefile
 create mode 100644 fs/guestmemfs/allocator.c
 create mode 100644 fs/guestmemfs/dir.c
 create mode 100644 fs/guestmemfs/file.c
 create mode 100644 fs/guestmemfs/guestmemfs.c
 create mode 100644 fs/guestmemfs/guestmemfs.h
 create mode 100644 fs/guestmemfs/inode.c
 create mode 100644 fs/guestmemfs/serialise.c
 create mode 100644 include/linux/guestmemfs.h

-- 
2.34.1

