Message-ID: <20250828093902.2719-1-roypat@amazon.co.uk>
Date: Thu, 28 Aug 2025 09:39:14 +0000
From: "Roy, Patrick" <roypat@...zon.co.uk>
To: "david@...hat.com" <david@...hat.com>, "seanjc@...gle.com"
<seanjc@...gle.com>
CC: "Roy, Patrick" <roypat@...zon.co.uk>, "tabba@...gle.com"
<tabba@...gle.com>, "ackerleytng@...gle.com" <ackerleytng@...gle.com>,
"pbonzini@...hat.com" <pbonzini@...hat.com>, "kvm@...r.kernel.org"
<kvm@...r.kernel.org>, "linux-arm-kernel@...ts.infradead.org"
<linux-arm-kernel@...ts.infradead.org>, "kvmarm@...ts.linux.dev"
<kvmarm@...ts.linux.dev>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "linux-mm@...ck.org" <linux-mm@...ck.org>,
"rppt@...nel.org" <rppt@...nel.org>, "will@...nel.org" <will@...nel.org>,
"vbabka@...e.cz" <vbabka@...e.cz>, "Cali, Marco" <xmarcalx@...zon.co.uk>,
"Kalyazin, Nikita" <kalyazin@...zon.co.uk>, "Thomson, Jack"
<jackabt@...zon.co.uk>, "Manwaring, Derek" <derekmn@...zon.com>
Subject: [PATCH v5 00/12] Direct Map Removal Support for guest_memfd
[ based on kvm/next ]
Unmapping virtual machine guest memory from the host kernel's direct map is an
effective mitigation against Spectre-style transient execution issues: if the
kernel page tables do not contain entries pointing to guest memory, then any
attempted speculative read through the direct map will necessarily be blocked
by the MMU before any observable microarchitectural side effects happen. This
means that Spectre gadgets and similar cannot be used to target virtual machine
memory. Roughly 60% of speculative execution issues fall into this category [1,
Table 1].
This patch series extends guest_memfd with the ability to remove its memory
from the host kernel's direct map, so that KVM guests backed by guest_memfd can
attain the above protection.
=== Design ===
We build on top of guest_memfd's recent support for "non-confidential VMs", in
which all of guest_memfd is mappable to userspace (i.e. considered "shared").
For such VMs, all guest page faults are routed through guest_memfd's special
page fault handler, which, because it consumes fd+offset directly, can map
direct map removed memory into the guest. KVM's internal accesses to guest
memory are handled by providing each memslot with a userspace mapping of that
memslot's guest_memfd via userspace_addr. Since KVM's internal accesses are
almost exclusively handled via copy_from_user() and friends, this allows KVM to
access direct map removed guest memory for features such as MMIO instruction
emulation on x86 or pvtime support on ARM64.
=== Implementation ===
The KVM_CREATE_GUEST_MEMFD ioctl gains a new flag
GUEST_MEMFD_FLAG_NO_DIRECT_MAP. If this flag is passed, then guest_memfd
removes direct map entries for its folios after preparation. Upon freeing of
the memory, direct map entries are restored prior to gmem's arch-specific
invalidation callback.
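
To make the ordering above concrete, here is a small userspace model of the
folio lifecycle. It is only a sketch, not code from the series: every name in
it (model_folio, model_gmem, model_prepare, model_free) is hypothetical, and it
assumes direct map removal happens once a folio has been prepared.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_folio {
	bool in_direct_map;	/* does the host direct map still cover it? */
	bool prepared;		/* has the folio been prepared for guest use? */
	bool invalidated;	/* has the arch invalidation callback run? */
};

struct model_gmem {
	bool no_direct_map;	/* stands in for GUEST_MEMFD_FLAG_NO_DIRECT_MAP */
};

/* A freshly allocated folio is still covered by the direct map. */
void model_folio_init(struct model_folio *f)
{
	f->in_direct_map = true;
	f->prepared = false;
	f->invalidated = false;
}

/* Preparation: with the flag set, the direct map entries are dropped. */
void model_prepare(const struct model_gmem *g, struct model_folio *f)
{
	f->prepared = true;
	if (g->no_direct_map)
		f->in_direct_map = false;
}

/* Free: restore the direct map entries first, then invalidate. */
void model_free(const struct model_gmem *g, struct model_folio *f)
{
	if (g->no_direct_map)
		f->in_direct_map = true;

	/* The invalidation step must only ever see restored folios. */
	assert(f->in_direct_map);
	f->invalidated = true;
	f->prepared = false;
}
```

The assert in model_free() encodes the one property the description insists
on: by the time the invalidation step runs, the folio is back in the direct
map.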
Support for the flag can be discovered via the KVM_CAP_GMEM_NO_DIRECT_MAP
capability, which is only available if direct map modifications at 4k
granularity are architecturally possible / when KVM can successfully map direct
map removed memory into the guest.
=== Testing ===
KVM selftests are extended to cover the above-described non-CoCo workflows,
where guest_memfd with direct map entries removed is used to back all of guest
memory, and to exercise some simple MMIO paths.
Additionally, a Firecracker branch with support for these VMs can be found on
GitHub [2].
=== Changes since v4 ===
- Rebase on top of kvm/next
- Stop using PG_private to track direct map removal state
- Fix build for KVM-as-a-module by using new EXPORT_SYMBOL_FOR_MODULES
=== FAQ ===
--- why not reuse memfd_secret() / a bespoke guest memory solution? ---
Having guest memory be direct map removed means guest page faults cannot be
resolved by GUP-ing userspace mappings of guest memory, as GUP is disabled for
direct map removed memory (GUP currently has no way to know that a specific
GUP request will not subsequently dereference page_address()).
guest_memfd already has a special path inside KVM that instead consumes
fd+offset, so it makes sense to reuse this. Additionally, it means that
direct-map-removed VMs can benefit from active development on guest_memfd, such
as huge pages support.
--- why do KVM internal accesses through userspace page tables? ---
For traditional VMs, all KVM internal accesses are done through the
userspace_addr stored in a memslot, meaning no changes to most KVM code are
needed just to allow access to guest_memfd backed / direct map removed guest
memory of non-confidential VMs. Previous iterations of this series tried to
avoid userspace mappings, instead attempting to dynamically restore direct map
entries for internal accesses [RFCv2], but this turned out to have a
significant performance impact, as well as additional complexity due to needing
to refcount direct map reinsertion operations and making them play nicely with
gmem truncations.
--- what doesn't work with direct map removed VMs? ---
The only thing I'm aware of is kvm-clock, since it tries to GUP guest memory
via gfn_to_pfn_cache. Realistically, this is only a problem on AMD, as on Intel
guests can use TSC as a clocksource (Intel allows discovery of TSC frequency
via CPUID, while AMD doesn't). AMD guests fall back onto some calibration
routine, which fails most of the time though.
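
As an illustration of the CPUID-based discovery mentioned above: the text does
not say which leaf is meant, so this sketch assumes CPUID leaf 0x15 (Time
Stamp Counter and Core Crystal Clock information) as described in the Intel
SDM. The helper name is hypothetical, and it takes the register values as
arguments so that the arithmetic can be checked without executing CPUID.

```c
#include <stdint.h>

/*
 * Assumed source of the inputs: CPUID leaf 0x15, where
 *   EAX = denominator of the TSC / core crystal clock ratio
 *   EBX = numerator of the TSC / core crystal clock ratio
 *   ECX = core crystal clock frequency in Hz
 *
 * Returns the TSC frequency in Hz, or 0 if the leaf does not enumerate
 * enough information (any of the three values is zero).
 */
uint64_t tsc_hz_from_cpuid_0x15(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
	if (eax == 0 || ebx == 0 || ecx == 0)
		return 0;

	/* TSC Hz = crystal Hz * numerator / denominator */
	return (uint64_t)ecx * ebx / eax;
}
```

For example, a 24 MHz crystal with a ratio of 200/2 yields a TSC frequency of
2.4 GHz. A return value of 0 corresponds to the case where the frequency
cannot be discovered this way and some other mechanism is needed.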
[1]: https://download.vusec.net/papers/quarantine_raid23.pdf
[2]: https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding
[RFCv1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/
[RFCv2]: https://lore.kernel.org/kvm/20240910163038.1298452-1-roypat@amazon.co.uk/
[RFCv3]: https://lore.kernel.org/kvm/20241030134912.515725-1-roypat@amazon.co.uk/
[v4]: https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/
Elliot Berman (1):
  filemap: Pass address_space mapping to ->free_folio()

Patrick Roy (11):
  arch: export set_direct_map_valid_noflush to KVM module
  mm: introduce AS_NO_DIRECT_MAP
  KVM: guest_memfd: Add flag to remove from direct map
  KVM: Documentation: describe GUEST_MEMFD_FLAG_NO_DIRECT_MAP
  KVM: selftests: load elf via bounce buffer
  KVM: selftests: set KVM_MEM_GUEST_MEMFD in vm_mem_add() if guest_memfd != -1
  KVM: selftests: Add guest_memfd based vm_mem_backing_src_types
  KVM: selftests: stuff vm_mem_backing_src_type into vm_shape
  KVM: selftests: cover GUEST_MEMFD_FLAG_NO_DIRECT_MAP in mem conversion tests
  KVM: selftests: cover GUEST_MEMFD_FLAG_NO_DIRECT_MAP in guest_memfd_test.c
  KVM: selftests: Test guest execution from direct map removed gmem

Documentation/filesystems/locking.rst | 2 +-
Documentation/virt/kvm/api.rst | 5 ++
arch/arm64/include/asm/kvm_host.h | 12 ++++
arch/arm64/mm/pageattr.c | 1 +
arch/loongarch/mm/pageattr.c | 1 +
arch/riscv/mm/pageattr.c | 1 +
arch/s390/mm/pageattr.c | 1 +
arch/x86/mm/pat/set_memory.c | 1 +
fs/nfs/dir.c | 11 ++--
fs/orangefs/inode.c | 3 +-
include/linux/fs.h | 2 +-
include/linux/kvm_host.h | 7 +++
include/linux/pagemap.h | 16 +++++
include/linux/secretmem.h | 18 ------
include/uapi/linux/kvm.h | 2 +
lib/buildid.c | 4 +-
mm/filemap.c | 9 +--
mm/gup.c | 14 +----
mm/mlock.c | 2 +-
mm/secretmem.c | 9 +--
mm/vmscan.c | 4 +-
.../testing/selftests/kvm/guest_memfd_test.c | 2 +
.../testing/selftests/kvm/include/kvm_util.h | 37 ++++++++---
.../testing/selftests/kvm/include/test_util.h | 8 +++
tools/testing/selftests/kvm/lib/elf.c | 8 +--
tools/testing/selftests/kvm/lib/io.c | 23 +++++++
tools/testing/selftests/kvm/lib/kvm_util.c | 61 +++++++++++--------
tools/testing/selftests/kvm/lib/test_util.c | 8 +++
tools/testing/selftests/kvm/lib/x86/sev.c | 1 +
.../selftests/kvm/pre_fault_memory_test.c | 1 +
.../selftests/kvm/set_memory_region_test.c | 50 +++++++++++++--
.../kvm/x86/private_mem_conversions_test.c | 7 ++-
virt/kvm/guest_memfd.c | 32 ++++++++--
virt/kvm/kvm_main.c | 5 ++
34 files changed, 264 insertions(+), 104 deletions(-)
base-commit: a6ad54137af92535cfe32e19e5f3bc1bb7dbd383
--
2.50.1