[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20251205165743.9341-6-kalyazin@amazon.com>
Date: Fri, 5 Dec 2025 16:58:43 +0000
From: "Kalyazin, Nikita" <kalyazin@...zon.co.uk>
To: "kvm@...r.kernel.org" <kvm@...r.kernel.org>, "linux-doc@...r.kernel.org"
<linux-doc@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "kvmarm@...ts.linux.dev"
<kvmarm@...ts.linux.dev>, "linux-fsdevel@...r.kernel.org"
<linux-fsdevel@...r.kernel.org>, "linux-mm@...ck.org" <linux-mm@...ck.org>,
"bpf@...r.kernel.org" <bpf@...r.kernel.org>,
"linux-kselftest@...r.kernel.org" <linux-kselftest@...r.kernel.org>
CC: "pbonzini@...hat.com" <pbonzini@...hat.com>, "corbet@....net"
<corbet@....net>, "maz@...nel.org" <maz@...nel.org>, "oupton@...nel.org"
<oupton@...nel.org>, "joey.gouly@....com" <joey.gouly@....com>,
"suzuki.poulose@....com" <suzuki.poulose@....com>, "yuzenghui@...wei.com"
<yuzenghui@...wei.com>, "catalin.marinas@....com" <catalin.marinas@....com>,
"will@...nel.org" <will@...nel.org>, "seanjc@...gle.com" <seanjc@...gle.com>,
"tglx@...utronix.de" <tglx@...utronix.de>, "mingo@...hat.com"
<mingo@...hat.com>, "bp@...en8.de" <bp@...en8.de>,
"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>, "x86@...nel.org"
<x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>, "luto@...nel.org"
<luto@...nel.org>, "peterz@...radead.org" <peterz@...radead.org>,
"willy@...radead.org" <willy@...radead.org>, "akpm@...ux-foundation.org"
<akpm@...ux-foundation.org>, "david@...nel.org" <david@...nel.org>,
"lorenzo.stoakes@...cle.com" <lorenzo.stoakes@...cle.com>,
"Liam.Howlett@...cle.com" <Liam.Howlett@...cle.com>, "vbabka@...e.cz"
<vbabka@...e.cz>, "rppt@...nel.org" <rppt@...nel.org>, "surenb@...gle.com"
<surenb@...gle.com>, "mhocko@...e.com" <mhocko@...e.com>, "ast@...nel.org"
<ast@...nel.org>, "daniel@...earbox.net" <daniel@...earbox.net>,
"andrii@...nel.org" <andrii@...nel.org>, "martin.lau@...ux.dev"
<martin.lau@...ux.dev>, "eddyz87@...il.com" <eddyz87@...il.com>,
"song@...nel.org" <song@...nel.org>, "yonghong.song@...ux.dev"
<yonghong.song@...ux.dev>, "john.fastabend@...il.com"
<john.fastabend@...il.com>, "kpsingh@...nel.org" <kpsingh@...nel.org>,
"sdf@...ichev.me" <sdf@...ichev.me>, "haoluo@...gle.com" <haoluo@...gle.com>,
"jolsa@...nel.org" <jolsa@...nel.org>, "jgg@...pe.ca" <jgg@...pe.ca>,
"jhubbard@...dia.com" <jhubbard@...dia.com>, "peterx@...hat.com"
<peterx@...hat.com>, "jannh@...gle.com" <jannh@...gle.com>,
"pfalcato@...e.de" <pfalcato@...e.de>, "shuah@...nel.org" <shuah@...nel.org>,
"riel@...riel.com" <riel@...riel.com>, "baohua@...nel.org"
<baohua@...nel.org>, "ryan.roberts@....com" <ryan.roberts@....com>,
"jgross@...e.com" <jgross@...e.com>, "yu-cheng.yu@...el.com"
<yu-cheng.yu@...el.com>, "kas@...nel.org" <kas@...nel.org>, "coxu@...hat.com"
<coxu@...hat.com>, "kevin.brodsky@....com" <kevin.brodsky@....com>,
"ackerleytng@...gle.com" <ackerleytng@...gle.com>, "maobibo@...ngson.cn"
<maobibo@...ngson.cn>, "prsampat@....com" <prsampat@....com>,
"mlevitsk@...hat.com" <mlevitsk@...hat.com>, "isaku.yamahata@...el.com"
<isaku.yamahata@...el.com>, "jmattson@...gle.com" <jmattson@...gle.com>,
"jthoughton@...gle.com" <jthoughton@...gle.com>,
"linux-arm-kernel@...ts.infradead.org"
<linux-arm-kernel@...ts.infradead.org>, "vannapurve@...gle.com"
<vannapurve@...gle.com>, "jackmanb@...gle.com" <jackmanb@...gle.com>,
"aneesh.kumar@...nel.org" <aneesh.kumar@...nel.org>, "patrick.roy@...ux.dev"
<patrick.roy@...ux.dev>, "Thomson, Jack" <jackabt@...zon.co.uk>, "Itazuri,
Takahiro" <itazur@...zon.co.uk>, "Manwaring, Derek" <derekmn@...zon.com>,
"Cali, Marco" <xmarcalx@...zon.co.uk>, "Kalyazin, Nikita"
<kalyazin@...zon.co.uk>
Subject: [PATCH v8 05/13] KVM: guest_memfd: Add flag to remove from direct map
From: Patrick Roy <patrick.roy@...ux.dev>
Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD()
ioctl. When set, guest_memfd folios will be removed from the direct map
after preparation, with direct map entries only restored when the folios
are freed.
To ensure these folios do not end up in places where the kernel cannot
deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested.
Note that this flag causes removal of direct map entries for all
guest_memfd folios independent of whether they are "shared" or "private"
(although current guest_memfd only supports either all folios in the
"shared" state, or all folios in the "private" state if
GUEST_MEMFD_FLAG_MMAP is not set). The usecase for removing direct map
entries of also the shared parts of guest_memfd are a special type of
non-CoCo VM where, host userspace is trusted to have access to all of
guest memory, but where Spectre-style transient execution attacks
through the host kernel's direct map should still be mitigated. In this
setup, KVM retains access to guest memory via userspace mappings of
guest_memfd, which are reflected back into KVM's memslots via
userspace_addr. This is needed for things like MMIO emulation on x86_64
to work.
Direct map entries are zapped right before guest or userspace mappings
of gmem folios are set up, e.g. in kvm_gmem_fault_user_mapping() or
kvm_gmem_get_pfn() [called from the KVM MMU code]. The only place where
a gmem folio can be allocated without being mapped anywhere is
kvm_gmem_populate(), where handling potential failures of direct map
removal is not possible (by the time direct map removal is attempted,
the folio is already marked as prepared, meaning attempting to re-try
kvm_gmem_populate() would just result in -EEXIST without fixing up the
direct map state). These folios are then removed form the direct map
upon kvm_gmem_get_pfn(), e.g. when they are mapped into the guest later.
Signed-off-by: Patrick Roy <patrick.roy@...ux.dev>
Signed-off-by: Nikita Kalyazin <kalyazin@...zon.com>
---
Documentation/virt/kvm/api.rst | 22 ++++++++-----
include/linux/kvm_host.h | 12 +++++++
include/uapi/linux/kvm.h | 1 +
virt/kvm/guest_memfd.c | 60 ++++++++++++++++++++++++++++++++++
4 files changed, 86 insertions(+), 9 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 01a3abef8abb..c5f54f1370c8 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6440,15 +6440,19 @@ a single guest_memfd file, but the bound ranges must not overlap).
The capability KVM_CAP_GUEST_MEMFD_FLAGS enumerates the `flags` that can be
specified via KVM_CREATE_GUEST_MEMFD. Currently defined flags:
- ============================ ================================================
- GUEST_MEMFD_FLAG_MMAP Enable using mmap() on the guest_memfd file
- descriptor.
- GUEST_MEMFD_FLAG_INIT_SHARED Make all memory in the file shared during
- KVM_CREATE_GUEST_MEMFD (memory files created
- without INIT_SHARED will be marked private).
- Shared memory can be faulted into host userspace
- page tables. Private memory cannot.
- ============================ ================================================
+ ============================== ================================================
+ GUEST_MEMFD_FLAG_MMAP Enable using mmap() on the guest_memfd file
+ descriptor.
+ GUEST_MEMFD_FLAG_INIT_SHARED Make all memory in the file shared during
+ KVM_CREATE_GUEST_MEMFD (memory files created
+ without INIT_SHARED will be marked private).
+ Shared memory can be faulted into host userspace
+ page tables. Private memory cannot.
+ GUEST_MEMFD_FLAG_NO_DIRECT_MAP The guest_memfd instance will behave similarly
+ to memfd_secret, and unmaps the memory backing
+ it from the kernel's address space before
+ being passed off to userspace or the guest.
+ ============================== ================================================
When the KVM MMU performs a PFN lookup to service a guest fault and the backing
guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 27796a09d29b..d4d5306075bf 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -738,10 +738,22 @@ static inline u64 kvm_gmem_get_supported_flags(struct kvm *kvm)
if (!kvm || kvm_arch_supports_gmem_init_shared(kvm))
flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
+ if (kvm_arch_gmem_supports_no_direct_map())
+ flags |= GUEST_MEMFD_FLAG_NO_DIRECT_MAP;
+
return flags;
}
#endif
+#ifdef CONFIG_KVM_GUEST_MEMFD
+#ifndef kvm_arch_gmem_supports_no_direct_map
+static inline bool kvm_arch_gmem_supports_no_direct_map(void)
+{
+ return false;
+}
+#endif
+#endif /* CONFIG_KVM_GUEST_MEMFD */
+
#ifndef kvm_arch_has_readonly_mem
static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
{
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index dddb781b0507..60341e1ba1be 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1612,6 +1612,7 @@ struct kvm_memory_attributes {
#define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)
#define GUEST_MEMFD_FLAG_MMAP (1ULL << 0)
#define GUEST_MEMFD_FLAG_INIT_SHARED (1ULL << 1)
+#define GUEST_MEMFD_FLAG_NO_DIRECT_MAP (1ULL << 2)
struct kvm_create_guest_memfd {
__u64 size;
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 92e7f8c1f303..ec4966a47d5e 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -7,6 +7,9 @@
#include <linux/mempolicy.h>
#include <linux/pseudo_fs.h>
#include <linux/pagemap.h>
+#include <linux/set_memory.h>
+
+#include <asm/tlbflush.h>
#include "kvm_mm.h"
@@ -76,6 +79,49 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo
return 0;
}
+#define KVM_GMEM_FOLIO_NO_DIRECT_MAP BIT(0)
+
+static bool kvm_gmem_folio_no_direct_map(struct folio *folio)
+{
+ return ((u64) folio->private) & KVM_GMEM_FOLIO_NO_DIRECT_MAP;
+}
+
+static int kvm_gmem_folio_zap_direct_map(struct folio *folio)
+{
+ int r = 0;
+ unsigned long addr = (unsigned long) folio_address(folio);
+ u64 gmem_flags = GMEM_I(folio_inode(folio))->flags;
+
+ if (kvm_gmem_folio_no_direct_map(folio) || !(gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP))
+ goto out;
+
+ r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
+ false);
+
+ if (r)
+ goto out;
+
+ folio->private = (void *) KVM_GMEM_FOLIO_NO_DIRECT_MAP;
+ flush_tlb_kernel_range(addr, addr + folio_size(folio));
+
+out:
+ return r;
+}
+
+static void kvm_gmem_folio_restore_direct_map(struct folio *folio)
+{
+ /*
+ * Direct map restoration cannot fail, as the only error condition
+ * for direct map manipulation is failure to allocate page tables
+ * when splitting huge pages, but this split would have already
+ * happened in set_direct_map_invalid_noflush() in kvm_gmem_folio_zap_direct_map().
+ * Thus set_direct_map_valid_noflush() here only updates prot bits.
+ */
+ if (kvm_gmem_folio_no_direct_map(folio))
+ set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
+ true);
+}
+
static inline void kvm_gmem_mark_prepared(struct folio *folio)
{
folio_mark_uptodate(folio);
@@ -398,6 +444,7 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
struct inode *inode = file_inode(vmf->vma->vm_file);
struct folio *folio;
vm_fault_t ret = VM_FAULT_LOCKED;
+ int err;
if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
return VM_FAULT_SIGBUS;
@@ -423,6 +470,12 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
kvm_gmem_mark_prepared(folio);
}
+ err = kvm_gmem_folio_zap_direct_map(folio);
+ if (err) {
+ ret = vmf_error(err);
+ goto out_folio;
+ }
+
vmf->page = folio_file_page(folio, vmf->pgoff);
out_folio:
@@ -533,6 +586,8 @@ static void kvm_gmem_free_folio(struct folio *folio)
kvm_pfn_t pfn = page_to_pfn(page);
int order = folio_order(folio);
+ kvm_gmem_folio_restore_direct_map(folio);
+
kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
}
@@ -596,6 +651,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
/* Unmovable mappings are supposed to be marked unevictable as well. */
WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
+ if (flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)
+ mapping_set_no_direct_map(inode->i_mapping);
+
GMEM_I(inode)->flags = flags;
file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR, &kvm_gmem_fops);
@@ -807,6 +865,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
if (!is_prepared)
r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
+ kvm_gmem_folio_zap_direct_map(folio);
+
folio_unlock(folio);
if (!r)
--
2.50.1
Powered by blists - more mailing lists