lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250925155237.3928-1-roypat@amazon.co.uk>
Date: Thu, 25 Sep 2025 15:52:38 +0000
From: "Roy, Patrick" <roypat@...zon.co.uk>
To: "david@...hat.com" <david@...hat.com>
CC: "Liam.Howlett@...cle.com" <Liam.Howlett@...cle.com>,
	"ackerleytng@...gle.com" <ackerleytng@...gle.com>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>, "andrii@...nel.org"
	<andrii@...nel.org>, "ast@...nel.org" <ast@...nel.org>, "bp@...en8.de"
	<bp@...en8.de>, "bpf@...r.kernel.org" <bpf@...r.kernel.org>,
	"catalin.marinas@....com" <catalin.marinas@....com>, "corbet@....net"
	<corbet@....net>, "daniel@...earbox.net" <daniel@...earbox.net>,
	"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
	"derekmn@...zon.co.uk" <derekmn@...zon.co.uk>, "eddyz87@...il.com"
	<eddyz87@...il.com>, "haoluo@...gle.com" <haoluo@...gle.com>, "hpa@...or.com"
	<hpa@...or.com>, "Thomson, Jack" <jackabt@...zon.co.uk>, "jannh@...gle.com"
	<jannh@...gle.com>, "jgg@...pe.ca" <jgg@...pe.ca>, "jhubbard@...dia.com"
	<jhubbard@...dia.com>, "joey.gouly@....com" <joey.gouly@....com>,
	"john.fastabend@...il.com" <john.fastabend@...il.com>, "jolsa@...nel.org"
	<jolsa@...nel.org>, "Kalyazin, Nikita" <kalyazin@...zon.co.uk>,
	"kpsingh@...nel.org" <kpsingh@...nel.org>, "kvm@...r.kernel.org"
	<kvm@...r.kernel.org>, "kvmarm@...ts.linux.dev" <kvmarm@...ts.linux.dev>,
	"linux-arm-kernel@...ts.infradead.org"
	<linux-arm-kernel@...ts.infradead.org>, "linux-doc@...r.kernel.org"
	<linux-doc@...r.kernel.org>, "linux-fsdevel@...r.kernel.org"
	<linux-fsdevel@...r.kernel.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "linux-kselftest@...r.kernel.org"
	<linux-kselftest@...r.kernel.org>, "linux-mm@...ck.org" <linux-mm@...ck.org>,
	"lorenzo.stoakes@...cle.com" <lorenzo.stoakes@...cle.com>, "luto@...nel.org"
	<luto@...nel.org>, "martin.lau@...ux.dev" <martin.lau@...ux.dev>,
	"maz@...nel.org" <maz@...nel.org>, "mhocko@...e.com" <mhocko@...e.com>,
	"mingo@...hat.com" <mingo@...hat.com>, "oliver.upton@...ux.dev"
	<oliver.upton@...ux.dev>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
	"peterx@...hat.com" <peterx@...hat.com>, "peterz@...radead.org"
	<peterz@...radead.org>, "pfalcato@...e.de" <pfalcato@...e.de>, "Roy, Patrick"
	<roypat@...zon.co.uk>, "rppt@...nel.org" <rppt@...nel.org>, "sdf@...ichev.me"
	<sdf@...ichev.me>, "seanjc@...gle.com" <seanjc@...gle.com>,
	"shuah@...nel.org" <shuah@...nel.org>, "song@...nel.org" <song@...nel.org>,
	"surenb@...gle.com" <surenb@...gle.com>, "suzuki.poulose@....com"
	<suzuki.poulose@....com>, "tabba@...gle.com" <tabba@...gle.com>,
	"tglx@...utronix.de" <tglx@...utronix.de>, "vbabka@...e.cz" <vbabka@...e.cz>,
	"will@...nel.org" <will@...nel.org>, "willy@...radead.org"
	<willy@...radead.org>, "x86@...nel.org" <x86@...nel.org>, "Cali, Marco"
	<xmarcalx@...zon.co.uk>, "yonghong.song@...ux.dev" <yonghong.song@...ux.dev>,
	"yuzenghui@...wei.com" <yuzenghui@...wei.com>
Subject: Re: [PATCH v7 05/12] KVM: guest_memfd: Add flag to remove from direct
 map

On Thu, 2025-09-25 at 12:00 +0100, David Hildenbrand wrote:
> On 24.09.25 17:22, Roy, Patrick wrote:
>> Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD()
>> ioctl. When set, guest_memfd folios will be removed from the direct map
>> after preparation, with direct map entries only restored when the folios
>> are freed.
>>
>> To ensure these folios do not end up in places where the kernel cannot
>> deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
>> address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested.
>>
>> Add KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP to let userspace discover whether
>> guest_memfd supports GUEST_MEMFD_FLAG_NO_DIRECT_MAP. Support depends on
>> guest_memfd itself being supported, but also on whether linux supports
>> manipulatomg the direct map at page granularity at all (possible most of
>> the time, outliers being arm64 where its impossible if the direct map
>> has been setup using hugepages, as arm64 cannot break these apart due to
>> break-before-make semantics, and powerpc, which does not select
>> ARCH_HAS_SET_DIRECT_MAP, though also doesn't support guest_memfd
>> anyway).
>>
>> Note that this flag causes removal of direct map entries for all
>> guest_memfd folios independent of whether they are "shared" or "private"
>> (although current guest_memfd only supports either all folios in the
>> "shared" state, or all folios in the "private" state if
>> GUEST_MEMFD_FLAG_MMAP is not set). The usecase for removing direct map
>> entries of also the shared parts of guest_memfd are a special type of
>> non-CoCo VM where, host userspace is trusted to have access to all of
>> guest memory, but where Spectre-style transient execution attacks
>> through the host kernel's direct map should still be mitigated.  In this
>> setup, KVM retains access to guest memory via userspace mappings of
>> guest_memfd, which are reflected back into KVM's memslots via
>> userspace_addr. This is needed for things like MMIO emulation on x86_64
>> to work.
>>
>> Direct map entries are zapped right before guest or userspace mappings
>> of gmem folios are set up, e.g. in kvm_gmem_fault_user_mapping() or
>> kvm_gmem_get_pfn() [called from the KVM MMU code]. The only place where
>> a gmem folio can be allocated without being mapped anywhere is
>> kvm_gmem_populate(), where handling potential failures of direct map
>> removal is not possible (by the time direct map removal is attempted,
>> the folio is already marked as prepared, meaning attempting to re-try
>> kvm_gmem_populate() would just result in -EEXIST without fixing up the
>> direct map state). These folios are then removed form the direct map
>> upon kvm_gmem_get_pfn(), e.g. when they are mapped into the guest later.
>>
>> Signed-off-by: Patrick Roy <roypat@...zon.co.uk>
>> ---
>>   Documentation/virt/kvm/api.rst    |  5 +++
>>   arch/arm64/include/asm/kvm_host.h | 12 ++++++
>>   include/linux/kvm_host.h          |  6 +++
>>   include/uapi/linux/kvm.h          |  2 +
>>   virt/kvm/guest_memfd.c            | 61 ++++++++++++++++++++++++++++++-
>>   virt/kvm/kvm_main.c               |  5 +++
>>   6 files changed, 90 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index c17a87a0a5ac..b52c14d58798 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -6418,6 +6418,11 @@ When the capability KVM_CAP_GUEST_MEMFD_MMAP is supported, the 'flags' field
>>   supports GUEST_MEMFD_FLAG_MMAP.  Setting this flag on guest_memfd creation
>>   enables mmap() and faulting of guest_memfd memory to host userspace.
>>
>> +When the capability KVM_CAP_GMEM_NO_DIRECT_MAP is supported, the 'flags' field
>> +supports GUEST_MEMFG_FLAG_NO_DIRECT_MAP. Setting this flag makes the guest_memfd
>> +instance behave similarly to memfd_secret, and unmaps the memory backing it from
>> +the kernel's address space after allocation.
>> +
> 
> Do we want to document what the implication of that is? Meaning,
> limitations etc. I recall that we would need the user mapping for gmem
> slots to be properly set up.
> 
> Is that still the case in this patch set?

The ->userspace_addr thing is the general requirement for non-CoCo VMs,
and not specific for direct map removal (e.g. I expect direct map
removal to just work out of the box for CoCo setups, where KVM already
cannot access guest memory, ignoring the question of whether direct map
removal is even useful for CoCo VMs). So I don't think it should be
documented as part of
KVM_CAP_GMEM_NO_DIRECT_MAP/GUEST_MEMFG_FLAG_NO_DIRECT_MAP (heh, there's
a typo I just noticed. "MEMFG". Also "GMEM" needs to be "GUEST_MEMFD".
Will fix that), but rather as part of GUEST_MEMFD_FLAG_MMAP. I can add a
patch it there (or maybe send it separately, since FLAG_MMAP is already
in -next?).

>>   When the KVM MMU performs a PFN lookup to service a guest fault and the backing
>>   guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be
>>   consumed from guest_memfd, regardless of whether it is a shared or a private
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index 2f2394cce24e..0bfd8e5fd9de 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -19,6 +19,7 @@
>>   #include <linux/maple_tree.h>
>>   #include <linux/percpu.h>
>>   #include <linux/psci.h>
>> +#include <linux/set_memory.h>
>>   #include <asm/arch_gicv3.h>
>>   #include <asm/barrier.h>
>>   #include <asm/cpufeature.h>
>> @@ -1706,5 +1707,16 @@ void compute_fgu(struct kvm *kvm, enum fgt_group_id fgt);
>>   void get_reg_fixed_bits(struct kvm *kvm, enum vcpu_sysreg reg, u64 *res0, u64 *res1);
>>   void check_feature_map(void);
>>
>> +#ifdef CONFIG_KVM_GUEST_MEMFD
>> +static inline bool kvm_arch_gmem_supports_no_direct_map(void)
>> +{
>> +     /*
>> +      * Without FWB, direct map access is needed in kvm_pgtable_stage2_map(),
>> +      * as it calls dcache_clean_inval_poc().
>> +      */
>> +     return can_set_direct_map() && cpus_have_final_cap(ARM64_HAS_STAGE2_FWB);
>> +}
>> +#define kvm_arch_gmem_supports_no_direct_map kvm_arch_gmem_supports_no_direct_map
>> +#endif /* CONFIG_KVM_GUEST_MEMFD */
>>
> 
> I strongly assume that the aarch64 support should be moved to a separate
> patch -- if possible, see below.
> 
>>   #endif /* __ARM64_KVM_HOST_H__ */
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 1d0585616aa3..73a15cade54a 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -731,6 +731,12 @@ static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
>>   bool kvm_arch_supports_gmem_mmap(struct kvm *kvm);
>>   #endif
>>
>> +#ifdef CONFIG_KVM_GUEST_MEMFD
>> +#ifndef kvm_arch_gmem_supports_no_direct_map
>> +#define kvm_arch_gmem_supports_no_direct_map can_set_direct_map
>> +#endif
> 
> Hm, wouldn't it be better to have an opt-in per arch, and really only
> unlock the ones we know work (tested etc), explicitly in separate patches.
> 

Ack, can definitely do that. Something like 

#ifndef kvm_arch_gmem_supports_no_direct_map
static inline bool kvm_arch_gmem_supports_no_direct_map()
{
	return false;
}
#endif

and then actual definitions (in separate patches) in the arm64 and x86
headers?

On a related note, maybe PATCH 2 should only export
set_direct_map_valid_noflush() for the architectures on which we
actually need it? Which would only be x86, since arm64 doesnt allow
building KVM as a module, and nothing else supports guest_memfd right
now.

> [...]
> 
>>
>>   #include "kvm_mm.h"
>>
>> @@ -42,6 +45,44 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo
>>       return 0;
>>   }
>>
>> +#define KVM_GMEM_FOLIO_NO_DIRECT_MAP BIT(0)
>> +
>> +static bool kvm_gmem_folio_no_direct_map(struct folio *folio)
>> +{
>> +     return ((u64) folio->private) & KVM_GMEM_FOLIO_NO_DIRECT_MAP;
>> +}
>> +
>> +static int kvm_gmem_folio_zap_direct_map(struct folio *folio)
>> +{
>> +     if (kvm_gmem_folio_no_direct_map(folio))
>> +             return 0;
>> +
>> +     int r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
>> +                                      false);
>> +
>> +     if (!r) {
>> +             unsigned long addr = (unsigned long) folio_address(folio);
> 
> empty line missing.
> 

Ack

>> +             folio->private = (void *) ((u64) folio->private & KVM_GMEM_FOLIO_NO_DIRECT_MAP);
>> +             flush_tlb_kernel_range(addr, addr + folio_size(folio));
>> +     }
>> +
>> +     return r;
>> +}
>> +
>> +static void kvm_gmem_folio_restore_direct_map(struct folio *folio)
>> +{
>> +     /*
>> +      * Direct map restoration cannot fail, as the only error condition
>> +      * for direct map manipulation is failure to allocate page tables
>> +      * when splitting huge pages, but this split would have already
>> +      * happened in set_direct_map_invalid_noflush() in kvm_gmem_folio_zap_direct_map().
>> +      * Thus set_direct_map_valid_noflush() here only updates prot bits.
>> +      */
>> +     if (kvm_gmem_folio_no_direct_map(folio))
>> +             set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
>> +                                      true);
>> +}
>> +
>>   static inline void kvm_gmem_mark_prepared(struct folio *folio)
>>   {
>>       folio_mark_uptodate(folio);
>> @@ -324,13 +365,14 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>>       struct inode *inode = file_inode(vmf->vma->vm_file);
>>       struct folio *folio;
>>       vm_fault_t ret = VM_FAULT_LOCKED;
>> +     int err;
>>
>>       if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
>>               return VM_FAULT_SIGBUS;
>>
>>       folio = kvm_gmem_get_folio(inode, vmf->pgoff);
>>       if (IS_ERR(folio)) {
>> -             int err = PTR_ERR(folio);
>> +             err = PTR_ERR(folio);
>>
>>               if (err == -EAGAIN)
>>                       return VM_FAULT_RETRY;
>> @@ -348,6 +390,13 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>>               kvm_gmem_mark_prepared(folio);
>>       }
>>
>> +     err = kvm_gmem_folio_zap_direct_map(folio);
>> +
> 
> I'd drop this empty line here.
>

Ack

>> +     if (err) {
>> +             ret = vmf_error(err);
>> +             goto out_folio;
>> +     }
>> +
>>       vmf->page = folio_file_page(folio, vmf->pgoff);
>>
>>   out_folio:
>> @@ -435,6 +484,8 @@ static void kvm_gmem_free_folio(struct folio *folio)
>>       kvm_pfn_t pfn = page_to_pfn(page);
>>       int order = folio_order(folio);
>>
>> +     kvm_gmem_folio_restore_direct_map(folio);
>> +
>>       kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
>>   }
>>
>> @@ -499,6 +550,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>>       /* Unmovable mappings are supposed to be marked unevictable as well. */
>>       WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>>
>> +     if (flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)
>> +             mapping_set_no_direct_map(inode->i_mapping);
>> +
>>       kvm_get_kvm(kvm);
>>       gmem->kvm = kvm;
>>       xa_init(&gmem->bindings);
>> @@ -523,6 +577,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>>       if (kvm_arch_supports_gmem_mmap(kvm))
>>               valid_flags |= GUEST_MEMFD_FLAG_MMAP;
>>
>> +     if (kvm_arch_gmem_supports_no_direct_map())
>> +             valid_flags |= GUEST_MEMFD_FLAG_NO_DIRECT_MAP;
>> +
>>       if (flags & ~valid_flags)
>>               return -EINVAL;
>>
>> @@ -687,6 +744,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>       if (!is_prepared)
>>               r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>>
>> +     kvm_gmem_folio_zap_direct_map(folio);
>> +
>>       folio_unlock(folio);
>>
>>       if (!r)
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 18f29ef93543..b5e702d95230 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -65,6 +65,7 @@
>>   #include <trace/events/kvm.h>
>>
>>   #include <linux/kvm_dirty_ring.h>
>> +#include <linux/set_memory.h>
> 
> Likely not required here.
> 

Seems for now it is because of how
kvm_arch_gmem_supports_no_direct_map() is defined, but I suspect the
need will disappear once that is changed as you suggested above :)

> -- 
> Cheers
> 
> David / dhildenb

Best,
Patrick



Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ