Message-ID: <diqzv7kxsobo.fsf@google.com>
Date: Thu, 02 Oct 2025 08:24:59 +0000
From: Ackerley Tng <ackerleytng@...gle.com>
To: Ira Weiny <ira.weiny@...el.com>, Sean Christopherson <seanjc@...gle.com>
Cc: kvm@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Yan Zhao <yan.y.zhao@...el.com>, Fuad Tabba <tabba@...gle.com>, 
	Binbin Wu <binbin.wu@...ux.intel.com>, Michael Roth <michael.roth@....com>, 
	Ira Weiny <ira.weiny@...el.com>, Rick P Edgecombe <rick.p.edgecombe@...el.com>, 
	Vishal Annapurve <vannapurve@...gle.com>, David Hildenbrand <david@...hat.com>, 
	Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use
 shareability to guard faulting

Ira Weiny <ira.weiny@...el.com> writes:

> Ackerley Tng wrote:
>> Sean Christopherson <seanjc@...gle.com> writes:
>> 
>
> [snip]
>
>> 
>> > Internally, that lets us do some fun things in KVM.  E.g. if we make the "disable
>> > legacy per-VM memory attributes" a read-only module param, then we can wire up a
>> > static_call() for kvm_get_memory_attributes() and then kvm_mem_is_private() will
>> > Just Work.
>> >
>> >   static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
>> >   {
>> > 	return static_call(__kvm_get_memory_attributes)(kvm, gfn);
>> >   }
>> >
>> >   static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>> >   {
>> > 	return kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
>> >   }
>> >
>> > That might trigger some additional surgery if/when we want to support RWX
>> > protections on a per-VM basis _and_ a per-gmem basis, but I suspect such churn
>> > would pale in comparison to the overall support needed for RWX protections.
>> >
>> 
>> RWX protections are more of a VM-level property, if I understood the use
>> case correctly that some gfn ranges are to be marked non-executable by
>> userspace. Setting RWX within guest_memfd would be kind of awkward since
>> userspace must first translate GFN to offset, then set it using the
>> offset within guest_memfd. Hence I think it's okay to have RWX stuff go
>> through the regular KVM_SET_MEMORY_ATTRIBUTES *VM* ioctl and have it
>> tracked in mem_attr_array.
>> 
>> I'd prefer not to have the module param choose between the use of
>> mem_attr_array and guest_memfd conversion in case we need both
>> mem_attr_array to support other stuff in future while supporting
>> conversions.
>
> I'm getting pretty confused on how userspace is going to know which ioctl
> to use VM vs gmem.
>

It is confusing, yes!

> I was starting to question if going through the VM ioctl should actually
> change the guest_memfd flags (shareability).
>

At one of the guest_memfd biweeklies, we came to the conclusion that we
should have a per-VM KVM_CAP_DISABLE_LEGACY_PRIVATE_TRACKING, which
disables just the use of KVM_MEMORY_ATTRIBUTE_PRIVATE with the
KVM_SET_MEMORY_ATTRIBUTES ioctl. Enabling
KVM_CAP_DISABLE_LEGACY_PRIVATE_TRACKING is also the only way to enable
conversions for a guest_memfd with mmap() support.

IOW, KVM_CAP_DISABLE_LEGACY_PRIVATE_TRACKING makes userspace choose
either

+ Using guest_memfd for private memory and some other memory
  (e.g. anonymous memory) for shared memory (aka legacy dual backing)
    + And using KVM_SET_MEMORY_ATTRIBUTES VM ioctl for conversions
    + And using mem_attr_array to track shared/private status

+ Using guest_memfd for both private and shared memory (aka single backing)
    + And using the guest_memfd ioctl for conversions
    + And using guest_memfd shareability to track shared/private status
    
Since userspace has to choose one of the above, there's no point in the
VM ioctl affecting shareability.

Sean's suggestion of a module param moves this choice from VM-level to
KVM/host-level.

> In a prototype I'm playing with shareability has become a bit field which
> I think aligns with the idea of expanding the memory attributes.

I guess this is tangentially related and could do with some profiling,
but we should be careful about adding too many states in the maple tree.

Conversion iterates over offset ranges in the maple tree, and iteration
is faster if there are fewer nodes in the maple tree.

If we just have two states (shared/private) in the maple tree, each node
is either all private or all shared.

If we have more states, private ranges might get fragmented based on the
orthogonal bits, e.g. RWX, which could then impact conversion
performance.

> But I've
> had some issues with the TDX tests in trying to decipher when to call
> vm_set_memory_attributes() vs guest_memfd_convert_private().
>

Hope the above explanation helps!

+ Legacy dual backing: vm_set_memory_attributes()
+ Single backing: guest_memfd_convert_private()

I don't think this will change much even with the module param, since
userspace will still need to know whether to pass in a vm fd or a
guest_memfd fd.

Or maybe vm_set_memory_attributes() can take a VM fd, query the module
param, and then figure out whether it should pass a VM fd or a
guest_memfd fd?

>> 
>> [...snip...]
>> 
>> >>  static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
>> >>  				    pgoff_t index, struct folio *folio)
>> >>  {
>> >> @@ -333,7 +404,7 @@ static vm_fault_t kvm_gmem_fault_shared(struct vm_fault *vmf)
>> >>  
>> >>  	filemap_invalidate_lock_shared(inode->i_mapping);
>> >>  
>> >> -	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
>> >> +	folio = kvm_gmem_get_shared_folio(inode, vmf->pgoff);
>> >
>> > I am fairly certain there's a TOCTOU bug here.  AFAICT, nothing prevents the
>> > underlying memory from being converted from shared=>private after checking that
>> > the page is SHARED.
>> >
>> 
>> Conversions take the filemap_invalidate_lock() too, along with
>> allocations, truncations.
>> 
>> Because the filemap_invalidate_lock() might be reused for other
>> fs-specific operations, I didn't do the mt_set_external_lock() thing to
>> lock at a low level to avoid nested locking or special maple tree code
>> to avoid taking the lock on other paths.
>
> I don't think using the filemap_invalidate_lock() is going to work well
> here.  I've had some hangs on it in my testing and experiments.  I think
> it is better to specifically lock the state tracking itself.  I believe
> Michael mentioned this as well in a previous thread.

I definitely took the big-hammer lock as a starting point; it might be
optimizable.

Considerations so far: when a conversion is happening, these have to be
locked out:

+ Conversions from competing threads
+ Allocations in kvm_gmem_fault_user_mapping(), because whether an
  offset can be faulted depends on the outcome of conversion
+ Allocations (fallocate() or kvm_gmem_get_pfn()) and truncations,
  because conversions (for now) involve removing a folio from the
  filemap, restructuring it, then restoring it to the filemap, and
    + Allocations should reuse a folio that was already in the filemap
    + Truncations remove folios, and should not skip removing a folio
      just because it was taken out of the filemap for conversion
+ Memory failure handling, where we don't remove folios from the
  filemap, but might restructure them, splitting huge folios so that
  only the pages with failed memory get unmapped

I think that, essentially, because conversion involves restructuring,
restructuring involves filemap operations, and other filemap operations
have to wait for it, conversion takes the same filemap_invalidate_lock()
that those filemap operations use.

>
> Ira
>
> [snip]
