Message-ID: <68ddb87b16415_28f5c229470@iweiny-mobl.notmuch>
Date: Wed, 1 Oct 2025 18:25:47 -0500
From: Ira Weiny <ira.weiny@...el.com>
To: Ackerley Tng <ackerleytng@...gle.com>, Sean Christopherson
<seanjc@...gle.com>
CC: <kvm@...r.kernel.org>, <linux-kernel@...r.kernel.org>, Yan Zhao
<yan.y.zhao@...el.com>, Fuad Tabba <tabba@...gle.com>, Binbin Wu
<binbin.wu@...ux.intel.com>, Michael Roth <michael.roth@....com>, Ira Weiny
<ira.weiny@...el.com>, Rick P Edgecombe <rick.p.edgecombe@...el.com>, "Vishal
Annapurve" <vannapurve@...gle.com>, David Hildenbrand <david@...hat.com>,
Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use
shareability to guard faulting
Ackerley Tng wrote:
> Sean Christopherson <seanjc@...gle.com> writes:
>
[snip]
>
> > Internally, that lets us do some fun things in KVM. E.g. if we make the "disable
> > legacy per-VM memory attributes" a read-only module param, then we can wire up a
> > static_call() for kvm_get_memory_attributes() and then kvm_mem_is_private() will
> > Just Work.
> >
> > static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> > {
> > 	return static_call(__kvm_get_memory_attributes)(kvm, gfn);
> > }
> >
> > static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > {
> > 	return kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> > }
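FWIW, the wiring for that could look something like the below (completely
untested sketch; the param name and kvm_gmem_get_mem_attributes() are
made up):

static bool vm_mem_attributes = true;
module_param(vm_mem_attributes, bool, 0444);	/* read-only */

/* Default to the legacy per-VM xarray lookup. */
static unsigned long kvm_vm_get_mem_attributes(struct kvm *kvm, gfn_t gfn)
{
	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
}
DEFINE_STATIC_CALL(__kvm_get_memory_attributes, kvm_vm_get_mem_attributes);

static int __init kvm_mem_attributes_init(void)
{
	/* Route attribute lookups to guest_memfd when per-VM attributes are off. */
	if (!vm_mem_attributes)
		static_call_update(__kvm_get_memory_attributes,
				   kvm_gmem_get_mem_attributes);
	return 0;
}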
> >
> > That might trigger some additional surgery if/when we want to support RWX
> > protections on a per-VM basis _and_ a per-gmem basis, but I suspect such churn
> > would pale in comparison to the overall support needed for RWX protections.
> >
>
> RWX protections are more of a VM-level property, if I understood the use
> case correctly: some gfn ranges are to be marked non-executable by
> userspace. Setting RWX within guest_memfd would be kind of awkward, since
> userspace must first translate each GFN to an offset, then set the
> attribute using that offset within guest_memfd. Hence I think it's okay
> to have the RWX attributes go through the regular
> KVM_SET_MEMORY_ATTRIBUTES *VM* ioctl and be tracked in mem_attr_array.
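For illustration, with the VM ioctl that flow stays as simple as today's
attributes API (KVM_MEMORY_ATTRIBUTE_NO_EXEC below is hypothetical, RWX
attributes don't exist yet):

	struct kvm_memory_attributes attrs = {
		.address	= gpa,		/* GFN-based, no gmem offset translation */
		.size		= range_size,
		.attributes	= KVM_MEMORY_ATTRIBUTE_NO_EXEC,
	};

	if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs))
		err(1, "KVM_SET_MEMORY_ATTRIBUTES");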
>
> I'd prefer not to have the module param choose between mem_attr_array
> and guest_memfd conversions, in case we need mem_attr_array to support
> other attributes in the future while still supporting conversions.
I'm getting pretty confused about how userspace is going to know which
ioctl to use, VM vs gmem.
I was starting to question whether going through the VM ioctl should
actually change the guest_memfd flags (shareability).
In the prototype I'm playing with, shareability has become a bit field,
which I think aligns with the idea of expanding the memory attributes.
But I've had some issues with the TDX tests in trying to decipher when to
call vm_set_memory_attributes() vs guest_memfd_convert_private().
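Roughly like this (names invented, nothing here is final):

	#define GMEM_SHAREABLE_GUEST	BIT(0)	/* guest may fault (private) */
	#define GMEM_SHAREABLE_HOST	BIT(1)	/* host may fault/mmap (shared) */
	/* ...leaves room for more attribute-style bits later */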
>
> > The kvm_memory_attributes structure is compatible, all that's needed AFAICT is a
> > union to clarify it's a pgoff instead of an address when used for guest_memfd.
> >
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 52f6000ab020..e0d8255ac8d2 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -1590,7 +1590,10 @@ struct kvm_stats_desc {
> >  #define KVM_SET_MEMORY_ATTRIBUTES	_IOW(KVMIO,  0xd2, struct kvm_memory_attributes)
> >  
> >  struct kvm_memory_attributes {
> > -	__u64 address;
> > +	union {
> > +		__u64 address;
> > +		__u64 offset;
> > +	};
> >  	__u64 size;
> >  	__u64 attributes;
> >  	__u64 flags;
> >
>
> struct kvm_memory_attributes doesn't have room for reporting the offset
> at which conversion failed (error_offset in the new struct). How do we
> handle this? Do we reuse the flags field, or do we not report
> error_offset?
We could extend this for gmem's version of the struct.
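Something like this, maybe (sketch only; the layout is illustrative):

	struct kvm_gmem_memory_attributes {
		__u64 offset;		/* in: gmem offset rather than GPA */
		__u64 size;
		__u64 attributes;
		__u64 flags;
		__u64 error_offset;	/* out: offset at which conversion failed */
	};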
>
> >> +static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index);
> >> +
> >> +static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
> >> +{
> >> +	return inode->i_mapping->i_private_data;
> >
> > This is a hilariously bad helper. Everyone and their mother is going to think
> > about "private vs. shared" when they see kvm_gmem_private(), at least on the x86
> > side.
> >
>
> Totally missed this interpretation of private, lol. Too many
> interpretations of private: MAP_PRIVATE, CoCo's private vs shared, and
> i_private_data.
>
FWIW, this did not confuse me, and it probably should have... ;-) I'm fine
with Sean's suggestion, though.
> >> +}
> >> +
> >> +#else
> >> +
> >> +static int kvm_gmem_shareability_setup(struct maple_tree *mt, loff_t size, u64 flags)
> >> +{
> >> +	return 0;
> >> +}
> >> +
> >> +static inline struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
> >> +{
> >> +	WARN_ONCE(1, "Unexpected call to get shared folio.\n");
> >> +	return NULL;
> >> +}
> >> +
> >> +#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
> >> +
> >> static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
> >> 				    pgoff_t index, struct folio *folio)
> >> {
> >> @@ -333,7 +404,7 @@ static vm_fault_t kvm_gmem_fault_shared(struct vm_fault *vmf)
> >>
> >> 	filemap_invalidate_lock_shared(inode->i_mapping);
> >>
> >> -	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> >> +	folio = kvm_gmem_get_shared_folio(inode, vmf->pgoff);
> >
> > I am fairly certain there's a TOCTOU bug here. AFAICT, nothing prevents the
> > underlying memory from being converted from shared=>private after checking that
> > the page is SHARED.
> >
>
> Conversions take the filemap_invalidate_lock() too, along with
> allocations and truncations.
>
> Because the filemap_invalidate_lock() might be reused for other
> fs-specific operations, I didn't use mt_set_external_lock() to lock the
> maple tree at a low level; that would have required either nested
> locking or special maple tree code to avoid taking the lock on other
> paths.
I don't think using the filemap_invalidate_lock() is going to work well
here. I've had some hangs on it in my testing and experiments. I think
it is better to specifically lock the state tracking itself. I believe
Michael mentioned this as well in a previous thread.
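Something along these lines, perhaps (untested sketch; field names
invented):

	struct kvm_gmem_inode_private {
		struct maple_tree shareability;
		/* Protects shareability only: readers are faults, writers convert. */
		struct rw_semaphore shareability_lock;
	};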
Ira
[snip]