[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <68ddb87b16415_28f5c229470@iweiny-mobl.notmuch>
Date: Wed, 1 Oct 2025 18:25:47 -0500
From: Ira Weiny <ira.weiny@...el.com>
To: Ackerley Tng <ackerleytng@...gle.com>, Sean Christopherson
<seanjc@...gle.com>
CC: <kvm@...r.kernel.org>, <linux-kernel@...r.kernel.org>, Yan Zhao
<yan.y.zhao@...el.com>, Fuad Tabba <tabba@...gle.com>, Binbin Wu
<binbin.wu@...ux.intel.com>, Michael Roth <michael.roth@....com>, Ira Weiny
<ira.weiny@...el.com>, Rick P Edgecombe <rick.p.edgecombe@...el.com>, "Vishal
Annapurve" <vannapurve@...gle.com>, David Hildenbrand <david@...hat.com>,
Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use
shareability to guard faulting
Ackerley Tng wrote:
> Sean Christopherson <seanjc@...gle.com> writes:
>
[snip]
>
> > Internally, that let's us do some fun things in KVM. E.g. if we make the "disable
> > legacy per-VM memory attributes" a read-only module param, then we can wire up a
> > static_call() for kvm_get_memory_attributes() and then kvm_mem_is_private() will
> > Just Work.
> >
> > static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> > {
> > return static_call(__kvm_get_memory_attributes)(kvm, gfn);
> > }
> >
> > static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > {
> > return kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> > }
> >
> > That might trigger some additional surgery if/when we want to support RWX
> > protections on a per-VM basis _and_ a per-gmem basic, but I suspect such churn
> > would pale in comparison to the overall support needed for RWX protections.
> >
>
> RWX protections are more of a VM-level property, if I understood the use
> case correctly that some gfn ranges are to be marked non-executable by
> userspace. Setting RWX within guest_memfd would be kind of awkward since
> userspace must first translate GFN to offset, then set it using the
> offset within guest_memfd. Hence I think it's okay to have RWX stuff go
> through the regular KVM_SET_MEMORY_ATTRIBUTES *VM* ioctl and have it
> tracked in mem_attr_array.
>
> I'd prefer not to have the module param choose between the use of
> mem_attr_array and guest_memfd conversion in case we need both
> mem_attr_array to support other stuff in future while supporting
> conversions.
I'm getting pretty confused on how userspace is going to know which ioctl
to use VM vs gmem.
I was starting to question if going through the VM ioctl should actually
change the guest_memfd flags (shareability).
In a prototype I'm playing with shareability has become a bit field which
I think aligns with the idea of expanding the memory attributes. But I've
had some issues with the TDX tests in trying to decipher when to call
vm_set_memory_attributes() vs guest_memfd_convert_private().
>
> > The kvm_memory_attributes structure is compatible, all that's needed AFAICT is a
> > union to clarify it's a pgoff instead of an address when used for guest_memfd.
> >
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 52f6000ab020..e0d8255ac8d2 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -1590,7 +1590,10 @@ struct kvm_stats_desc {
> > #define KVM_SET_MEMORY_ATTRIBUTES _IOW(KVMIO, 0xd2, struct kvm_memory_attributes)
> >
> > struct kvm_memory_attributes {
> > - __u64 address;
> > + union {
> > + __u64 address;
> > + __u64 offset;
> > + };
> > __u64 size;
> > __u64 attributes;
> > __u64 flags;
> >
>
> struct kvm_memory_attributes doesn't have room for reporting the offset
> at which conversion failed (error_offset in the new struct). How do we
> handle this? Do we reuse the flags field, or do we not report
> error_offset?
We could extend this for gmem's version of the struct.
>
> >> +static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index);
> >> +
> >> +static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
> >> +{
> >> + return inode->i_mapping->i_private_data;
> >
> > This is a hilarious bad helper. Everyone and their mother is going to think
> > about "private vs. shared" when they see kvm_gmem_private(), at least on the x86
> > side.
> >
>
> Totally missed this interpretation of private, lol. Too many
> interpretations of private: MAP_PRIVATE, CoCo's private vs shared, and
> i_private_data.
>
FWIW this did not confuse me and it probably should have... ;-) I'm fine
with Sean's suggestion though.
> >> +}
> >> +
> >> +#else
> >> +
> >> +static int kvm_gmem_shareability_setup(struct maple_tree *mt, loff_t size, u64 flags)
> >> +{
> >> + return 0;
> >> +}
> >> +
> >> +static inline struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
> >> +{
> >> + WARN_ONCE("Unexpected call to get shared folio.")
> >> + return NULL;
> >> +}
> >> +
> >> +#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
> >> +
> >> static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
> >> pgoff_t index, struct folio *folio)
> >> {
> >> @@ -333,7 +404,7 @@ static vm_fault_t kvm_gmem_fault_shared(struct vm_fault *vmf)
> >>
> >> filemap_invalidate_lock_shared(inode->i_mapping);
> >>
> >> - folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> >> + folio = kvm_gmem_get_shared_folio(inode, vmf->pgoff);
> >
> > I am fairly certain there's a TOCTOU bug here. AFAICT, nothing prevents the
> > underlying memory from being converted from shared=>private after checking that
> > the page is SHARED.
> >
>
> Conversions take the filemap_invalidate_lock() too, along with
> allocations, truncations.
>
> Because the filemap_invalidate_lock() might be reused for other
> fs-specific operations, I didn't do the mt_set_external_lock() thing to
> lock at a low level to avoid nested locking or special maple tree code
> to avoid taking the lock on other paths.
I don't think using the filemap_invalidate_lock() is going to work well
here. I've had some hangs on it in my testing and experiments. I think
it is better to specifically lock the state tracking itself. I believe
Michael mentioned this as well in a previous thread.
Ira
[snip]
Powered by blists - more mailing lists