linux-kernel - Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <68deb76c2dc2a_2957e22943f@iweiny-mobl.notmuch>
Date: Thu, 2 Oct 2025 12:33:32 -0500
From: Ira Weiny <ira.weiny@...el.com>
To: Ackerley Tng <ackerleytng@...gle.com>, Ira Weiny <ira.weiny@...el.com>,
	Sean Christopherson <seanjc@...gle.com>
CC: <kvm@...r.kernel.org>, <linux-kernel@...r.kernel.org>, Yan Zhao
	<yan.y.zhao@...el.com>, Fuad Tabba <tabba@...gle.com>, Binbin Wu
	<binbin.wu@...ux.intel.com>, Michael Roth <michael.roth@....com>, Ira Weiny
	<ira.weiny@...el.com>, Rick P Edgecombe <rick.p.edgecombe@...el.com>, "Vishal
 Annapurve" <vannapurve@...gle.com>, David Hildenbrand <david@...hat.com>,
	Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use
 shareability to guard faulting

Ackerley Tng wrote:
> Ira Weiny <ira.weiny@...el.com> writes:
> 
> > Ackerley Tng wrote:
> >> Sean Christopherson <seanjc@...gle.com> writes:
> >> 
> >
> > [snip]
> >

[snip]

> >> 
> >> I'd prefer not to have the module param choose between the use of
> >> mem_attr_array and guest_memfd conversion in case we need both
> >> mem_attr_array to support other stuff in future while supporting
> >> conversions.
> >
> > I'm getting pretty confused on how userspace is going to know which ioctl
> > to use VM vs gmem.
> >
> 
> It is confusing, yes!
> 
> > I was starting to question if going through the VM ioctl should actually
> > change the guest_memfd flags (shareability).
> >
> 
> At one of the guest_memfd biweeklies, we came to the conclusion that we
> should have a per-VM KVM_CAP_DISABLE_LEGACY_PRIVATE_TRACKING, which will
> disable the use of just KVM_MEMORY_ATTRIBUTE_PRIVATE for the
> KVM_SET_MEMORY_ATTRIBUTES ioctl, and
> KVM_CAP_DISABLE_LEGACY_PRIVATE_TRACKING is the only way to enable
> conversions for a guest_memfd with mmap() support.
> 
> IOW, KVM_CAP_DISABLE_LEGACY_PRIVATE_TRACKING makes userspace choose
> either
> 
> + Using guest_memfd for private memory and some other memory
>   (e.g. anonymous memory) for shared memory (aka legacy dual backing)
>     + And using KVM_SET_MEMORY_ATTRIBUTES VM ioctl for conversions
>     + And using mem_attr_array to track shared/private status
> 
> + Using guest_memfd for both private and shared memory (aka single backing)
>     + And using the guest_memfd ioctl for conversions
>     + And using guest_memfd shareability to track shared/private status
>     
> Since userspace has to choose one of the above, there's no point in the
> VM ioctl affecting shareability.
> 
> Sean's suggestion of a module param moves this choice from VM-level to
> KVM/host-level.

Ok I remember this discussion but I was not clear on the mechanics.  This
helps clarify things thanks!

> 
> > In a prototype I'm playing with shareability has become a bit field which
> > I think aligns with the idea of expanding the memory attributes.
> 
> I guess this is tangentially related and could do with some profiling,
> but we should be careful about adding too many states in the maple tree.
> 
> Conversion iterates over offset ranges in the maple tree, and iteration
> is faster if there are fewer nodes in the maple tree.
> 
> If we just have two states (shared/private) in the maple tree, each node
> is either all private or all shared.
> 
> If we have more states, private ranges might get fragmented based on the
> orthogonal bits, e.g. RWX, which could then impact conversion
> performance.

I'm thinking along these same lines yea.

> 
> > But I've
> > had some issues with the TDX tests in trying to decipher when to call
> > vm_set_memory_attributes() vs guest_memfd_convert_private().
> >
> 
> Hope the above explanation helps!
> 
> + Legacy dual backing: vm_set_memory_attributes()
> + Single backing: guest_memfd_convert_private()
> 
> I don't think this will change much even with the module param, since
> userspace will still need to know whether to pass in a vm fd or a
> guest_memfd fd.
> 
> Or maybe vm_set_memory_attributes() can take a vm fd, then query module
> params and then figure out if it should pass vm or guest_mefd fds?

For the tests that might help for sure.

Generally I'm hesitant to introduce module parameters as they are pretty
big hammers to change behavior.  But if it makes things easier and is
acceptable then I'm not going to complain...

> 
> >> 
> >> [...snip...]
> >> 
> >> >>  static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
> >> >>  				    pgoff_t index, struct folio *folio)
> >> >>  {
> >> >> @@ -333,7 +404,7 @@ static vm_fault_t kvm_gmem_fault_shared(struct vm_fault *vmf)
> >> >>  
> >> >>  	filemap_invalidate_lock_shared(inode->i_mapping);
> >> >>  
> >> >> -	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> >> >> +	folio = kvm_gmem_get_shared_folio(inode, vmf->pgoff);
> >> >
> >> > I am fairly certain there's a TOCTOU bug here.  AFAICT, nothing prevents the
> >> > underlying memory from being converted from shared=>private after checking that
> >> > the page is SHARED.
> >> >
> >> 
> >> Conversions take the filemap_invalidate_lock() too, along with
> >> allocations, truncations.
> >> 
> >> Because the filemap_invalidate_lock() might be reused for other
> >> fs-specific operations, I didn't do the mt_set_external_lock() thing to
> >> lock at a low level to avoid nested locking or special maple tree code
> >> to avoid taking the lock on other paths.
> >
> > I don't think using the filemap_invalidate_lock() is going to work well
> > here.  I've had some hangs on it in my testing and experiments.  I think
> > it is better to specifically lock the state tracking itself.  I believe
> > Michael mentioned this as well in a previous thread.
> 
> Definitely took the big hammer lock for a start and might be optimizable.
> 
> Considerations so far: when a conversion is happening, these have to be
> locked out:
> 
> + Conversions from competing threads

Agreed.  And this needs filemap_invalidate_lock() as well as the maple
tree lock.

Call this item 1.

> + Allocations in kvm_gmem_fault_user_mapping(), because whether an
>   offset can be faulted depends on the outcome of conversion

Agreed.  And this needs filemap_invalidate_lock() as well as the maple
tree lock.

Call this item 2.

> + Allocations (fallocate() or kvm_gmem_get_pfn()) and truncations,
>   because conversions (for now) involves removing a folio from the
>   filemap, restructuring, then restoring to the filemap, and
>     + Allocations should reuse a folio that was already in the filemap
>     + Truncations remove a folio, and should not skip removal of a folio
>       because it was taken out just for conversion

I don't think this is required...

> + memory failure handling, where we don't remove folios from the
>   filemap, but we might restructure, to split huge folios to just unmap
>   pages with failed memory

... nor this.  These don't change the sharability maple tree.

These operations don't change or need to know the shareability AFAICT.

Merging a folio would have to check the maple tree to ensure we don't
merge incompatible folios...  But that is a read check and should be easy
to add.

> I think essentially because conversion involves restructuring, and
> restructuring involves filemap operations and other filemap operations
> have to wait, conversion also takes the filemap_invalidate_lock() that
> filemap operations use.

I could be convinced otherwise but I'm thinking the overhead of another
lock for the sake of simplicity is a good trade off.  I don't think any of
the conversions are a fast path operation are they?

Ira

[snip]