linux-kernel - Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aN3KfrWERpXsj3ld@google.com>
Date: Wed, 1 Oct 2025 17:42:38 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Ackerley Tng <ackerleytng@...gle.com>
Cc: kvm@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Yan Zhao <yan.y.zhao@...el.com>, Fuad Tabba <tabba@...gle.com>, 
	Binbin Wu <binbin.wu@...ux.intel.com>, Michael Roth <michael.roth@....com>, 
	Ira Weiny <ira.weiny@...el.com>, Rick P Edgecombe <rick.p.edgecombe@...el.com>, 
	Vishal Annapurve <vannapurve@...gle.com>, David Hildenbrand <david@...hat.com>, 
	Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use
 shareability to guard faulting

On Wed, Oct 01, 2025, Ackerley Tng wrote:
> Sean Christopherson <seanjc@...gle.com> writes:
> >> I'd prefer not to have the module param choose between the use of
> >> mem_attr_array and guest_memfd conversion in case we need both
> >> mem_attr_array to support other stuff in future while supporting
> >> conversions.
> >
> > Luckily, we don't actually need to make a decision on this, because PRIVATE is
> > the only attribute that exists.  Which is partly why I want to go with a module
> > param.  We can make the behavior very definitive without significant risk of
> > causing ABI hell.
> >
> 
> Then maybe I'm misunderstanding the static_call() thing you were
> describing. Is it like, at KVM module initialization time,
> 
>     if module_param == disable_tracking:
>         .__kvm_get_memory_attributes = read_attributes_from_guest_memfd
>     else
>         .__kvm_get_memory_attributes = read_attributes_from_mem_attr_array
> 
> With that, I can't have both CoCo private/shared state tracked in
> guest_memfd and RWX (as an example, could be any future attribute)
> tracked in mem_attr_array on the same VM.

More or less.

> > It's entirely possible I'm completely wrong and we'll end up with per-VM RWX
> > protections and no other per-gmem memory attributes, but as above, unwinding or
> > adjusting the module param will be a drop in the bucket compared to the effort
> > needed to add whatever support comes along.
> >
> 
> Is a module param a weaker userspace contract such that the definition
> for module params can be more flexibly adjusted?

Yes, much weaker.

> >> > The kvm_memory_attributes structure is compatible, all that's needed AFAICT is a
> >> > union to clarify it's a pgoff instead of an address when used for guest_memfd.
> >> >
> >> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> > index 52f6000ab020..e0d8255ac8d2 100644
> >> > --- a/include/uapi/linux/kvm.h
> >> > +++ b/include/uapi/linux/kvm.h
> >> > @@ -1590,7 +1590,10 @@ struct kvm_stats_desc {
> >> >  #define KVM_SET_MEMORY_ATTRIBUTES              _IOW(KVMIO,  0xd2, struct kvm_memory_attributes)
> >> >  
> >> >  struct kvm_memory_attributes {
> >> > -       __u64 address;
> >> > +       union {
> >> > +               __u64 address;
> >> > +               __u64 offset;
> >> > +       };
> >> >         __u64 size;
> >> >         __u64 attributes;
> >> >         __u64 flags;
> >> >
> >> 
> >> struct kvm_memory_attributes doesn't have room for reporting the offset
> >> at which conversion failed (error_offset in the new struct). How do we
> >> handle this? Do we reuse the flags field, or do we not report
> >> error_offset?
> >
> > Write back at address/offset
> 
> I think it might be surprising to the userspace program, when it wants
> to check the offset that it had requested and found that it changed due
> to an error, or upon decoding the error, be unable to find the original
> offset it had requested.

It's a somewhat common pattern in the kernel.  Updating the offset+size is most
often used with -EAGAIN to say "got this far, try the syscall again from this
point".

> Like,
> 
>     printf("Error during conversion from offset=%lx with size=%lx, at
>            error_offset=%lx", attr.offset, attr.size, attr.error_offset)
> 
> would be nicer than 
> 
>     original_offset = attr.offset
>     printf("Error during conversion from offset=%lx with size=%lx, at
>            error_offset=%lx", original_offset, attr.size, attr.error_offset)
>            
> > (and update size too, which I probably forgot to do).
> 
> Why does size need to be updated? I think u64 for size is great, and
> size is better than nr_pages since nr_pages differs on different
> platforms based on PAGE_SIZE and also nr_pages introduces the question
> of "was it hugetlb, or a native page size?".

I meant update the number of bytes remaining when updating the offset so that
userspace can redo the ioctl without having to update parameters.

> > Ugh, but it's defined _IOW.  I forget if that matters in practice (IIRC, it's not
> > enforced anywhere, i.e. purely informational for userspace).
> >
> 
> I didn't notice this IOW vs IORW part, but if it starts getting
> enforced/specified [1] or auto-documented we'd be in trouble.

IOW vs IORW is alread specified in the ioctl.  More below.

> At this point, maybe it's better to just have a different ioctl number
> and struct definition. I feel that it would be easier for a user to
> associate/separate

Amusingly, we'd only need a different name along with the IORW thing.  A full
ioctl number is comproised of the "directory" (KVM), the number, the size of the
payload, and how the payload is accessed.

#define _IOC(dir,type,nr,size) \
	(((dir)  << _IOC_DIRSHIFT) | \
	 ((type) << _IOC_TYPESHIFT) | \
	 ((nr)   << _IOC_NRSHIFT) | \
	 ((size) << _IOC_SIZESHIFT))

So this:

  #define KVM_SET_MEMORY_ATTRIBUTES	_IOW(KVMIO,  0xd2, struct kvm_memory_attributes)
  #define KVM_SET_MEMORY_ATTRIBUTES2	_IOWR(KVMIO, 0xd2, struct kvm_memory_attributes2)

actually generates two different values, and so is two different ioctls from a
code perspective.

The "size" of the payload is nice to have as it allows userspace to assert that
it's passing the right structure, e.g. this static assert from KVM selftests:

#define kvm_do_ioctl(fd, cmd, arg)						\
({										\
	kvm_static_assert(!_IOC_SIZE(cmd) || sizeof(*arg) == _IOC_SIZE(cmd));	\
	ioctl(fd, cmd, arg);							\
})

> + KVM_SET_MEMORY_ATTRIBUTES
>     + Is VM ioctl
>     + Is a write-only ioctl
>     + Is for setting memory attributes at a VM level
>     + Use struct kvm_memory_attributes for this
> + KVM_GUEST_MEMFD_SET_MEMORY_ATTRIBUTES (name TBD)
>     + Is guest_memfd ioctl
>     + Is a read/write ioctl
>     + Is for setting memory attributes only for this guest_memfd
>     + Use struct guest_memfd_memory_attributes for this
>     + Also decode errors from this struct

      + Has extra padding for future expansion (because why not)

If we really truly need a new ioctl, I'd probably prefer KVM_SET_MEMORY_ATTRIBUTES2.
Yeah, it's silly, but I don't think baking GUEST_MEMFD into the names buys us
anything.  Then we can use KVM_SET_MEMORY_ATTRIBUTES2 on a VM if the need ever
arises.

Alternative #1 is to try and unwind on failure, but that gets complex, and it
simply can't be done for some CoCo VMs.  E.g. a private=>shared conversion for
TDX is descrutive.

Alternative #2 is to make the updates atomic and all-or-nothing, which is what
we did for per-VM attributes.  That's doable, but it'd either be much more
complex than telling userspace to retry, or we'd have to lose the maple tree
optimizations (which is effectively what we did for per-VM attributes).

> [1] https://lore.kernel.org/all/20250825181434.3340805-1-sashal@kernel.org/
> 
> >> >>  static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
> >> >>  				    pgoff_t index, struct folio *folio)
> >> >>  {
> >> >> @@ -333,7 +404,7 @@ static vm_fault_t kvm_gmem_fault_shared(struct vm_fault *vmf)
> >> >>  
> >> >>  	filemap_invalidate_lock_shared(inode->i_mapping);
> >> >>  
> >> >> -	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> >> >> +	folio = kvm_gmem_get_shared_folio(inode, vmf->pgoff);
> >> >
> >> > I am fairly certain there's a TOCTOU bug here.  AFAICT, nothing prevents the
> >> > underlying memory from being converted from shared=>private after checking that
> >> > the page is SHARED.
> >> >
> >> 
> >> Conversions take the filemap_invalidate_lock() too, along with
> >> allocations, truncations.
> >> 
> >> Because the filemap_invalidate_lock() might be reused for other
> >> fs-specific operations, I didn't do the mt_set_external_lock() thing to
> >> lock at a low level to avoid nested locking or special maple tree code
> >> to avoid taking the lock on other paths.
> >
> > mt_set_external_lock() is a nop.  It exists purely for lockdep assertions.  Per
> > the comment for MT_FLAGS_LOCK_EXTERN, "mt_lock is not used", LOCK_EXTERN simply
> > tells maple tree to not use/take mt_lock.   I.e. it doesn't say "take this lock
> > instead", it says "I'll handle locking".
> 
> Thanks for pointing this out!
> 
> Conversions (and others) taking the filemap_invalidate_lock() probably
> fixes the TOCTOU bug, right?

Yes, grabbing a reference to the folio under lock and thus elevating its refcount
should prevent conversions to private from that point forward, until the PTE is
zapped and the folio is released:

	filemap_invalidate_lock_shared(inode->i_mapping);
	if (kvm_gmem_is_shared_mem(inode, vmf->pgoff))
		folio = kvm_gmem_get_folio(inode, vmf->pgoff);
	else
		folio = ERR_PTR(-EACCES);
	filemap_invalidate_unlock_shared(inode->i_mapping);