[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aN3KfrWERpXsj3ld@google.com>
Date: Wed, 1 Oct 2025 17:42:38 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Ackerley Tng <ackerleytng@...gle.com>
Cc: kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
Yan Zhao <yan.y.zhao@...el.com>, Fuad Tabba <tabba@...gle.com>,
Binbin Wu <binbin.wu@...ux.intel.com>, Michael Roth <michael.roth@....com>,
Ira Weiny <ira.weiny@...el.com>, Rick P Edgecombe <rick.p.edgecombe@...el.com>,
Vishal Annapurve <vannapurve@...gle.com>, David Hildenbrand <david@...hat.com>,
Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use
shareability to guard faulting
On Wed, Oct 01, 2025, Ackerley Tng wrote:
> Sean Christopherson <seanjc@...gle.com> writes:
> >> I'd prefer not to have the module param choose between the use of
> >> mem_attr_array and guest_memfd conversion in case we need both
> >> mem_attr_array to support other stuff in future while supporting
> >> conversions.
> >
> > Luckily, we don't actually need to make a decision on this, because PRIVATE is
> > the only attribute that exists. Which is partly why I want to go with a module
> > param. We can make the behavior very definitive without significant risk of
> > causing ABI hell.
> >
>
> Then maybe I'm misunderstanding the static_call() thing you were
> describing. Is it like, at KVM module initialization time,
>
> if module_param == disable_tracking:
> .__kvm_get_memory_attributes = read_attributes_from_guest_memfd
> else
> .__kvm_get_memory_attributes = read_attributes_from_mem_attr_array
>
> With that, I can't have both CoCo private/shared state tracked in
> guest_memfd and RWX (as an example, could be any future attribute)
> tracked in mem_attr_array on the same VM.
More or less.
> > It's entirely possible I'm completely wrong and we'll end up with per-VM RWX
> > protections and no other per-gmem memory attributes, but as above, unwinding or
> > adjusting the module param will be a drop in the bucket compared to the effort
> > needed to add whatever support comes along.
> >
>
> Is a module param a weaker userspace contract such that the definition
> for module params can be more flexibly adjusted?
Yes, much weaker.
> >> > The kvm_memory_attributes structure is compatible, all that's needed AFAICT is a
> >> > union to clarify it's a pgoff instead of an address when used for guest_memfd.
> >> >
> >> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> > index 52f6000ab020..e0d8255ac8d2 100644
> >> > --- a/include/uapi/linux/kvm.h
> >> > +++ b/include/uapi/linux/kvm.h
> >> > @@ -1590,7 +1590,10 @@ struct kvm_stats_desc {
> >> > #define KVM_SET_MEMORY_ATTRIBUTES _IOW(KVMIO, 0xd2, struct kvm_memory_attributes)
> >> >
> >> > struct kvm_memory_attributes {
> >> > - __u64 address;
> >> > + union {
> >> > + __u64 address;
> >> > + __u64 offset;
> >> > + };
> >> > __u64 size;
> >> > __u64 attributes;
> >> > __u64 flags;
> >> >
> >>
> >> struct kvm_memory_attributes doesn't have room for reporting the offset
> >> at which conversion failed (error_offset in the new struct). How do we
> >> handle this? Do we reuse the flags field, or do we not report
> >> error_offset?
> >
> > Write back at address/offset
>
> I think it might be surprising to the userspace program, when it wants
> to check the offset that it had requested and found that it changed due
> to an error, or upon decoding the error, be unable to find the original
> offset it had requested.
It's a somewhat common pattern in the kernel. Updating the offset+size is most
often used with -EAGAIN to say "got this far, try the syscall again from this
point".
> Like,
>
> printf("Error during conversion from offset=%lx with size=%lx, at
> error_offset=%lx", attr.offset, attr.size, attr.error_offset)
>
> would be nicer than
>
> original_offset = attr.offset
> printf("Error during conversion from offset=%lx with size=%lx, at
> error_offset=%lx", original_offset, attr.size, attr.error_offset)
>
> > (and update size too, which I probably forgot to do).
>
> Why does size need to be updated? I think u64 for size is great, and
> size is better than nr_pages since nr_pages differs on different
> platforms based on PAGE_SIZE and also nr_pages introduces the question
> of "was it hugetlb, or a native page size?".
I meant update the number of bytes remaining when updating the offset so that
userspace can redo the ioctl without having to update parameters.
> > Ugh, but it's defined _IOW. I forget if that matters in practice (IIRC, it's not
> > enforced anywhere, i.e. purely informational for userspace).
> >
>
> I didn't notice this IOW vs IORW part, but if it starts getting
> enforced/specified [1] or auto-documented we'd be in trouble.
IOW vs IORW is alread specified in the ioctl. More below.
> At this point, maybe it's better to just have a different ioctl number
> and struct definition. I feel that it would be easier for a user to
> associate/separate
Amusingly, we'd only need a different name along with the IORW thing. A full
ioctl number is comproised of the "directory" (KVM), the number, the size of the
payload, and how the payload is accessed.
#define _IOC(dir,type,nr,size) \
(((dir) << _IOC_DIRSHIFT) | \
((type) << _IOC_TYPESHIFT) | \
((nr) << _IOC_NRSHIFT) | \
((size) << _IOC_SIZESHIFT))
So this:
#define KVM_SET_MEMORY_ATTRIBUTES _IOW(KVMIO, 0xd2, struct kvm_memory_attributes)
#define KVM_SET_MEMORY_ATTRIBUTES2 _IOWR(KVMIO, 0xd2, struct kvm_memory_attributes2)
actually generates two different values, and so is two different ioctls from a
code perspective.
The "size" of the payload is nice to have as it allows userspace to assert that
it's passing the right structure, e.g. this static assert from KVM selftests:
#define kvm_do_ioctl(fd, cmd, arg) \
({ \
kvm_static_assert(!_IOC_SIZE(cmd) || sizeof(*arg) == _IOC_SIZE(cmd)); \
ioctl(fd, cmd, arg); \
})
> + KVM_SET_MEMORY_ATTRIBUTES
> + Is VM ioctl
> + Is a write-only ioctl
> + Is for setting memory attributes at a VM level
> + Use struct kvm_memory_attributes for this
> + KVM_GUEST_MEMFD_SET_MEMORY_ATTRIBUTES (name TBD)
> + Is guest_memfd ioctl
> + Is a read/write ioctl
> + Is for setting memory attributes only for this guest_memfd
> + Use struct guest_memfd_memory_attributes for this
> + Also decode errors from this struct
+ Has extra padding for future expansion (because why not)
If we really truly need a new ioctl, I'd probably prefer KVM_SET_MEMORY_ATTRIBUTES2.
Yeah, it's silly, but I don't think baking GUEST_MEMFD into the names buys us
anything. Then we can use KVM_SET_MEMORY_ATTRIBUTES2 on a VM if the need ever
arises.
Alternative #1 is to try and unwind on failure, but that gets complex, and it
simply can't be done for some CoCo VMs. E.g. a private=>shared conversion for
TDX is descrutive.
Alternative #2 is to make the updates atomic and all-or-nothing, which is what
we did for per-VM attributes. That's doable, but it'd either be much more
complex than telling userspace to retry, or we'd have to lose the maple tree
optimizations (which is effectively what we did for per-VM attributes).
> [1] https://lore.kernel.org/all/20250825181434.3340805-1-sashal@kernel.org/
>
> >> >> static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
> >> >> pgoff_t index, struct folio *folio)
> >> >> {
> >> >> @@ -333,7 +404,7 @@ static vm_fault_t kvm_gmem_fault_shared(struct vm_fault *vmf)
> >> >>
> >> >> filemap_invalidate_lock_shared(inode->i_mapping);
> >> >>
> >> >> - folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> >> >> + folio = kvm_gmem_get_shared_folio(inode, vmf->pgoff);
> >> >
> >> > I am fairly certain there's a TOCTOU bug here. AFAICT, nothing prevents the
> >> > underlying memory from being converted from shared=>private after checking that
> >> > the page is SHARED.
> >> >
> >>
> >> Conversions take the filemap_invalidate_lock() too, along with
> >> allocations, truncations.
> >>
> >> Because the filemap_invalidate_lock() might be reused for other
> >> fs-specific operations, I didn't do the mt_set_external_lock() thing to
> >> lock at a low level to avoid nested locking or special maple tree code
> >> to avoid taking the lock on other paths.
> >
> > mt_set_external_lock() is a nop. It exists purely for lockdep assertions. Per
> > the comment for MT_FLAGS_LOCK_EXTERN, "mt_lock is not used", LOCK_EXTERN simply
> > tells maple tree to not use/take mt_lock. I.e. it doesn't say "take this lock
> > instead", it says "I'll handle locking".
>
> Thanks for pointing this out!
>
> Conversions (and others) taking the filemap_invalidate_lock() probably
> fixes the TOCTOU bug, right?
Yes, grabbing a reference to the folio under lock and thus elevating its refcount
should prevent conversions to private from that point forward, until the PTE is
zapped and the folio is released:
filemap_invalidate_lock_shared(inode->i_mapping);
if (kvm_gmem_is_shared_mem(inode, vmf->pgoff))
folio = kvm_gmem_get_folio(inode, vmf->pgoff);
else
folio = ERR_PTR(-EACCES);
filemap_invalidate_unlock_shared(inode->i_mapping);
Powered by blists - more mailing lists