linux-kernel - Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <diqzy0ptspzl.fsf@google.com>
Date: Thu, 02 Oct 2025 07:49:02 +0000
From: Ackerley Tng <ackerleytng@...gle.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: kvm@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Yan Zhao <yan.y.zhao@...el.com>, Fuad Tabba <tabba@...gle.com>, 
	Binbin Wu <binbin.wu@...ux.intel.com>, Michael Roth <michael.roth@....com>, 
	Ira Weiny <ira.weiny@...el.com>, Rick P Edgecombe <rick.p.edgecombe@...el.com>, 
	Vishal Annapurve <vannapurve@...gle.com>, David Hildenbrand <david@...hat.com>, 
	Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use
 shareability to guard faulting

Sean Christopherson <seanjc@...gle.com> writes:

> On Wed, Oct 01, 2025, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@...gle.com> writes:
>> >> I'd prefer not to have the module param choose between the use of
>> >> mem_attr_array and guest_memfd conversion in case we need both
>> >> mem_attr_array to support other stuff in future while supporting
>> >> conversions.
>> >
>> > Luckily, we don't actually need to make a decision on this, because PRIVATE is
>> > the only attribute that exists.  Which is partly why I want to go with a module
>> > param.  We can make the behavior very definitive without significant risk of
>> > causing ABI hell.
>> >
>> 
>> Then maybe I'm misunderstanding the static_call() thing you were
>> describing. Is it like, at KVM module initialization time,
>> 
>>     if module_param == disable_tracking:
>>         .__kvm_get_memory_attributes = read_attributes_from_guest_memfd
>>     else
>>         .__kvm_get_memory_attributes = read_attributes_from_mem_attr_array
>> 
>> With that, I can't have both CoCo private/shared state tracked in
>> guest_memfd and RWX (as an example, could be any future attribute)
>> tracked in mem_attr_array on the same VM.
>
> More or less.
>

Hm okay. So introducing the module param will only allow the use of one
of the following?

+ KVM_SET_MEMORY_ATTRIBUTES (vm ioctl)
+ KVM_SET_MEMORY_ATTRIBUTES2 (guest_memfd ioctl)

Then I guess using a module param which is a weaker userspace contract
allows us to later enable both vm and guest_memfd ioctl if the need
arises?

>> > It's entirely possible I'm completely wrong and we'll end up with per-VM RWX
>> > protections and no other per-gmem memory attributes, but as above, unwinding or
>> > adjusting the module param will be a drop in the bucket compared to the effort
>> > needed to add whatever support comes along.
>> >
>> 
>> Is a module param a weaker userspace contract such that the definition
>> for module params can be more flexibly adjusted?
>
> Yes, much weaker.
>

I have a new tool in my toolbox now :)

>> >> > The kvm_memory_attributes structure is compatible, all that's needed AFAICT is a
>> >> > union to clarify it's a pgoff instead of an address when used for guest_memfd.
>> >> >
>> >> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> >> > index 52f6000ab020..e0d8255ac8d2 100644
>> >> > --- a/include/uapi/linux/kvm.h
>> >> > +++ b/include/uapi/linux/kvm.h
>> >> > @@ -1590,7 +1590,10 @@ struct kvm_stats_desc {
>> >> >  #define KVM_SET_MEMORY_ATTRIBUTES              _IOW(KVMIO,  0xd2, struct kvm_memory_attributes)
>> >> >  
>> >> >  struct kvm_memory_attributes {
>> >> > -       __u64 address;
>> >> > +       union {
>> >> > +               __u64 address;
>> >> > +               __u64 offset;
>> >> > +       };
>> >> >         __u64 size;
>> >> >         __u64 attributes;
>> >> >         __u64 flags;
>> >> >
>> >> 
>> >> struct kvm_memory_attributes doesn't have room for reporting the offset
>> >> at which conversion failed (error_offset in the new struct). How do we
>> >> handle this? Do we reuse the flags field, or do we not report
>> >> error_offset?
>> >
>> > Write back at address/offset
>> 
>> I think it might be surprising to the userspace program, when it wants
>> to check the offset that it had requested and found that it changed due
>> to an error, or upon decoding the error, be unable to find the original
>> offset it had requested.
>
> It's a somewhat common pattern in the kernel.  Updating the offset+size is most
> often used with -EAGAIN to say "got this far, try the syscall again from this
> point".
>

TIL, thanks!

>> Like,
>> 
>>     printf("Error during conversion from offset=%lx with size=%lx, at
>>            error_offset=%lx", attr.offset, attr.size, attr.error_offset)
>> 
>> would be nicer than 
>> 
>>     original_offset = attr.offset
>>     printf("Error during conversion from offset=%lx with size=%lx, at
>>            error_offset=%lx", original_offset, attr.size, attr.error_offset)
>>            
>> > (and update size too, which I probably forgot to do).
>> 
>> Why does size need to be updated? I think u64 for size is great, and
>> size is better than nr_pages since nr_pages differs on different
>> platforms based on PAGE_SIZE and also nr_pages introduces the question
>> of "was it hugetlb, or a native page size?".
>
> I meant update the number of bytes remaining when updating the offset so that
> userspace can redo the ioctl without having to update parameters.
>
>> > Ugh, but it's defined _IOW.  I forget if that matters in practice (IIRC, it's not
>> > enforced anywhere, i.e. purely informational for userspace).
>> >
>> 
>> I didn't notice this IOW vs IORW part, but if it starts getting
>> enforced/specified [1] or auto-documented we'd be in trouble.
>
> IOW vs IORW is alread specified in the ioctl.  More below.
>
>> At this point, maybe it's better to just have a different ioctl number
>> and struct definition. I feel that it would be easier for a user to
>> associate/separate
>
> Amusingly, we'd only need a different name along with the IORW thing.  A full
> ioctl number is comproised of the "directory" (KVM), the number, the size of the
> payload, and how the payload is accessed.
>
> #define _IOC(dir,type,nr,size) \
> 	(((dir)  << _IOC_DIRSHIFT) | \
> 	 ((type) << _IOC_TYPESHIFT) | \
> 	 ((nr)   << _IOC_NRSHIFT) | \
> 	 ((size) << _IOC_SIZESHIFT))
>
> So this:
>
>   #define KVM_SET_MEMORY_ATTRIBUTES	_IOW(KVMIO,  0xd2, struct kvm_memory_attributes)
>   #define KVM_SET_MEMORY_ATTRIBUTES2	_IOWR(KVMIO, 0xd2, struct kvm_memory_attributes2)
>
> actually generates two different values, and so is two different ioctls from a
> code perspective.
>
> The "size" of the payload is nice to have as it allows userspace to assert that
> it's passing the right structure, e.g. this static assert from KVM selftests:
>
> #define kvm_do_ioctl(fd, cmd, arg)						\
> ({										\
> 	kvm_static_assert(!_IOC_SIZE(cmd) || sizeof(*arg) == _IOC_SIZE(cmd));	\
> 	ioctl(fd, cmd, arg);							\
> })
>
>> + KVM_SET_MEMORY_ATTRIBUTES
>>     + Is VM ioctl
>>     + Is a write-only ioctl
>>     + Is for setting memory attributes at a VM level
>>     + Use struct kvm_memory_attributes for this
>> + KVM_GUEST_MEMFD_SET_MEMORY_ATTRIBUTES (name TBD)
>>     + Is guest_memfd ioctl
>>     + Is a read/write ioctl
>>     + Is for setting memory attributes only for this guest_memfd
>>     + Use struct guest_memfd_memory_attributes for this
>>     + Also decode errors from this struct
>
>       + Has extra padding for future expansion (because why not)
>
> If we really truly need a new ioctl, I'd probably prefer KVM_SET_MEMORY_ATTRIBUTES2.
> Yeah, it's silly, but I don't think baking GUEST_MEMFD into the names buys us
> anything.  Then we can use KVM_SET_MEMORY_ATTRIBUTES2 on a VM if the need ever
> arises.
>

I'm for having a new ioctl number and new struct, which are you leaning
towards?

As for the naming, I think it's confusing to have something similar, and
Ira mentioned it being confusing in the other email too. At the same
time, I accept that it's useful if the same struct were to be used for a
new iteration of the KVM_SET_MEMORY_ATTRIBUTES VM ioctl in future. No
strong preference either way on naming.


Trying to understand the difference between unwind on failure vs
all-or-nothing:

> Alternative #1 is to try and unwind on failure, but that gets complex, and it
> simply can't be done for some CoCo VMs.  E.g. a private=>shared conversion for
> TDX is descrutive.
>

Unwind on failure is:

1. Store current state
2. Convert
3. Restore current state on conversion failure

> Alternative #2 is to make the updates atomic and all-or-nothing, which is what
> we did for per-VM attributes.  That's doable, but it'd either be much more
> complex than telling userspace to retry, or we'd have to lose the maple tree
> optimizations (which is effectively what we did for per-VM attributes).
>

All-or-nothing:

1. Do everything to make sure conversion doesn't fail, bail early if it
   fails
2. Convert (always successful)

Is that it?


Zapping private pages from the stage 2 page tables for TDX can't be
recovered without help from the guest (I think that's what you're
talking about too), although technically I think this zapping step could
be delayed right till the end.

Maple tree allocations for conversion could fail, and allocations are a
bit more complicated since we try to compact ranges with the same
shared/private status into one same maple tree node. Still technically
possible, maybe by updating a copy of the maple tree first, then
swapping the current maple tree out atomically.

With HugeTLB, undoing HVO needs pages to be allocated, will need more
digging into the details to determine if preallocation could work.

I'd still prefer having the option to return an error so that we don't
paint ourselves into a corner.

>> [1] https://lore.kernel.org/all/20250825181434.3340805-1-sashal@kernel.org/
>> 
>> 
>> [...snip...]
>>