linux-kernel - Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aN8P87AXlxlEDdpP@google.com>
Date: Thu, 2 Oct 2025 16:51:15 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Ackerley Tng <ackerleytng@...gle.com>
Cc: kvm@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Yan Zhao <yan.y.zhao@...el.com>, Fuad Tabba <tabba@...gle.com>, 
	Binbin Wu <binbin.wu@...ux.intel.com>, Michael Roth <michael.roth@....com>, 
	Ira Weiny <ira.weiny@...el.com>, Rick P Edgecombe <rick.p.edgecombe@...el.com>, 
	Vishal Annapurve <vannapurve@...gle.com>, David Hildenbrand <david@...hat.com>, 
	Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use
 shareability to guard faulting

On Thu, Oct 02, 2025, Ackerley Tng wrote:
> Sean Christopherson <seanjc@...gle.com> writes:
> 
> > On Wed, Oct 01, 2025, Ackerley Tng wrote:
> >> Sean Christopherson <seanjc@...gle.com> writes:
> >> >> I'd prefer not to have the module param choose between the use of
> >> >> mem_attr_array and guest_memfd conversion in case we need both
> >> >> mem_attr_array to support other stuff in future while supporting
> >> >> conversions.
> >> >
> >> > Luckily, we don't actually need to make a decision on this, because PRIVATE is
> >> > the only attribute that exists.  Which is partly why I want to go with a module
> >> > param.  We can make the behavior very definitive without significant risk of
> >> > causing ABI hell.
> >> >
> >> 
> >> Then maybe I'm misunderstanding the static_call() thing you were
> >> describing. Is it like, at KVM module initialization time,
> >> 
> >>     if module_param == disable_tracking:
> >>         .__kvm_get_memory_attributes = read_attributes_from_guest_memfd
> >>     else
> >>         .__kvm_get_memory_attributes = read_attributes_from_mem_attr_array
> >> 
> >> With that, I can't have both CoCo private/shared state tracked in
> >> guest_memfd and RWX (as an example, could be any future attribute)
> >> tracked in mem_attr_array on the same VM.
> >
> > More or less.
> >
> 
> Hm okay. So introducing the module param will only allow the use of one
> of the following?
> 
> + KVM_SET_MEMORY_ATTRIBUTES (vm ioctl)
> + KVM_SET_MEMORY_ATTRIBUTES2 (guest_memfd ioctl)
> 
> Then I guess using a module param which is a weaker userspace contract
> allows us to later enable both vm and guest_memfd ioctl if the need
> arises?

In theory.  More importantly from my perspective, making the knob global instead
of per-VM simplifies the implementation.

> >> + KVM_SET_MEMORY_ATTRIBUTES
> >>     + Is VM ioctl
> >>     + Is a write-only ioctl
> >>     + Is for setting memory attributes at a VM level
> >>     + Use struct kvm_memory_attributes for this
> >> + KVM_GUEST_MEMFD_SET_MEMORY_ATTRIBUTES (name TBD)
> >>     + Is guest_memfd ioctl
> >>     + Is a read/write ioctl
> >>     + Is for setting memory attributes only for this guest_memfd
> >>     + Use struct guest_memfd_memory_attributes for this
> >>     + Also decode errors from this struct
> >
> >       + Has extra padding for future expansion (because why not)
> >
> > If we really truly need a new ioctl, I'd probably prefer KVM_SET_MEMORY_ATTRIBUTES2.
> > Yeah, it's silly, but I don't think baking GUEST_MEMFD into the names buys us
> > anything.  Then we can use KVM_SET_MEMORY_ATTRIBUTES2 on a VM if the need ever
> > arises.
> >
> 
> I'm for having a new ioctl number and new struct, which are you leaning
> towards?
> 
> As for the naming, I think it's confusing to have something similar, and
> Ira mentioned it being confusing in the other email too.

It might sound confusing, but I don't think the _code_ will be confusing.  If it's
a module param, it's guaranteed to be global for any VMM.  Then the conversion code
can be a simple redirect, e.g. I would expect/hope the QEMU change to be something
like:

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index f89568bfa3..5253cf7275 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1421,6 +1421,10 @@ static int kvm_set_memory_attributes(hwaddr start, uint64_t size, uint64_t attr)
     int r;
 
     assert((attr & kvm_supported_memory_attributes) == attr);
+
+    if (kvm_use_gmem_attributes)
+        return kvm_get_gmem_fd(start, size, attr);
+
     attrs.attributes = attr;
     attrs.address = start;
     attrs.size = size;

> At the same time, I accept that it's useful if the same struct were to be
> used for a new iteration of the KVM_SET_MEMORY_ATTRIBUTES VM ioctl in future.
> No strong preference either way on naming.
> 
> Trying to understand the difference between unwind on failure vs
> all-or-nothing:
> 
> > Alternative #1 is to try and unwind on failure, but that gets complex, and it
> > simply can't be done for some CoCo VMs.  E.g. a private=>shared conversion for
> > TDX is descrutive.
> >
> 
> Unwind on failure is:
> 
> 1. Store current state
> 2. Convert
> 3. Restore current state on conversion failure

Not quite, the above is missing key steps that cause problems: invalidation and
memory allocation.

 1. Zap stage-1 mappings
 2. Check if range can be converted
 3. Zap stage-2 mappings
 4. Store state for target range

For TDX, #3 is the point of no return.  That means #4 must not fail, because at
the very least invididual page conversions need to be atomic.

> > Alternative #2 is to make the updates atomic and all-or-nothing, which is what
> > we did for per-VM attributes.  That's doable, but it'd either be much more
> > complex than telling userspace to retry, or we'd have to lose the maple tree
> > optimizations (which is effectively what we did for per-VM attributes).
> >
> 
> All-or-nothing:
> 
> 1. Do everything to make sure conversion doesn't fail, bail early if it
>    fails
> 2. Convert (always successful)
> 
> Is that it?

No, because #1 is non-trivial.  The wrinkle is that, without doing an initial
pass through mtree, it's impossible to know how many new entries will be needed
(which you already know).  It's a solvable problem,  e.g. in this RFC, it's kinda
sorta handled by allocating temporary structures to track the mtree metadata, but
that's the type of complexity I want to avoid.

> Zapping private pages from the stage 2 page tables for TDX can't be
> recovered without help from the guest (I think that's what you're
> talking about too), 

Ya.

> although technically I think this zapping step could be delayed right till the end.

Hmm.  If we fully committed to blocking all "gets" via the filemap lock, then
yes, I think it could be delayed until the end?  Tempting.   But as you note
below, I'd prefer to give ourselves an out from an ABI perspective.

> Maple tree allocations for conversion could fail, 

Ya, in my local version I handle that by pre-allocating.

	mas_for_each(&mas, entry, end - 1) {
		MA_STATE(m2, &gi->attributes, 0, 0);

		if (attrs->attributes == xa_to_value(entry))
			continue;

		r = kvm_gmem_mas_preallocate(&m2, attrs->attributes, start, end);
		if (r) {
			*err_index = m2.index;
			goto out;
		}

		unmap_mapping_pages(mapping, start, nr_pages, false);

		if (!kvm_gmem_is_safe_for_conversion(inode, start, nr_pages, err_index)) {
			mas_destroy(&m2);
			r = -EAGAIN;
			goto out;
		}

		kvm_gmem_invalidate_begin(inode, start, end);

		mas_store_prealloc(&m2, xa_mk_value(attrs->attributes));

		kvm_gmem_invalidate_end(inode, start, end);
	}

> and allocations are a bit more complicated since we try to compact ranges
> with the same shared/private status into one same maple tree node. Still
> technically possible, maybe by updating a copy of the maple tree first, then
> swapping the current maple tree out atomically.

Heh, more complexity I'd prefer to avoid.

> With HugeTLB, undoing HVO needs pages to be allocated, will need more
> digging into the details to determine if preallocation could work.

It has to work, otherwise we're hosed.  Though it should be noted that "preallocate"
here just means before kvm_gmem_invalidate_begin().  Everything up to that point
can fail.

The other option in all of this is to add an API to _block_ stage-2 mappings, in
TDX terminology.  In TDX, BLOCK marks leaf S-EPT entries !PRESENT, but preserves
all the metadata.  So if we're too scared to do an after-the-fact invalidation,
and we need to support failure after kvm_gmem_invalidate_begin(), we could rework
the TDX backend to support e.g. kvm_gmem_invalidate_abort().  On begin(), KVM 
would BLOCK mappings; on abort(), restore; on end() fully remove.

> I'd still prefer having the option to return an error so that we don't
> paint ourselves into a corner.

Yep, my thoughts exactly.