Message-ID: <CAGtprH_6DYk8POPy+sLc3RL0-5gcrTdPNcDWFTssOK5_U4B3Nw@mail.gmail.com>
Date: Thu, 14 Aug 2025 09:26:43 -0700
From: Vishal Annapurve <vannapurve@...gle.com>
To: Steven Price <steven.price@....com>
Cc: kvm@...r.kernel.org, kvmarm@...ts.linux.dev,
Catalin Marinas <catalin.marinas@....com>, Marc Zyngier <maz@...nel.org>, Will Deacon <will@...nel.org>,
James Morse <james.morse@....com>, Oliver Upton <oliver.upton@...ux.dev>,
Suzuki K Poulose <suzuki.poulose@....com>, Zenghui Yu <yuzenghui@...wei.com>,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
Joey Gouly <joey.gouly@....com>, Alexandru Elisei <alexandru.elisei@....com>,
Christoffer Dall <christoffer.dall@....com>, Fuad Tabba <tabba@...gle.com>, linux-coco@...ts.linux.dev,
Ganapatrao Kulkarni <gankulkarni@...amperecomputing.com>, Gavin Shan <gshan@...hat.com>,
Shanker Donthineni <sdonthineni@...dia.com>, Alper Gun <alpergun@...gle.com>,
"Aneesh Kumar K . V" <aneesh.kumar@...nel.org>, Emi Kisanuki <fj0570is@...itsu.com>
Subject: Re: [PATCH v9 19/43] arm64: RME: Allow populating initial contents
On Wed, Aug 13, 2025 at 2:30 AM Steven Price <steven.price@....com> wrote:
>
> On 01/08/2025 02:56, Vishal Annapurve wrote:
> > On Wed, Jun 11, 2025 at 3:59 AM Steven Price <steven.price@....com> wrote:
> >>
> >> +static int realm_create_protected_data_page(struct realm *realm,
> >> + unsigned long ipa,
> >> + kvm_pfn_t dst_pfn,
> >> + kvm_pfn_t src_pfn,
> >> + unsigned long flags)
> >> +{
> >> + unsigned long rd = virt_to_phys(realm->rd);
> >> + phys_addr_t dst_phys, src_phys;
> >> + bool undelegate_failed = false;
> >> + int ret, offset;
> >> +
> >> + dst_phys = __pfn_to_phys(dst_pfn);
> >> + src_phys = __pfn_to_phys(src_pfn);
> >> +
> >> + for (offset = 0; offset < PAGE_SIZE; offset += RMM_PAGE_SIZE) {
> >> + ret = realm_create_protected_data_granule(realm,
> >> + ipa,
> >> + dst_phys,
> >> + src_phys,
> >> + flags);
> >> + if (ret)
> >> + goto err;
> >> +
> >> + ipa += RMM_PAGE_SIZE;
> >> + dst_phys += RMM_PAGE_SIZE;
> >> + src_phys += RMM_PAGE_SIZE;
> >> + }
> >> +
> >> + return 0;
> >> +
> >> +err:
> >> + if (ret == -EIO) {
> >> + /* current offset needs undelegating */
> >> + if (WARN_ON(rmi_granule_undelegate(dst_phys)))
> >> + undelegate_failed = true;
> >> + }
> >> + while (offset > 0) {
> >> + ipa -= RMM_PAGE_SIZE;
> >> + offset -= RMM_PAGE_SIZE;
> >> + dst_phys -= RMM_PAGE_SIZE;
> >> +
> >> + rmi_data_destroy(rd, ipa, NULL, NULL);
> >> +
> >> + if (WARN_ON(rmi_granule_undelegate(dst_phys)))
> >> + undelegate_failed = true;
> >> + }
> >> +
> >> + if (undelegate_failed) {
> >> + /*
> >> + * A granule could not be undelegated,
> >> + * so the page has to be leaked
> >> + */
> >> + get_page(pfn_to_page(dst_pfn));
> >
> > I would like to point out that the support for in-place conversion
> > with guest_memfd using hugetlb pages [1] is under discussion.
> >
> > As part of the in-place conversion, the policy we are rooting for is
> > to avoid any "refcounts" from KVM on folios supplied by guest_memfd,
> > as in-place conversion works by splitting and merging folios during
> > memory conversion, as per the discussion at LPC [2].
>
> CCA doesn't really support "in-place" conversions (see more detail
> below). But here the issue is that something has gone wrong and the RMM
> is refusing to give us a page back.
I think I overloaded the term "in-place" conversion in this context. I
was talking about supporting "in-place" conversion without data
preservation, i.e. the host will use the same GPA->HPA range mapping
even after conversions, ensuring a single backing for guest memory.
This is achieved by guest_memfd keeping track of private/shared ranges
based on userspace IOCTLs that change the tracking metadata.
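
Roughly, the tracking I have in mind would look like the below -- a
minimal sketch; the struct and function names are hypothetical, not
from the guest_memfd series itself:

#include <linux/xarray.h>
#include <linux/mutex.h>

/* Hypothetical per-file shared/private tracking in guest_memfd */
struct gmem_state {
	struct xarray attributes;	/* pgoff -> private flag */
	struct mutex attr_lock;
};

/* Invoked from a (hypothetical) conversion ioctl on the gmem fd */
static int gmem_convert_range(struct gmem_state *gs, pgoff_t start,
			      pgoff_t nr, bool to_private)
{
	void *entry = to_private ? xa_mk_value(1) : NULL;
	pgoff_t i;

	mutex_lock(&gs->attr_lock);
	for (i = start; i < start + nr; i++) {
		/*
		 * Only the metadata changes; the same folio keeps
		 * backing the range, so the GPA->HPA mapping is
		 * stable across conversions.
		 */
		void *old = xa_store(&gs->attributes, i, entry,
				     GFP_KERNEL);

		if (xa_is_err(old)) {
			mutex_unlock(&gs->attr_lock);
			return xa_err(old);
		}
	}
	mutex_unlock(&gs->attr_lock);
	return 0;
}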
>
> >
> > The best way to avoid further use of this page with huge page support
> > around would be either:
> > 1) Explicitly inform guest_memfd of a particular pfn being in use by
> > KVM without relying on page refcounts or
>
> This might work, but note that the page is unavailable even after user
> space has freed the guest_memfd. So at some point the page needs to be
> marked so that it cannot be reallocated by the kernel. Holding a
> refcount isn't ideal but I haven't come up with a better idea.
>
> Note that this is a "should never happen" situation - the code will have
> WARN()ed already - so this is just a best effort to allow the system to
> limp on.
>
> > 2) Set the page as hwpoisoned. (Needs further discussion)
>
> This certainly sounds like a closer fit - but I'm not very familiar with
> hwpoison so I don't know how easy it would be to integrate with this.
>
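Regarding (2), marking the page at the failure site in
realm_create_protected_data_page() could look something like the below
-- just a sketch, assuming the hwpoison/guest_memfd interaction gets
worked out (and CONFIG_MEMORY_FAILURE):

	if (undelegate_failed) {
		/*
		 * The granule is stuck in the delegated state; poison
		 * the backing page so the host never hands it out
		 * again. (Sketch only: accounting and notification of
		 * guest_memfd are not handled here.)
		 */
		SetPageHWPoison(pfn_to_page(dst_pfn));
	}
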
We had similar discussions around Intel-specific SEPT management, and
the conclusion there was to not hold refcounts and just warn on such
failures [1].

[1] https://lore.kernel.org/kvm/20250807094241.4523-1-yan.y.zhao@intel.com/
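
In code terms, that conclusion maps to roughly the following at the
same failure site (paraphrased sketch, not the literal TDX change):

	if (undelegate_failed)
		/*
		 * Don't pin the page with get_page(); warn and move
		 * on. The memory is lost either way, and a stray
		 * refcount conflicts with guest_memfd splitting and
		 * merging folios.
		 */
		WARN_ONCE(1, "RMM granule at %pa could not be reclaimed\n",
			  &dst_phys);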
> > This page refcounting strategy will have to be revisited depending on
> > which series lands first. That being said, it would be great if ARM
> > could review/verify if the series [1] works for backing CCA VMs with
> > huge pages.
> >
> > [1] https://lore.kernel.org/kvm/cover.1747264138.git.ackerleytng@google.com/
> > [2] https://lpc.events/event/18/contributions/1764/
> >
> >> + }
> >> +
> >> + return -ENXIO;
> >> +}
> >> +
> >> +static int populate_region(struct kvm *kvm,
> >> + phys_addr_t ipa_base,
> >> + phys_addr_t ipa_end,
> >> + unsigned long data_flags)
> >> +{
> >> + struct realm *realm = &kvm->arch.realm;
> >> + struct kvm_memory_slot *memslot;
> >> + gfn_t base_gfn, end_gfn;
> >> + int idx;
> >> + phys_addr_t ipa = ipa_base;
> >> + int ret = 0;
> >> +
> >> + base_gfn = gpa_to_gfn(ipa_base);
> >> + end_gfn = gpa_to_gfn(ipa_end);
> >> +
> >> + idx = srcu_read_lock(&kvm->srcu);
> >> + memslot = gfn_to_memslot(kvm, base_gfn);
> >> + if (!memslot) {
> >> + ret = -EFAULT;
> >> + goto out;
> >> + }
> >> +
> >> + /* We require the region to be contained within a single memslot */
> >> + if (memslot->base_gfn + memslot->npages < end_gfn) {
> >> + ret = -EINVAL;
> >> + goto out;
> >> + }
> >> +
> >> + if (!kvm_slot_can_be_private(memslot)) {
> >> + ret = -EPERM;
> >> + goto out;
> >> + }
> >> +
> >> + while (ipa < ipa_end) {
> >> + struct vm_area_struct *vma;
> >> + unsigned long hva;
> >> + struct page *page;
> >> + bool writeable;
> >> + kvm_pfn_t pfn;
> >> + kvm_pfn_t priv_pfn;
> >> + struct page *gmem_page;
> >> +
> >> + hva = gfn_to_hva_memslot(memslot, gpa_to_gfn(ipa));
> >> + vma = vma_lookup(current->mm, hva);
> >> + if (!vma) {
> >> + ret = -EFAULT;
> >> + break;
> >> + }
> >> +
> >> + pfn = __kvm_faultin_pfn(memslot, gpa_to_gfn(ipa), FOLL_WRITE,
> >> + &writeable, &page);
> >
> > Is this assuming double backing of guest memory ranges? Is this logic
> > trying to simulate a shared fault?
>
> Yes and yes...
>
> > Does memory population work with CCA if priv_pfn and pfn are the same?
> > I am curious how the memory population will work with in-place
> > conversion support available for guest_memfd files.
>
> The RMM interface doesn't support an in-place conversion. The
> RMI_DATA_CREATE operation takes the PA of the already delegated
> granule[1] along with the PA of a non-delegated granule with the data.
Ok, I think this will need a source buffer from userspace that lives
outside guest_memfd once guest_memfd supports a single backing for
guest memory. You might want to simulate a private access fault for
destination GPAs backed by guest_memfd ranges for this initial data
population, similar to how memory population works today with TDX
VMs.

Note that with single backing around, at least for x86, KVM
shared/private stage-2 faults will always be served using guest_memfd
as the source of truth (i.e. not via userspace page tables for shared
faults).
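
For illustration, the userspace side could then look something like
the below -- a sketch modelled on TDX's KVM_TDX_INIT_MEM_REGION; the
KVM_CCA_* ioctl name, number and struct here are made up:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical CCA analogue of struct kvm_tdx_init_mem_region */
struct kvm_cca_init_mem_region {
	uint64_t source_addr;	/* plain buffer, outside guest_memfd */
	uint64_t gpa;		/* destination IPA, backed by guest_memfd */
	uint64_t nr_pages;
};

#define KVM_CCA_INIT_MEM_REGION \
	_IOWR(KVMIO, 0xff, struct kvm_cca_init_mem_region)

static int populate_initial_image(int vm_fd, void *image,
				  uint64_t gpa, uint64_t nr_pages)
{
	struct kvm_cca_init_mem_region region = {
		.source_addr = (uint64_t)(uintptr_t)image,
		.gpa = gpa,
		.nr_pages = nr_pages,
	};

	/*
	 * KVM would resolve the destination pfn through guest_memfd,
	 * as for a private stage-2 fault, and copy from source_addr
	 * into the realm via RMI_DATA_CREATE.
	 */
	return ioctl(vm_fd, KVM_CCA_INIT_MEM_REGION, &region);
}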
>
> So mimicking an in-place conversion requires copying the data from the
> page, delegating the (original) page and then using RMI_DATA_CREATE,
> which copies the data back. Fundamentally, because there may be memory
> encryption involved, there is going to be a requirement for this double
> memcpy() approach. Note that this is only relevant during the initial
> setup phase - CCA doesn't (at least currently) permit populated pages to
> be provided to the guest while it is running.
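To check my understanding, mimicking in-place per page would then be
roughly the below -- a sketch reusing this series' helpers, with the
loop over RMM granules and the error unwinding elided:

static int realm_populate_page_in_place(struct realm *realm,
					unsigned long ipa,
					struct page *page,
					unsigned long flags)
{
	phys_addr_t phys = page_to_phys(page);
	void *bounce = (void *)__get_free_page(GFP_KERNEL);
	int ret;

	if (!bounce)
		return -ENOMEM;

	/* First copy: stash the data before the page is delegated */
	memcpy(bounce, page_address(page), PAGE_SIZE);

	ret = rmi_granule_delegate(phys);
	if (!ret)
		/* Second copy: the RMM copies the data back in */
		ret = realm_create_protected_data_granule(realm, ipa, phys,
							  virt_to_phys(bounce),
							  flags);

	free_page((unsigned long)bounce);
	return ret;
}
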
>
> The approach this series takes pre-dates the guest_memfd discussions
> and so assumes that the shared memory is not (directly) provided by
> guest_memfd but comes from the user space pointer provided in the
> memslot. It would be possible (with the patches proposed) for the VMM
> to mmap() the guest_memfd while the memory is shared so as to reuse
> the physical pages.
>
> I do also plan to look at supporting the use of the guest_memfd for the
> shared memory directly. But I've been waiting for the discussions to
> conclude before attempting to implement that.
>
> [1] A 'granule' is the RMM's idea of a page size (RMM_PAGE_SIZE), which
> is currently (RMM v1.0) always 4k, so it may differ when Linux is
> running with a larger page size.
>
> Thanks,
> Steve
>