Message-ID: <CAGtprH-TChZuLgb0sOU_14YGpCynw7sukLT0tP9sEzzd040dHw@mail.gmail.com>
Date: Fri, 15 Aug 2025 18:56:36 -0700
From: Vishal Annapurve <vannapurve@...gle.com>
To: Steven Price <steven.price@....com>
Cc: kvm@...r.kernel.org, kvmarm@...ts.linux.dev,
Catalin Marinas <catalin.marinas@....com>, Marc Zyngier <maz@...nel.org>, Will Deacon <will@...nel.org>,
James Morse <james.morse@....com>, Oliver Upton <oliver.upton@...ux.dev>,
Suzuki K Poulose <suzuki.poulose@....com>, Zenghui Yu <yuzenghui@...wei.com>,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
Joey Gouly <joey.gouly@....com>, Alexandru Elisei <alexandru.elisei@....com>,
Christoffer Dall <christoffer.dall@....com>, Fuad Tabba <tabba@...gle.com>, linux-coco@...ts.linux.dev,
Ganapatrao Kulkarni <gankulkarni@...amperecomputing.com>, Gavin Shan <gshan@...hat.com>,
Shanker Donthineni <sdonthineni@...dia.com>, Alper Gun <alpergun@...gle.com>,
"Aneesh Kumar K . V" <aneesh.kumar@...nel.org>, Emi Kisanuki <fj0570is@...itsu.com>
Subject: Re: [PATCH v9 19/43] arm64: RME: Allow populating initial contents
On Fri, Aug 15, 2025 at 8:48 AM Steven Price <steven.price@....com> wrote:
>
> On 14/08/2025 17:26, Vishal Annapurve wrote:
> > On Wed, Aug 13, 2025 at 2:30 AM Steven Price <steven.price@....com> wrote:
> >>
> >> On 01/08/2025 02:56, Vishal Annapurve wrote:
> >>> On Wed, Jun 11, 2025 at 3:59 AM Steven Price <steven.price@....com> wrote:
> >>>>
> >>>> +static int realm_create_protected_data_page(struct realm *realm,
> >>>> + unsigned long ipa,
> >>>> + kvm_pfn_t dst_pfn,
> >>>> + kvm_pfn_t src_pfn,
> >>>> + unsigned long flags)
> >>>> +{
> >>>> + unsigned long rd = virt_to_phys(realm->rd);
> >>>> + phys_addr_t dst_phys, src_phys;
> >>>> + bool undelegate_failed = false;
> >>>> + int ret, offset;
> >>>> +
> >>>> + dst_phys = __pfn_to_phys(dst_pfn);
> >>>> + src_phys = __pfn_to_phys(src_pfn);
> >>>> +
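> >>>> +        /* A host page may span multiple RMM granules (PAGE_SIZE >= RMM_PAGE_SIZE). */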
> >>>> + for (offset = 0; offset < PAGE_SIZE; offset += RMM_PAGE_SIZE) {
> >>>> + ret = realm_create_protected_data_granule(realm,
> >>>> + ipa,
> >>>> + dst_phys,
> >>>> + src_phys,
> >>>> + flags);
> >>>> + if (ret)
> >>>> + goto err;
> >>>> +
> >>>> + ipa += RMM_PAGE_SIZE;
> >>>> + dst_phys += RMM_PAGE_SIZE;
> >>>> + src_phys += RMM_PAGE_SIZE;
> >>>> + }
> >>>> +
> >>>> + return 0;
> >>>> +
> >>>> +err:
> >>>> + if (ret == -EIO) {
> >>>> + /* current offset needs undelegating */
> >>>> + if (WARN_ON(rmi_granule_undelegate(dst_phys)))
> >>>> + undelegate_failed = true;
> >>>> + }
> >>>> + while (offset > 0) {
> >>>> + ipa -= RMM_PAGE_SIZE;
> >>>> + offset -= RMM_PAGE_SIZE;
> >>>> + dst_phys -= RMM_PAGE_SIZE;
> >>>> +
> >>>> + rmi_data_destroy(rd, ipa, NULL, NULL);
> >>>> +
> >>>> + if (WARN_ON(rmi_granule_undelegate(dst_phys)))
> >>>> + undelegate_failed = true;
> >>>> + }
> >>>> +
> >>>> + if (undelegate_failed) {
> >>>> + /*
> >>>> + * A granule could not be undelegated,
> >>>> + * so the page has to be leaked
> >>>> + */
> >>>> + get_page(pfn_to_page(dst_pfn));
> >>>
> >>> I would like to point out that support for in-place conversion
> >>> with guest_memfd using hugetlb pages [1] is under discussion.
> >>>
> >>> As part of in-place conversion, the policy we are rooting for is to
> >>> avoid any refcounts from KVM on folios supplied by guest_memfd,
> >>> since in-place conversion works by splitting and merging folios
> >>> during memory conversion, as per the discussion at LPC [2].
> >>
> >> CCA doesn't really support "in-place" conversions (see more detail
> >> below). But here the issue is that something has gone wrong and the RMM
> >> is refusing to give us a page back.
> >
> > I think I overloaded the term "in-place" conversion in this context.
> > I was talking about supporting "in-place" conversion without data
> > preservation, i.e. the host will use the same GPA->HPA range mapping
> > even after conversions, ensuring a single backing for guest memory.
> > This is achieved by guest_memfd keeping track of private/shared
> > ranges, updating that tracking metadata in response to userspace
> > IOCTLs.
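> >
> > As a rough illustration of what that tracking could look like (the
> > names below are hypothetical, not the actual guest_memfd code):
> >
> > /* Hypothetical gfn-granular shared/private tracking in guest_memfd. */
> > #define GMEM_SHARED     xa_mk_value(0)
> > #define GMEM_PRIVATE    xa_mk_value(1)
> >
> > static int gmem_set_range_attr(struct xarray *attrs,
> >                                unsigned long start_gfn,
> >                                unsigned long nr_pages, void *attr)
> > {
> >         unsigned long gfn;
> >
> >         /* Called from the conversion IOCTL to flip a GFN range. */
> >         for (gfn = start_gfn; gfn < start_gfn + nr_pages; gfn++) {
> >                 int err = xa_err(xa_store(attrs, gfn, attr, GFP_KERNEL));
> >
> >                 if (err)
> >                         return err;
> >         }
> >         return 0;
> > }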
>
> Yes, so for a destructive conversion this is fine. We can remove the
> page from the protected region and then place the same physical page in
> the shared region (or vice versa).
>
> Population is a special case because it's effectively non-destructive,
> and in that case we need both the reference data and the final
> (protected) physical page available at the same time.
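>
> To make that concrete, the populate path has roughly this shape (a
> sketch only - realm_create_protected_data_granule() isn't shown in the
> hunk above, and the RMI call signatures here are assumed):
>
>         /* The destination granule must first be given to the realm world. */
>         ret = rmi_granule_delegate(dst_phys);
>         if (ret)
>                 return -ENXIO;
>
>         /*
>          * DATA_CREATE reads the shared source granule and installs an
>          * encrypted copy at @ipa - so src and dst must both exist here.
>          */
>         ret = rmi_data_create(rd, dst_phys, ipa, src_phys, flags);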
>
> >>
> >>>
> >>> The best way to avoid further use of this page with huge page support
> >>> around would be either:
> >>> 1) Explicitly inform guest_memfd of a particular pfn being in use by
> >>> KVM, without relying on page refcounts, or
> >>
> >> This might work, but note that the page is unavailable even after user
> >> space has freed the guest_memfd. So at some point the page needs to be
> >> marked so that it cannot be reallocated by the kernel. Holding a
> >> refcount isn't ideal but I haven't come up with a better idea.
> >>
> >> Note that this is a "should never happen" situation - the code will have
> >> WARN()ed already - so this is just a best effort to allow the system to
> >> limp on.
> >>
> >>> 2) Set the page as hwpoisoned. (Needs further discussion)
> >>
> >> This certainly sounds like a closer fit - but I'm not very familiar with
> >> hwpoison so I don't know how easy it would be to integrate with this.
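> >>
> >> For reference, the minimal form of that approach would presumably be
> >> just the below (untested sketch - whether the rest of the
> >> memory-failure machinery copes with pages poisoned this way is
> >> exactly the open question):
> >>
> >>         /* Mark the stuck granule so the kernel never reuses it. */
> >>         SetPageHWPoison(pfn_to_page(dst_pfn));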
> >>
> >
> > We had similar discussions around Intel-specific SEPT management, and
> > the conclusion there was to simply not hold refcounts and to warn on
> > such failures [1].
> >
> > [1] https://lore.kernel.org/kvm/20250807094241.4523-1-yan.y.zhao@intel.com/
>
> So these paths (should) all warn already. I guess the question is
> whether we want the platform to limp on in these situations or not.
> Certainly converting the WARNs to BUG_ON would be very easy, but the
> intention here was to give the user some chance to save their work
> before killing the system.
>
> Just WARNing might be ok, but if the kernel allocates the page for one
> of its data structures then it's effectively a BUG_ON - there's no
> reasonable recovery. Reallocation into user space we can sort of
> handle, but only by killing the process.
Makes sense; this scenario is different from TDX, as the host will run
into GPT faults when accessing memory owned by the realm world, with no
good way to recover. Let's try to find the right amount of complexity
to take on when handling errors for such rare scenarios, without
relying on refcounts, as we start looking into huge page support.
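
To make option (1) above concrete: the error path could end with
something like the below instead of get_page() (the helper is purely
hypothetical, just to show the shape of a refcount-free design):

        if (undelegate_failed) {
                /*
                 * Granule is stuck in the realm world: have guest_memfd
                 * mark the folio unusable (e.g. via hwpoison) so it is
                 * never reallocated, rather than pinning it forever.
                 */
                kvm_gmem_mark_unusable(pfn_to_page(dst_pfn));
        }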
>
> >>> This page refcounting strategy will have to be revisited depending on
> >>> which series lands first. That being said, it would be great if ARM
> >>> could review/verify whether the series [1] works for backing CCA VMs
> >>> with huge pages.
> >>>
> >>> [1] https://lore.kernel.org/kvm/cover.1747264138.git.ackerleytng@google.com/
> >>> [2] https://lpc.events/event/18/contributions/1764/