Message-ID: <CAGtprH9rbW173qZyC5_cHkKT5J4YDSg8itFcR2VZvSY88fsGrQ@mail.gmail.com>
Date: Fri, 15 Aug 2025 11:18:15 -0700
From: Vishal Annapurve <vannapurve@...gle.com>
To: Steven Price <steven.price@....com>
Cc: kvm@...r.kernel.org, kvmarm@...ts.linux.dev,
Catalin Marinas <catalin.marinas@....com>, Marc Zyngier <maz@...nel.org>, Will Deacon <will@...nel.org>,
James Morse <james.morse@....com>, Oliver Upton <oliver.upton@...ux.dev>,
Suzuki K Poulose <suzuki.poulose@....com>, Zenghui Yu <yuzenghui@...wei.com>,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
Joey Gouly <joey.gouly@....com>, Alexandru Elisei <alexandru.elisei@....com>,
Christoffer Dall <christoffer.dall@....com>, Fuad Tabba <tabba@...gle.com>, linux-coco@...ts.linux.dev,
Ganapatrao Kulkarni <gankulkarni@...amperecomputing.com>, Gavin Shan <gshan@...hat.com>,
Shanker Donthineni <sdonthineni@...dia.com>, Alper Gun <alpergun@...gle.com>,
"Aneesh Kumar K . V" <aneesh.kumar@...nel.org>, Emi Kisanuki <fj0570is@...itsu.com>
Subject: Re: [PATCH v9 19/43] arm64: RME: Allow populating initial contents
On Fri, Aug 15, 2025 at 8:48 AM Steven Price <steven.price@....com> wrote:
>
> >>>> +static int populate_region(struct kvm *kvm,
> >>>> + phys_addr_t ipa_base,
> >>>> + phys_addr_t ipa_end,
> >>>> + unsigned long data_flags)
> >>>> +{
> >>>> + struct realm *realm = &kvm->arch.realm;
> >>>> + struct kvm_memory_slot *memslot;
> >>>> + gfn_t base_gfn, end_gfn;
> >>>> + int idx;
> >>>> + phys_addr_t ipa = ipa_base;
> >>>> + int ret = 0;
> >>>> +
> >>>> + base_gfn = gpa_to_gfn(ipa_base);
> >>>> + end_gfn = gpa_to_gfn(ipa_end);
> >>>> +
> >>>> + idx = srcu_read_lock(&kvm->srcu);
> >>>> + memslot = gfn_to_memslot(kvm, base_gfn);
> >>>> + if (!memslot) {
> >>>> + ret = -EFAULT;
> >>>> + goto out;
> >>>> + }
> >>>> +
> >>>> + /* We require the region to be contained within a single memslot */
> >>>> + if (memslot->base_gfn + memslot->npages < end_gfn) {
> >>>> + ret = -EINVAL;
> >>>> + goto out;
> >>>> + }
> >>>> +
> >>>> + if (!kvm_slot_can_be_private(memslot)) {
> >>>> + ret = -EPERM;
> >>>> + goto out;
> >>>> + }
> >>>> +
> >>>> + while (ipa < ipa_end) {
> >>>> + struct vm_area_struct *vma;
> >>>> + unsigned long hva;
> >>>> + struct page *page;
> >>>> + bool writeable;
> >>>> + kvm_pfn_t pfn;
> >>>> + kvm_pfn_t priv_pfn;
> >>>> + struct page *gmem_page;
> >>>> +
> >>>> + hva = gfn_to_hva_memslot(memslot, gpa_to_gfn(ipa));
> >>>> + vma = vma_lookup(current->mm, hva);
> >>>> + if (!vma) {
> >>>> + ret = -EFAULT;
> >>>> + break;
> >>>> + }
> >>>> +
> >>>> + pfn = __kvm_faultin_pfn(memslot, gpa_to_gfn(ipa), FOLL_WRITE,
> >>>> + &writeable, &page);
> >>>
> >>> Is this assuming double backing of guest memory ranges? Is this logic
> >>> trying to simulate a shared fault?
> >>
> >> Yes and yes...
> >>
> >>> Does memory population work with CCA if priv_pfn and pfn are the same?
> >>> I am curious how the memory population will work with in-place
> >>> conversion support available for guest_memfd files.
> >>
> >> The RMM interface doesn't support an in-place conversion. The
> >> RMI_DATA_CREATE operation takes the PA of the already delegated
> >> granule[1] along with the PA of a non-delegated granule with the data.
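
Side note for anyone following along: per my reading of the RMM spec,
the call shape is roughly the prototype below, i.e. the source and the
destination really are two distinct physical granules. Treat the exact
names and types as approximate, this is only a sketch:

/*
 * Sketch of the RMI_DATA_CREATE wrapper; names and types approximate.
 *
 *   rd    - PA of the realm descriptor
 *   data  - PA of the already delegated destination granule
 *   ipa   - realm IPA the granule is mapped at
 *   src   - PA of a non-delegated granule holding the initial data
 *   flags - e.g. whether the contents are measured
 */
unsigned long rmi_data_create(unsigned long rd, unsigned long data,
                              unsigned long ipa, unsigned long src,
                              unsigned long flags);
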
> >
> > Ok, I think this will need a source buffer from userspace that is
> > outside guest_memfd once guest_memfd supports a single backing for
> > guest memory. You might want to simulate a private access fault for
> > destination GPAs backed by guest_memfd ranges for this initial data
> > population, similar to how memory population works today with TDX
> > VMs.
>
> Yes, that might well be the best option. At the moment I'm not sure what
> the best approach from the perspective of the VMM is. The current setup
> is nice because the VMM can populate the VM just like a normal non-coco
> setup, so we don't need to special-case anything. Requiring the VMM to
> load the initial contents into some other memory for the purpose of the
> populate call is a little ugly.
IIUC, the ARM CCA architecture requires two buffers for initial memory
population irrespective of whether we do it within the kernel or in
userspace.
IMO, doing it in the kernel is uglier than doing it in userspace.
Baking the requirement of passing two buffers into the userspace ABI
for populate seems cleaner to me.
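
To make that concrete, the ABI could look roughly like the sketch
below. The struct and field names are invented purely for illustration
(uses __u64 from <linux/types.h>); nothing here matches a posted patch:

/* Hypothetical populate ioctl payload, for illustration only. */
struct kvm_arm_rme_populate_args {
        __u64 gpa;        /* destination IPA, backed by guest_memfd */
        __u64 size;       /* length in bytes, granule aligned */
        __u64 src_uaddr;  /* userspace source buffer, outside guest_memfd */
        __u64 flags;      /* e.g. whether to measure the contents */
};

/*
 * The VMM stages the payload in an ordinary buffer and passes both
 * addresses; the kernel delegates the destination granules and gives
 * the RMM the source pages to copy (and optionally measure) from.
 */
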
>
> The other option is to just return to the double-memcpy() approach of
> emulating an in-place content-preserving populate by the kernel copying
> the data out of guest_memfd into a temporary page before doing the
> conversion and the RMM copying back from the temporary page. I worry
> about the performance of this, although it doesn't prevent the first
> option being added in the future if necessary. The big benefit is that
> it goes back to looking like a non-coco VM (but using guest_memfd).
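
Just to spell out that flow, per destination page it would be something
like the sketch below (helper names are guesses loosely based on this
series and may not match; error unwinding, e.g. undelegating on
failure, is omitted; assumes the usual mm headers plus the series' RMI
wrappers):

static int populate_page_double_copy(phys_addr_t rd_phys, phys_addr_t ipa,
                                     struct page *gmem_page,
                                     unsigned long flags)
{
        /* Temporary non-delegated page to act as the RMM's copy source. */
        struct page *scratch = alloc_page(GFP_KERNEL);
        int ret;

        if (!scratch)
                return -ENOMEM;

        /* 1. Preserve the contents the VMM staged in guest_memfd. */
        copy_page(page_address(scratch), page_address(gmem_page));

        /* 2. Delegate the destination so it can become realm memory. */
        ret = rmi_granule_delegate(page_to_phys(gmem_page));
        if (!ret)
                /* 3. The RMM copies (and measures) from the scratch page. */
                ret = rmi_data_create(rd_phys, page_to_phys(gmem_page),
                                      ipa, page_to_phys(scratch), flags);

        __free_page(scratch);
        return ret;
}
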
That said, the populate path is CoCo-specific; I don't see the benefit
of trying to match the behavior of non-CoCo VMs in userspace.
>
> > Note that with single backing around, at least for x86, KVM
> > shared/private stage2 faults will always be served using guest_memfd
> > as the source of truth (i.e. not via userspace pagetables for shared
> > faults).
>
> Indeed, it's not yet clear to me whether we can only support this
> "single backing" world for CCA, or if it's going to be an option
> alongside using guest_memfd only for protected memory. Obviously we need
> the guest_memfd changes to land first too...
Having dual backing support just for the initial populate seems like
overkill to me. How much payload are we generally talking about, as a
percentage of the total memory size?
IMO, dual backing for guest_memfd was a stopgap solution and should be
deprecated once single memory backing support is in place.
CoCo VMs should default to using a single backing going forward. Mmap
support [1] for guest_memfd is close to being merged, and in-place
conversion support is likely to follow soon.
[1] https://lore.kernel.org/all/20250729225455.670324-1-seanjc@google.com/
>
> At the moment I'm planning to post a v10 of this series soon, keeping
> the same API. But we will definitely want to support the new guest_memfd
> approach when it's ready.
>
> Thanks,
> Steve
>