Message-ID: <aMlly_9kDE04Inn2@google.com>
Date: Tue, 16 Sep 2025 13:27:39 +0000
From: Mostafa Saleh <smostafa@...gle.com>
To: Will Deacon <will@...nel.org>
Cc: linux-kernel@...r.kernel.org, kvmarm@...ts.linux.dev,
linux-arm-kernel@...ts.infradead.org, iommu@...ts.linux.dev,
maz@...nel.org, oliver.upton@...ux.dev, joey.gouly@....com,
suzuki.poulose@....com, yuzenghui@...wei.com,
catalin.marinas@....com, robin.murphy@....com,
jean-philippe@...aro.org, qperret@...gle.com, tabba@...gle.com,
jgg@...pe.ca, mark.rutland@....com, praan@...gle.com
Subject: Re: [PATCH v4 02/28] KVM: arm64: Donate MMIO to the hypervisor
On Tue, Sep 09, 2025 at 03:12:45PM +0100, Will Deacon wrote:
> On Tue, Aug 19, 2025 at 09:51:30PM +0000, Mostafa Saleh wrote:
> > Add a function to donate MMIO to the hypervisor so that IOMMU
> > hypervisor drivers can use it to protect the IOMMU's MMIO regions.
> > The initial attempt was to add a new flag to "___pkvm_host_donate_hyp"
> > so it would accept MMIO. However, that had many problems: it was quite
> > intrusive for host/hyp to check/set the page state to make it aware of
> > MMIO and to encode that state in the page table, and it is called in
> > paths that can be sensitive to performance (FF-A, VMs, ...).
> >
> > As donating MMIO is very rare, and we don't need to encode the full
> > state, it's reasonable to have a separate function for this.
> > It initializes the host stage-2 page table with an invalid leaf carrying
> > the owner ID, to prevent the host from mapping the page on faults.
> >
> > Also, prevent kvm_pgtable_stage2_unmap() from removing the owner ID
> > from stage-2 PTEs, as it can be triggered from the recycle logic under
> > memory pressure. No code relies on this behaviour, as all ownership
> > changes are done via kvm_pgtable_stage2_set_owner().
> >
> > For the error path in IOMMU drivers, add a function to donate the MMIO
> > back from hyp to the host.
> >
> > Signed-off-by: Mostafa Saleh <smostafa@...gle.com>
> > ---
> > arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 2 +
> > arch/arm64/kvm/hyp/nvhe/mem_protect.c | 64 +++++++++++++++++++
> > arch/arm64/kvm/hyp/pgtable.c | 9 +--
> > 3 files changed, 68 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > index 52d7ee91e18c..98e173da0f9b 100644
> > --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > @@ -37,6 +37,8 @@ int __pkvm_host_share_hyp(u64 pfn);
> > int __pkvm_host_unshare_hyp(u64 pfn);
> > int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages);
> > int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot);
> > +int __pkvm_host_donate_hyp_mmio(u64 pfn);
> > +int __pkvm_hyp_donate_host_mmio(u64 pfn);
> > int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages);
> > int __pkvm_host_share_ffa(u64 pfn, u64 nr_pages);
> > int __pkvm_host_unshare_ffa(u64 pfn, u64 nr_pages);
> > diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > index 861e448183fd..c9a15ef6b18d 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > @@ -799,6 +799,70 @@ int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot)
> > return ret;
> > }
> >
> > +int __pkvm_host_donate_hyp_mmio(u64 pfn)
> > +{
> > + u64 phys = hyp_pfn_to_phys(pfn);
> > + void *virt = __hyp_va(phys);
> > + int ret;
> > + kvm_pte_t pte;
> > +
> > + host_lock_component();
> > + hyp_lock_component();
> > +
> > + ret = kvm_pgtable_get_leaf(&host_mmu.pgt, phys, &pte, NULL);
> > + if (ret)
> > + goto unlock;
> > +
> > + if (pte && !kvm_pte_valid(pte)) {
> > + ret = -EPERM;
> > + goto unlock;
> > + }
>
> Shouldn't we first check that the pfn is indeed MMIO? Otherwise, testing
> the pte for the ownership information isn't right.
I will add it, although the input should be trusted, as it comes from the
hypervisor SMMUv3 driver.
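For reference, a minimal check could look roughly like the sketch below
(untested; it relies on the existing addr_is_memory() helper in
mem_protect.c, everything else is illustrative):

```c
/* Sketch: reject donation of pfns that are backed by memory rather
 * than MMIO, before inspecting the host stage-2 PTE. */
u64 phys = hyp_pfn_to_phys(pfn);

if (addr_is_memory(phys))
	return -EINVAL;
```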
>
> > + ret = kvm_pgtable_get_leaf(&pkvm_pgtable, (u64)virt, &pte, NULL);
> > + if (ret)
> > + goto unlock;
> > + if (pte) {
> > + ret = -EBUSY;
> > + goto unlock;
> > + }
> > +
> > + ret = pkvm_create_mappings_locked(virt, virt + PAGE_SIZE, PAGE_HYP_DEVICE);
> > + if (ret)
> > + goto unlock;
> > + /*
> > + * We set HYP as the owner of the MMIO pages in the host stage-2, for:
> > + * - host aborts: host_stage2_adjust_range() would fail for invalid non zero PTEs.
> > + * - recycle under memory pressure: host_stage2_unmap_dev_all() would call
> > + * kvm_pgtable_stage2_unmap() which will not clear non zero invalid ptes (counted).
> > + * - other MMIO donation: Would fail as we check that the PTE is valid or empty.
> > + */
> > + WARN_ON(host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, phys,
> > + PAGE_SIZE, &host_s2_pool, PKVM_ID_HYP));
> > +unlock:
> > + hyp_unlock_component();
> > + host_unlock_component();
> > +
> > + return ret;
> > +}
> > +
> > +int __pkvm_hyp_donate_host_mmio(u64 pfn)
> > +{
> > + u64 phys = hyp_pfn_to_phys(pfn);
> > + u64 virt = (u64)__hyp_va(phys);
> > + size_t size = PAGE_SIZE;
> > +
> > + host_lock_component();
> > + hyp_lock_component();
>
> Shouldn't we check that:
>
> 1. pfn is mmio
> 2. pfn is owned by hyp
> 3. The host doesn't have something mapped at pfn already
>
> ?
>
I thought about this initially, but since:
- this code is only called from the hypervisor with trusted
  inputs (only at boot), and
- it is only called on the error path,
a WARN_ON on failure to unmap the MMIO pages seemed good enough,
to avoid extra code.
But I can add the checks if you think they are necessary; we will need
to add new helpers for MMIO state, though.
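If we do add them, something along these lines might be enough (hand-written
sketch, not tested; addr_is_memory() already exists, the rest mirrors the
lookups on the donate path above):

```c
/* Sketch: sanity checks for __pkvm_hyp_donate_host_mmio(), to be
 * performed with the host and hyp locks held. */
if (addr_is_memory(phys))
	return -EINVAL;		/* 1. pfn must be MMIO */

ret = kvm_pgtable_get_leaf(&pkvm_pgtable, virt, &pte, NULL);
if (ret || !kvm_pte_valid(pte))
	return -EPERM;		/* 2. must currently be mapped at EL2 */

ret = kvm_pgtable_get_leaf(&host_mmu.pgt, phys, &pte, NULL);
if (ret || kvm_pte_valid(pte))
	return -EBUSY;		/* 3. host must not have it mapped */
```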
> > + WARN_ON(kvm_pgtable_hyp_unmap(&pkvm_pgtable, virt, size) != size);
> > + WARN_ON(host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, phys,
> > + PAGE_SIZE, &host_s2_pool, PKVM_ID_HOST));
> > + hyp_unlock_component();
> > + host_unlock_component();
> > +
> > + return 0;
> > +}
> > +
> > int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages)
> > {
> > return ___pkvm_host_donate_hyp(pfn, nr_pages, PAGE_HYP);
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index c351b4abd5db..ba06b0c21d5a 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -1095,13 +1095,8 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > kvm_pte_t *childp = NULL;
> > bool need_flush = false;
> >
> > - if (!kvm_pte_valid(ctx->old)) {
> > - if (stage2_pte_is_counted(ctx->old)) {
> > - kvm_clear_pte(ctx->ptep);
> > - mm_ops->put_page(ctx->ptep);
> > - }
> > - return 0;
> > - }
> > + if (!kvm_pte_valid(ctx->old))
> > + return stage2_pte_is_counted(ctx->old) ? -EPERM : 0;
>
> Can this code be reached for the guest? For example, if
> pkvm_pgtable_stage2_destroy() runs into an MMIO-guarded pte on teardown?
AFAICT, a VM's page table is destroyed via reclaim_pgtable_pages() =>
kvm_pgtable_stage2_destroy() => kvm_pgtable_stage2_destroy_range() ... =>
stage2_free_walker(), which doesn't go through stage2_unmap_walker(), so
that should be fine.
Thanks,
Mostafa
>
> Will