Message-ID: <86wm4fus4n.wl-maz@kernel.org>
Date: Tue, 28 Oct 2025 12:29:12 +0000
From: Marc Zyngier <maz@...nel.org>
To: Ganapatrao Kulkarni <gankulkarni@...amperecomputing.com>
Cc: linux-arm-kernel@...ts.infradead.org,
kvmarm@...ts.linux.dev,
linux-kernel@...r.kernel.org,
oliver.upton@...ux.dev,
will@...nel.org,
catalin.marinas@....com,
suzuki.poulose@....com,
joey.gouly@....com,
yuzenghui@...wei.com,
darren@...amperecomputing.com,
cl@...two.org,
scott@...amperecomputing.com,
gklkml16@...il.com
Subject: Re: [PATCH v2] KVM: arm64: nv: Optimize unmapping of shadow S2-MMU tables
On Tue, 28 Oct 2025 06:02:03 +0000,
Ganapatrao Kulkarni <gankulkarni@...amperecomputing.com> wrote:
>
> On 10/23/2025 8:05 PM, Marc Zyngier wrote:
> > On Thu, 23 Oct 2025 12:11:42 +0100,
> > Ganapatrao Kulkarni <gankulkarni@...amperecomputing.com> wrote:
> >>
> >>
> >> Hi Marc, Oliver,
> >>
> >> On 10/13/2025 12:21 PM, Ganapatrao Kulkarni wrote:
> >>> As of commit ec14c272408a ("KVM: arm64: nv: Unmap/flush shadow
> >>> stage 2 page tables"), an unmap of a canonical IPA range mapped at L1
> >>> triggers invalidation in L1 S2-MMU and in all active shadow (L2) S2-MMU
> >>> tables. Because there is no direct mapping to locate the corresponding
> >>> shadow IPAs, the code falls back to a full S2-MMU page-table walk and
> >>> invalidation across the entire L1 address space.
> >>>
> >>> For 4K pages this causes roughly 256K loop iterations (about 8M for
> >>> 64K pages) per unmap, which can severely impact performance on large
> >>> systems and even cause soft lockups during NV (L1/L2) boots with many
> >>> CPUs and large memory. It also causes long delays during L1 reboot.
> >>>
> >>> This patch adds a maple-tree-based lookup that records canonical-IPA to
> >>> shadow-IPA mappings whenever a page is mapped into any shadow (L2)
> >>> table. On unmap, the lookup is used to target only those shadow IPAs
> >>> which are fully or partially mapped in shadow S2-MMU tables, avoiding
> >>> a full-address-space walk and unnecessary unmap/flush operations.
> >>>
> >>> The lookup is updated on map/unmap operations so entries remain
> >>> consistent with shadow table state. Use it during unmap to invalidate
> >>> only affected shadow IPAs, avoiding unnecessary CPU work and reducing
> >>> latency when shadow mappings are sparse.
> >>>
> >>> Reviewed-by: Christoph Lameter (Ampere) <cl@...two.org>
> >>> Signed-off-by: Ganapatrao Kulkarni <gankulkarni@...amperecomputing.com>
> >>> ---
> >>>
> >>> Changes since v1:
> >>> Rebased to 6.18-rc1.
> >>> Fixed an alignment issue when adding the shadow IPA range
> >>> to the lookup.
> >>>
> >>> Changes since RFC v1:
> >>> Added a maple-tree-based lookup and addressed review
> >>> comments from [1].
> >>>
> >>> [1] https://lkml.indiana.edu/2403.0/03801.html
> >>>
> >>> arch/arm64/include/asm/kvm_host.h | 3 +
> >>> arch/arm64/include/asm/kvm_nested.h | 9 +++
> >>> arch/arm64/kvm/mmu.c | 17 ++--
> >>> arch/arm64/kvm/nested.c | 120 ++++++++++++++++++++++++++--
> >>> 4 files changed, 138 insertions(+), 11 deletions(-)
> >>>
> >>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> >>> index b763293281c8..e774681c6ba4 100644
> >>> --- a/arch/arm64/include/asm/kvm_host.h
> >>> +++ b/arch/arm64/include/asm/kvm_host.h
> >>> @@ -227,6 +227,9 @@ struct kvm_s2_mmu {
> >>> * >0: Somebody is actively using this.
> >>> */
> >>> atomic_t refcnt;
> >>> +
> >>> + /* For IPA to shadow IPA lookup */
> >>> + struct maple_tree nested_mmu_mt;
> >>> };
> >>> struct kvm_arch_memory_slot {
> >>> diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/kvm_nested.h
> >>> index f7c06a840963..5b7c4e7ed18f 100644
> >>> --- a/arch/arm64/include/asm/kvm_nested.h
> >>> +++ b/arch/arm64/include/asm/kvm_nested.h
> >>> @@ -69,6 +69,8 @@ extern void kvm_init_nested(struct kvm *kvm);
> >>> extern int kvm_vcpu_init_nested(struct kvm_vcpu *vcpu);
> >>> extern void kvm_init_nested_s2_mmu(struct kvm_s2_mmu *mmu);
> >>> extern struct kvm_s2_mmu *lookup_s2_mmu(struct kvm_vcpu *vcpu);
> >>> +extern int add_to_shadow_ipa_lookup(struct kvm_pgtable *pgt, u64 shadow_ipa, u64 ipa,
> >>> + u64 size);
> >>> union tlbi_info;
> >>> @@ -95,6 +97,12 @@ struct kvm_s2_trans {
> >>> u64 desc;
> >>> };
> >>> +struct shadow_ipa_map {
> >>> + u64 shadow_ipa;
> >>> + u64 ipa;
> >>> + u64 size;
> >>> +};
> >>> +
> >>> static inline phys_addr_t kvm_s2_trans_output(struct kvm_s2_trans *trans)
> >>> {
> >>> return trans->output;
> >>> @@ -132,6 +140,7 @@ extern int kvm_s2_handle_perm_fault(struct kvm_vcpu *vcpu,
> >>> extern int kvm_inject_s2_fault(struct kvm_vcpu *vcpu, u64 esr_el2);
> >>> extern void kvm_nested_s2_wp(struct kvm *kvm);
> >>> extern void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block);
> >>> +extern void kvm_nested_s2_unmap_range(struct kvm *kvm, u64 ipa, u64 size, bool may_block);
> >>> extern void kvm_nested_s2_flush(struct kvm *kvm);
> >>> unsigned long compute_tlb_inval_range(struct kvm_s2_mmu *mmu,
> >>> u64 val);
> >>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> >>> index 7cc964af8d30..27c120556e1b 100644
> >>> --- a/arch/arm64/kvm/mmu.c
> >>> +++ b/arch/arm64/kvm/mmu.c
> >>> @@ -1872,6 +1872,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >>> ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault_ipa, vma_pagesize,
> >>> __pfn_to_phys(pfn), prot,
> >>> memcache, flags);
> >>> +
> >>> + /* Add to lookup, if canonical IPA range mapped to shadow mmu */
> >>> + if (nested)
> >>> + add_to_shadow_ipa_lookup(pgt, fault_ipa, ipa, vma_pagesize);
> >>> }
> >>> out_unlock:
> >>> @@ -2094,14 +2098,15 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> >>> bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> >>> {
> >>> + gpa_t start = range->start << PAGE_SHIFT;
> >>> + gpa_t end = (range->end - range->start) << PAGE_SHIFT;
> >>> + bool may_block = range->may_block;
> >>> +
> >>> if (!kvm->arch.mmu.pgt)
> >>> return false;
> >>> - __unmap_stage2_range(&kvm->arch.mmu, range->start << PAGE_SHIFT,
> >>> - (range->end - range->start) << PAGE_SHIFT,
> >>> - range->may_block);
> >>> -
> >>> - kvm_nested_s2_unmap(kvm, range->may_block);
> >>> + __unmap_stage2_range(&kvm->arch.mmu, start, end, may_block);
> >>> + kvm_nested_s2_unmap_range(kvm, start, end, may_block);
> >>> return false;
> >>> }
> >>> @@ -2386,7 +2391,7 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
> >>> write_lock(&kvm->mmu_lock);
> >>> kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, size, true);
> >>> - kvm_nested_s2_unmap(kvm, true);
> >>> + kvm_nested_s2_unmap_range(kvm, gpa, size, true);
> >>> write_unlock(&kvm->mmu_lock);
> >>> }
> >>> diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
> >>> index 7a045cad6bdf..3a7035e7526a 100644
> >>> --- a/arch/arm64/kvm/nested.c
> >>> +++ b/arch/arm64/kvm/nested.c
> >>> @@ -7,6 +7,7 @@
> >>> #include <linux/bitfield.h>
> >>> #include <linux/kvm.h>
> >>> #include <linux/kvm_host.h>
> >>> +#include <linux/maple_tree.h>
> >>> #include <asm/fixmap.h>
> >>> #include <asm/kvm_arm.h>
> >>> @@ -725,6 +726,7 @@ void kvm_init_nested_s2_mmu(struct kvm_s2_mmu *mmu)
> >>> mmu->tlb_vttbr = VTTBR_CNP_BIT;
> >>> mmu->nested_stage2_enabled = false;
> >>> atomic_set(&mmu->refcnt, 0);
> >>> + mt_init_flags(&mmu->nested_mmu_mt, MM_MT_FLAGS);
> >>> }
> >>> void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
> >>> @@ -1067,6 +1069,99 @@ void kvm_nested_s2_wp(struct kvm *kvm)
> >>> kvm_invalidate_vncr_ipa(kvm, 0, BIT(kvm->arch.mmu.pgt->ia_bits));
> >>> }
> >>> +/*
> >>> + * Store range of canonical IPA mapped to a nested stage 2 mmu table.
> >>> + * Canonical IPA used as pivot in maple tree for the lookup later
> >>> + * while IPA unmap/flush.
> >>> + */
> >>> +int add_to_shadow_ipa_lookup(struct kvm_pgtable *pgt, u64 shadow_ipa,
> >>> + u64 ipa, u64 size)
> >>> +{
> >>> + struct kvm_s2_mmu *mmu;
> >>> + struct shadow_ipa_map *entry;
> >>> + unsigned long start, end;
> >>> + int ret;
> >>> +
> >>> + start = ALIGN_DOWN(ipa, size);
> >>> + end = start + size;
> >>> + mmu = pgt->mmu;
> >>> +
> >>> + entry = kzalloc(sizeof(struct shadow_ipa_map), GFP_KERNEL_ACCOUNT);
> >>> +
> >>> + if (!entry)
> >>> + return -ENOMEM;
> >>> +
> >>> + entry->ipa = start;
> >>> + entry->shadow_ipa = ALIGN_DOWN(shadow_ipa, size);
> >>> + entry->size = size;
> >>> + ret = mtree_store_range(&mmu->nested_mmu_mt, start, end - 1, entry,
> >>> + GFP_KERNEL_ACCOUNT);
> >>> + if (ret) {
> >>> + kfree(entry);
> >>> + WARN_ON(ret);
> >>> + }
> >>> +
> >>> + return ret;
> >>> +}
> >>> +
> >>> +static void nested_mtree_erase(struct maple_tree *mt, unsigned long start,
> >>> + unsigned long size)
> >>> +{
> >>> + void *entry = NULL;
> >>> +
> >>> + MA_STATE(mas, mt, start, start + size - 1);
> >>> +
> >>> + mtree_lock(mt);
> >>> + entry = mas_erase(&mas);
> >>> + mtree_unlock(mt);
> >>> + kfree(entry);
> >>> +}
> >>> +
> >>> +static void nested_mtree_erase_and_unmap_all(struct kvm_s2_mmu *mmu,
> >>> + unsigned long start, unsigned long end, bool may_block)
> >>> +{
> >>> + struct shadow_ipa_map *entry;
> >>> +
> >>> + mt_for_each(&mmu->nested_mmu_mt, entry, start, kvm_phys_size(mmu)) {
> >>> + kvm_stage2_unmap_range(mmu, entry->shadow_ipa, entry->size,
> >>> + may_block);
> >>> + kfree(entry);
> >>> + }
> >>> +
> >>> + mtree_destroy(&mmu->nested_mmu_mt);
> >>> + mt_init_flags(&mmu->nested_mmu_mt, MM_MT_FLAGS);
> >>> +}
> >>> +
> >>> +void kvm_nested_s2_unmap_range(struct kvm *kvm, u64 ipa, u64 size,
> >>> + bool may_block)
> >>> +{
> >>> + int i;
> >>> + struct shadow_ipa_map *entry;
> >>> +
> >>> + lockdep_assert_held_write(&kvm->mmu_lock);
> >>> +
> >>> + for (i = 0; i < kvm->arch.nested_mmus_size; i++) {
> >>> + struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
> >>> + unsigned long start = ipa;
> >>> + unsigned long end = ipa + size;
> >>> +
> >>> + if (!kvm_s2_mmu_valid(mmu))
> >>> + continue;
> >>> +
> >>> + do {
> >>> + entry = mt_find(&mmu->nested_mmu_mt, &start, end - 1);
> >>> + if (!entry)
> >>> + break;
> >>> +
> >>> + kvm_stage2_unmap_range(mmu, entry->shadow_ipa,
> >>> + entry->size, may_block);
> >>> + start = entry->ipa + entry->size;
> >>> + nested_mtree_erase(&mmu->nested_mmu_mt, entry->ipa,
> >>> + entry->size);
> >>> + } while (start < end);
> >>> + }
> >>> +}
> >>> +
> >>> void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block)
> >>> {
> >>> int i;
> >>> @@ -1076,8 +1171,10 @@ void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block)
> >>> for (i = 0; i < kvm->arch.nested_mmus_size; i++) {
> >>> struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
> >>> - if (kvm_s2_mmu_valid(mmu))
> >>> - kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), may_block);
> >>> + if (!kvm_s2_mmu_valid(mmu))
> >>> + continue;
> >>> +
> >>> + nested_mtree_erase_and_unmap_all(mmu, 0, kvm_phys_size(mmu), may_block);
> >>> }
> >>> kvm_invalidate_vncr_ipa(kvm, 0, BIT(kvm->arch.mmu.pgt->ia_bits));
> >>> @@ -1091,9 +1188,14 @@ void kvm_nested_s2_flush(struct kvm *kvm)
> >>> for (i = 0; i < kvm->arch.nested_mmus_size; i++) {
> >>> struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
> >>> + struct shadow_ipa_map *entry;
> >>> + unsigned long start = 0;
> >>> - if (kvm_s2_mmu_valid(mmu))
> >>> - kvm_stage2_flush_range(mmu, 0, kvm_phys_size(mmu));
> >>> + if (!kvm_s2_mmu_valid(mmu))
> >>> + continue;
> >>> +
> >>> + mt_for_each(&mmu->nested_mmu_mt, entry, start, kvm_phys_size(mmu))
> >>> + kvm_stage2_flush_range(mmu, entry->shadow_ipa, entry->size);
> >>> }
> >>> }
> >>> @@ -1104,8 +1206,16 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
> >>> for (i = 0; i < kvm->arch.nested_mmus_size; i++) {
> >>> struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
> >>> - if (!WARN_ON(atomic_read(&mmu->refcnt)))
> >>> + if (!WARN_ON(atomic_read(&mmu->refcnt))) {
> >>> + struct shadow_ipa_map *entry;
> >>> + unsigned long start = 0;
> >>> +
> >>> kvm_free_stage2_pgd(mmu);
> >>> +
> >>> + mt_for_each(&mmu->nested_mmu_mt, entry, start, kvm_phys_size(mmu))
> >>> + kfree(entry);
> >>> + mtree_destroy(&mmu->nested_mmu_mt);
> >>> + }
> >>> }
> >>> kvfree(kvm->arch.nested_mmus);
> >>> kvm->arch.nested_mmus = NULL;
> >>
> >> Any review comments or suggestions for this patch?
> >
> > None. This patch obviously fails to handle the basic cases that such
> > an "optimisation" must cover, such as multiple mappings to the same
> > IPA in the shadow S2, and hence will happily fail to correctly unmap
> > stuff. There is no documentation, and no test.
> >
> Thanks for the comment.
> How about adding a list of multiple mapping ranges to the maple tree
> entry/node when adding to the lookup, and later unmapping every
> range present in that list?
How will that work when the various mappings don't have the same size?
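To make the failure mode concrete, here is a minimal sketch against the
posted add_to_shadow_ipa_lookup() (the addresses, sizes and the pgt handle
are hypothetical, picked only for illustration):

/*
 * The L1 guest hypervisor maps the same canonical IPA into one shadow S2
 * twice, at different L2 IPAs and with different block sizes.
 */

/* 1st fault: 2MiB block, L2 IPA 0x40000000 backed by canonical IPA 0x80000000 */
add_to_shadow_ipa_lookup(pgt, 0x40000000, 0x80000000, SZ_2M);

/* 2nd fault: 4KiB page, L2 IPA 0xc0010000 backed by canonical IPA 0x80010000 */
add_to_shadow_ipa_lookup(pgt, 0xc0010000, 0x80010000, SZ_4K);

/*
 * Both shadow mappings translate through canonical IPA 0x80010000, but the
 * maple tree can hold only one shadow_ipa_map per canonical index: the
 * second mtree_store_range() replaces the first entry over that range. A
 * later kvm_nested_s2_unmap_range() of canonical 0x80010000 then only finds
 * the 4KiB entry and unmaps shadow IPA 0xc0010000, while the 2MiB shadow
 * block at 0x40000000 stays mapped to a page L0 believes is gone.
 */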
> I tested this patch on an AmpereOne system (2 NUMA nodes, 96 CPUs
> per node, NUMA balancing enabled) with large vCPU counts and large
> memory assigned to L1 and L2. The current full-address-space walk
> caused very large unmap/flush work and significant delays
> (exacerbated by NUMA balancing / page migration activity). The
> targeted unmap using the list per node removes only the affected
> mappings and reduces the unmap latency substantially in our workloads.
I really don't care how badly things perform. The important thing is
that this is architecturally correct, while your approach isn't.
> I booted multiple L1s, each hosting several L2s, and observed no
> panics or failures related to missing support for multiple-IPA
> mappings.
I'm sorry, but Linux isn't a validation tool for the architecture. You
have clearly designed something around Linux's own behaviour, not the
architectural requirements.
> If you have any test cases or scenarios that would validate support
> for multiple IPA mappings, could you please share them?
The onus is on *you* to provide them, not me.
I've repeatedly said that I didn't care about performance when we have
so little to validate the existing code. This still stands.
M.
--
Without deviation from the norm, progress is not possible.