Message-ID: <87ldhvdxio.wl-maz@kernel.org>
Date: Sun, 18 Jan 2026 10:29:35 +0000
From: Marc Zyngier <maz@...nel.org>
To: "Thomson, Jack" <jackabt.amazon@...il.com>
Cc: oliver.upton@...ux.dev,
	pbonzini@...hat.com,
	joey.gouly@....com,
	suzuki.poulose@....com,
	yuzenghui@...wei.com,
	catalin.marinas@....com,
	will@...nel.org,
	shuah@...nel.org,
	linux-arm-kernel@...ts.infradead.org,
	kvmarm@...ts.linux.dev,
	linux-kernel@...r.kernel.org,
	linux-kselftest@...r.kernel.org,
	isaku.yamahata@...el.com,
	xmarcalx@...zon.co.uk,
	kalyazin@...zon.co.uk,
	jackabt@...zon.com,
	Vladimir Murzin <vladimir.murzin@....com>
Subject: Re: [PATCH v4 1/3] KVM: arm64: Add pre_fault_memory implementation

On Fri, 16 Jan 2026 14:33:42 +0000,
"Thomson, Jack" <jackabt.amazon@...il.com> wrote:
> 
> 
> Hey Marc,
> 
> Thanks for the review.
> 
> On 15/01/2026 9:51 am, Marc Zyngier wrote:
> > [+ Vladimir, who was also looking at this patch]
> > 
> > On Tue, 13 Jan 2026 15:26:40 +0000,
> > Jack Thomson <jackabt.amazon@...il.com> wrote:
> >> 
> >> From: Jack Thomson <jackabt@...zon.com>
> >> 
> >> Add kvm_arch_vcpu_pre_fault_memory() for arm64. The implementation hands
> >> off the stage-2 faulting logic to either gmem_abort() or
> >> user_mem_abort().
> >> 
> >> Add an optional page_size output parameter to user_mem_abort() to
> >> return the VMA page size, which is needed when pre-faulting.
> >> 
> >> Update the documentation to clarify x86 specific behaviour.
> >> 
> >> Signed-off-by: Jack Thomson <jackabt@...zon.com>
> >> ---
> >>   Documentation/virt/kvm/api.rst |  3 +-
> >>   arch/arm64/kvm/Kconfig         |  1 +
> >>   arch/arm64/kvm/arm.c           |  1 +
> >>   arch/arm64/kvm/mmu.c           | 79 ++++++++++++++++++++++++++++++++--
> >>   4 files changed, 79 insertions(+), 5 deletions(-)
> >> 
> >> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> >> index 01a3abef8abb..44cfd9e736bb 100644
> >> --- a/Documentation/virt/kvm/api.rst
> >> +++ b/Documentation/virt/kvm/api.rst
> >> @@ -6493,7 +6493,8 @@ Errors:
> >>   KVM_PRE_FAULT_MEMORY populates KVM's stage-2 page tables used to map memory
> >>   for the current vCPU state.  KVM maps memory as if the vCPU generated a
> >>   stage-2 read page fault, e.g. faults in memory as needed, but doesn't break
> >> -CoW.  However, KVM does not mark any newly created stage-2 PTE as Accessed.
> >> +CoW.  However, on x86, KVM does not mark any newly created stage-2 PTE as
> >> +Accessed.
> >>   In the case of confidential VM types where there is an initial set up of
> >>   private guest memory before the guest is 'finalized'/measured, this ioctl
> >> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> >> index 4f803fd1c99a..6872aaabe16c 100644
> >> --- a/arch/arm64/kvm/Kconfig
> >> +++ b/arch/arm64/kvm/Kconfig
> >> @@ -25,6 +25,7 @@ menuconfig KVM
> >>   	select HAVE_KVM_CPU_RELAX_INTERCEPT
> >>   	select KVM_MMIO
> >>   	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
> >> +	select KVM_GENERIC_PRE_FAULT_MEMORY
> >>   	select VIRT_XFER_TO_GUEST_WORK
> >>   	select KVM_VFIO
> >>   	select HAVE_KVM_DIRTY_RING_ACQ_REL
> >> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> >> index 4f80da0c0d1d..19bac68f737f 100644
> >> --- a/arch/arm64/kvm/arm.c
> >> +++ b/arch/arm64/kvm/arm.c
> >> @@ -332,6 +332,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >>   	case KVM_CAP_COUNTER_OFFSET:
> >>   	case KVM_CAP_ARM_WRITABLE_IMP_ID_REGS:
> >>   	case KVM_CAP_ARM_SEA_TO_USER:
> >> +	case KVM_CAP_PRE_FAULT_MEMORY:
> >>   		r = 1;
> >>   		break;
> >>   	case KVM_CAP_SET_GUEST_DEBUG2:
> >> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> >> index 48d7c372a4cd..499b131f794e 100644
> >> --- a/arch/arm64/kvm/mmu.c
> >> +++ b/arch/arm64/kvm/mmu.c
> >> @@ -1642,8 +1642,8 @@ static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >>     static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >>   			  struct kvm_s2_trans *nested,
> >> -			  struct kvm_memory_slot *memslot, unsigned long hva,
> >> -			  bool fault_is_perm)
> >> +			  struct kvm_memory_slot *memslot, unsigned long *page_size,
> >> +			  unsigned long hva, bool fault_is_perm)
> >>   {
> >>   	int ret = 0;
> >>   	bool topup_memcache;
> >> @@ -1923,6 +1923,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >>   	kvm_release_faultin_page(kvm, page, !!ret, writable);
> >>   	kvm_fault_unlock(kvm);
> >>   +	if (page_size)
> >> +		*page_size = vma_pagesize;
> >> +
> >>   	/* Mark the page dirty only if the fault is handled successfully */
> >>   	if (writable && !ret)
> >>   		mark_page_dirty_in_slot(kvm, memslot, gfn);
> >> @@ -2196,8 +2199,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> >>   		ret = gmem_abort(vcpu, fault_ipa, nested, memslot,
> >>   				 esr_fsc_is_permission_fault(esr));
> >>   	else
> >> -		ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, hva,
> >> -				     esr_fsc_is_permission_fault(esr));
> >> +		ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, NULL,
> >> +				     hva, esr_fsc_is_permission_fault(esr));
> >>   	if (ret == 0)
> >>   		ret = 1;
> >>   out:
> >> @@ -2573,3 +2576,71 @@ void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled)
> >>     	trace_kvm_toggle_cache(*vcpu_pc(vcpu), was_enabled, now_enabled);
> >>   }
> >> +
> >> +long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> >> +				    struct kvm_pre_fault_memory *range)
> >> +{
> >> +	struct kvm_vcpu_fault_info *fault_info = &vcpu->arch.fault;
> >> +	struct kvm_s2_trans nested_trans, *nested = NULL;
> >> +	unsigned long page_size = PAGE_SIZE;
> >> +	struct kvm_memory_slot *memslot;
> >> +	phys_addr_t ipa = range->gpa;
> >> +	phys_addr_t end;
> >> +	hva_t hva;
> >> +	gfn_t gfn;
> >> +	int ret;
> >> +
> >> +	if (vcpu_is_protected(vcpu))
> >> +		return -EOPNOTSUPP;
> > 
> > This feels pretty odd. If you have advertised the capability, then
> > saying "not supported" at this stage is not on.
> > 
> 
> Thanks, good point. I think I can actually just drop this completely, since
> kvm_pvm_ext_allowed() would already exclude this capability.
>

I think you still need some runtime handling, just in case userspace
is acting silly.
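E.g. keep a guard along the lines of what you already have (only a
sketch, and whether -EOPNOTSUPP is still the right errno once the
capability isn't advertised for pVMs is debatable):

	/* Reject the ioctl outright should userspace call it anyway */
	if (vcpu_is_protected(vcpu))
		return -EOPNOTSUPP;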

> >> +
> >> +	/*
> >> +	 * We may prefault on a shadow stage 2 page table if we are
> >> +	 * running a nested guest.  In this case, we have to resolve the L2
> >> +	 * IPA to the L1 IPA first, before knowing what kind of memory should
> >> +	 * back the L1 IPA.
> >> +	 *
> >> +	 * If the shadow stage 2 page table walk faults, then we return
> >> +	 * -EFAULT
> >> +	 */
> >> +	if (kvm_is_nested_s2_mmu(vcpu->kvm, vcpu->arch.hw_mmu) &&
> >> +	    vcpu->arch.hw_mmu->nested_stage2_enabled) {
> >> +		ret = kvm_walk_nested_s2(vcpu, ipa, &nested_trans);
> >> +		if (ret)
> >> +			return -EFAULT;
> > 
> > And then what? Userspace is completely screwed here, with no way to
> > make any forward progress, because the L1 is in charge of that S2, and
> > L1 is not running. What's the outcome? Light a candle and pray?
> > 
> > Also, the IPA you are passing as a parameter means absolutely nothing
> > in the context of L2. Userspace doesn't have the faintest clue about
> > the memory map presented to L2, as that's L1 business. L1 can
> > absolutely present to L2 a memory map that doesn't have a single
> > address in common with its own.
> > 
> > So this really doesn't work at all.
> > 
> Would just returning -EOPNOTSUPP in this case, like:

Absolutely *not*. Userspace has no idea what the guest is doing, and
cannot influence it (other than disabling nesting altogether). This is
just as bad as -EFAULT.

> 
>   if (kvm_is_nested_s2_mmu(vcpu->kvm, vcpu->arch.hw_mmu) &&
>       vcpu->arch.hw_mmu->nested_stage2_enabled)
>     return -EOPNOTSUPP;
>
> be the best way to continue for now?

We both know that what you actually mean is "this doesn't match my use
case, let someone else deal with it". To which my answer is that you
either fully support pre-faulting, or you don't at all. There is no
middle ground.

> >> +
> >> +		ipa = kvm_s2_trans_output(&nested_trans);
> >> +		nested = &nested_trans;
> >> +	}
> >> +
> >> +	if (ipa >= kvm_phys_size(vcpu->arch.hw_mmu))
> >> +		return -ENOENT;
> >> +
> >> +	/* Generate a synthetic abort for the pre-fault address */
> >> +	fault_info->esr_el2 = (ESR_ELx_EC_DABT_LOW << ESR_ELx_EC_SHIFT) |
> >> +		ESR_ELx_FSC_FAULT_L(KVM_PGTABLE_LAST_LEVEL);
> > 
> > Why level 3? You must present a fault that matches the level at which
> > the emulated fault would actually occur, because the rest of the
> > infrastructure relies on that (at least on the permission path, and
> > more to come).
> > 
> 
> Ack, thanks. I was relying on the fact that `fault_is_perm` was hardcoded to
> false. I'll replace it with something like:
> 
>   pgt = vcpu->arch.hw_mmu->pgt;
>   ret = kvm_pgtable_get_leaf(pgt, gpa, &pte, &level);
>   if (ret)
>     return ret;
>
>   fault_info->esr_el2 = (ESR_ELx_EC_DABT_LOW << ESR_ELx_EC_SHIFT) |
>     ESR_ELx_FSC_FAULT_L(level);
>   fault_info->hpfar_el2 = HPFAR_EL2_NS |
>     FIELD_PREP(HPFAR_EL2_FIPA, gpa >> 12);

If a mapping exists, you probably don't want to replay the fault. And
this needs to occur while the mmu_lock is held.
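Roughly (an untested sketch, assuming the existing kvm_fault_lock()/
kvm_fault_unlock() helpers and kvm_pgtable_get_leaf(); how the
"already mapped" case should feed back into the byte count returned
to userspace needs checking against the generic code):

	kvm_pte_t pte;
	s8 level;

	kvm_fault_lock(vcpu->kvm);
	ret = kvm_pgtable_get_leaf(vcpu->arch.hw_mmu->pgt, ipa, &pte, &level);
	if (!ret && kvm_pte_valid(pte)) {
		/*
		 * Already mapped at some level: skip the synthetic abort
		 * and report the existing granule, e.g.:
		 */
		page_size = kvm_granule_size(level);
		kvm_fault_unlock(vcpu->kvm);
		return min_t(u64, range->size,
			     page_size - (ipa & (page_size - 1)));
	}
	kvm_fault_unlock(vcpu->kvm);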

> 
> > Taking a step back on all this, 90% of the problems are there because
> > you are trying to support prefaulting a guest that is already running.
> > If you limited this to actually *pre*-faulting the guest, it would be
> > the easiest thing ever, and wouldn't suffer from any of the above (you
> > can't be in a nested context if you haven't run).
> > 
> > What prevents you from doing so? I'm perfectly happy to make this a
> > separate API if this contradicts other implementations. Or are you
> > relying on other side effects of the "already running" state?
> 
> We would need this to work on an already running guest.

Then you need to fully support pre-faulting the guest even when it is
in a nested context, without resorting to copping out in situations
that do not match your narrow use-case. Which means populating the
canonical s2_mmu even when you're not in that particular context.
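Handwaving a bit, the shape would be to always target the canonical
stage-2 (purely to illustrate the direction; the fault path plumbing
obviously needs more than a local pointer):

	/* Pre-fault the canonical stage-2, whatever hw_mmu currently is */
	struct kvm_s2_mmu *mmu = &vcpu->kvm->arch.mmu;

	if (ipa >= kvm_phys_size(mmu))
		return -ENOENT;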

Thanks,

	M.

-- 
Jazz isn't dead. It just smells funny.
