Message-ID: <808b30b8-a85a-4fe3-be49-bf11414cef93@arm.com>
Date: Thu, 16 Oct 2025 15:01:14 +0100
From: Suzuki K Poulose <suzuki.poulose@....com>
To: Jack Thomson <jackabt.amazon@...il.com>, maz@...nel.org,
 oliver.upton@...ux.dev, pbonzini@...hat.com
Cc: joey.gouly@....com, yuzenghui@...wei.com, catalin.marinas@....com,
 will@...nel.org, shuah@...nel.org, linux-arm-kernel@...ts.infradead.org,
 kvmarm@...ts.linux.dev, linux-kernel@...r.kernel.org,
 linux-kselftest@...r.kernel.org, isaku.yamahata@...el.com,
 roypat@...zon.co.uk, kalyazin@...zon.co.uk, jackabt@...zon.com
Subject: Re: [PATCH v2 1/4] KVM: arm64: Add pre_fault_memory implementation

Hi

On 13/10/2025 16:14, Jack Thomson wrote:
> From: Jack Thomson <jackabt@...zon.com>
> 
> Add kvm_arch_vcpu_pre_fault_memory() for arm64. The implementation hands
> off the stage-2 faulting logic to either gmem_abort() or
> user_mem_abort().
> 
> Add an optional page_size output parameter to user_mem_abort() to
> return the VMA page size, which is needed when pre-faulting.
> 
> Update the documentation to clarify x86 specific behaviour.

Thanks for the patch! Do we care about faulting beyond the requested
range? I understand this is not a problem for anything that is not
backed by gmem (which might change with hugetlbfs support) or for
normal VMs. But for CoCo VMs this might affect the measurement, or
even cause a failure in "pre-faulting" because of the extra security
checks (e.g., trying to fault the same range in twice, because it is
backed by, say, a 1G page).
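
To make that concrete with some made-up numbers, using the clamping at
the end of kvm_arch_vcpu_pre_fault_memory() below: say userspace asks
for 4K at range->gpa = 0x40100000 and the slot is backed by a 1G page,
so user_mem_abort() ends up with vma_pagesize = 0x40000000:

	end = (0x40100000 & ~(0x40000000 - 1)) + 0x40000000 = 0x80000000;
	ret = min(0x1000, 0x80000000 - 0x40100000)          = 0x1000;

i.e., the caller is told 4K was faulted in, but stage-2 has mapped the
whole 1G block at 0x40000000, so a later pre-fault call into that
block lands on a range that is already mapped.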

Of course, these could be addressed in a separate patch, when this
becomes a real requirement.

One way to solve this could be to pass the "pagesize" in as an input
parameter, which would force the backend to limit the vma_pagesize
that gets used for the stage-2 mapping (see the sketch below).
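
Something like the below (untested, just to sketch the idea; the
caller would then seed *page_size with the largest mapping size it is
willing to accept, rather than treating it as output-only):

	static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
				  struct kvm_s2_trans *nested,
				  struct kvm_memory_slot *memslot, long *page_size,
				  unsigned long hva, bool fault_is_perm)
	{
		...
		/* hypothetical clamp, once vma_pagesize has been computed */
		if (page_size && *page_size && vma_pagesize > *page_size)
			vma_pagesize = *page_size;
		...
	}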

> 
> Signed-off-by: Jack Thomson <jackabt@...zon.com>
> ---
>   Documentation/virt/kvm/api.rst |  3 +-
>   arch/arm64/kvm/Kconfig         |  1 +
>   arch/arm64/kvm/arm.c           |  1 +
>   arch/arm64/kvm/mmu.c           | 73 ++++++++++++++++++++++++++++++++--
>   4 files changed, 73 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index c17a87a0a5ac..9e8cc4eb505d 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6461,7 +6461,8 @@ Errors:
>   KVM_PRE_FAULT_MEMORY populates KVM's stage-2 page tables used to map memory
>   for the current vCPU state.  KVM maps memory as if the vCPU generated a
>   stage-2 read page fault, e.g. faults in memory as needed, but doesn't break
> -CoW.  However, KVM does not mark any newly created stage-2 PTE as Accessed.
> +CoW.  However, on x86, KVM does not mark any newly created stage-2 PTE as
> +Accessed.
>   
>   In the case of confidential VM types where there is an initial set up of
>   private guest memory before the guest is 'finalized'/measured, this ioctl
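
As an aside, for anyone testing this: from userspace the ioctl is
driven roughly as below (illustrative values; vcpu_fd is assumed to be
an initialized vCPU fd, and on -EAGAIN/-EINTR the leftover gpa/size
are passed back, so the call can simply be retried):

	struct kvm_pre_fault_memory range = {
		.gpa  = 0x40000000,	/* illustrative guest address */
		.size = 0x200000,	/* 2M */
	};
	int ret;

	do {
		ret = ioctl(vcpu_fd, KVM_PRE_FAULT_MEMORY, &range);
	} while (ret == -1 && (errno == EAGAIN || errno == EINTR));
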
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index bff62e75d681..1ac0605f86cb 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -25,6 +25,7 @@ menuconfig KVM
>   	select HAVE_KVM_CPU_RELAX_INTERCEPT
>   	select KVM_MMIO
>   	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
> +	select KVM_GENERIC_PRE_FAULT_MEMORY
>   	select KVM_XFER_TO_GUEST_WORK
>   	select KVM_VFIO
>   	select HAVE_KVM_DIRTY_RING_ACQ_REL
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 888f7c7abf54..65654a742864 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -322,6 +322,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>   	case KVM_CAP_IRQFD_RESAMPLE:
>   	case KVM_CAP_COUNTER_OFFSET:
>   	case KVM_CAP_ARM_WRITABLE_IMP_ID_REGS:
> +	case KVM_CAP_PRE_FAULT_MEMORY:
>   		r = 1;
>   		break;
>   	case KVM_CAP_SET_GUEST_DEBUG2:
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index a36426ccd9b5..82f122e4b08c 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1597,8 +1597,8 @@ static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>   
>   static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>   			  struct kvm_s2_trans *nested,
> -			  struct kvm_memory_slot *memslot, unsigned long hva,
> -			  bool fault_is_perm)
> +			  struct kvm_memory_slot *memslot, long *page_size,
> +			  unsigned long hva, bool fault_is_perm)
>   {
>   	int ret = 0;
>   	bool topup_memcache;
> @@ -1871,6 +1871,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>   	kvm_release_faultin_page(kvm, page, !!ret, writable);
>   	kvm_fault_unlock(kvm);
>   
> +	if (page_size)
> +		*page_size = vma_pagesize;
> +
>   	/* Mark the page dirty only if the fault is handled successfully */
>   	if (writable && !ret)
>   		mark_page_dirty_in_slot(kvm, memslot, gfn);
> @@ -2069,8 +2072,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>   		ret = gmem_abort(vcpu, fault_ipa, nested, memslot,
>   				 esr_fsc_is_permission_fault(esr));
>   	else
> -		ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, hva,
> -				     esr_fsc_is_permission_fault(esr));
> +		ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, NULL,
> +				     hva, esr_fsc_is_permission_fault(esr));
>   	if (ret == 0)
>   		ret = 1;
>   out:
> @@ -2446,3 +2449,65 @@ void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled)
>   
>   	trace_kvm_toggle_cache(*vcpu_pc(vcpu), was_enabled, now_enabled);
>   }
> +
> +long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> +				    struct kvm_pre_fault_memory *range)
> +{
> +	int ret, idx;
> +	hva_t hva;
> +	phys_addr_t end;
> +	struct kvm_memory_slot *memslot;
> +	struct kvm_vcpu_fault_info stored_fault, *fault_info;
> +
> +	long page_size = PAGE_SIZE;
> +	phys_addr_t ipa = range->gpa;
> +	gfn_t gfn = gpa_to_gfn(range->gpa);
> +
> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> +
> +	if (ipa >= kvm_phys_size(vcpu->arch.hw_mmu)) {
> +		ret = -ENOENT;
> +		goto out_unlock;
> +	}
> +
> +	memslot = gfn_to_memslot(vcpu->kvm, gfn);
> +	if (!memslot) {
> +		ret = -ENOENT;
> +		goto out_unlock;
> +	}
> +
> +	fault_info = &vcpu->arch.fault;
> +	stored_fault = *fault_info;
> +
> +	/* Generate a synthetic abort for the pre-fault address */
> +	fault_info->esr_el2 = FIELD_PREP(ESR_ELx_EC_MASK, ESR_ELx_EC_DABT_CUR);

minor nit: Any reason why we don't use ESR_ELx_EC_DABT_LOW? We always
get that for a data abort from the guest (see the one-liner below).
Otherwise, this looks good to me.
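
That is, something like:

	fault_info->esr_el2 = FIELD_PREP(ESR_ELx_EC_MASK, ESR_ELx_EC_DABT_LOW);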

Suzuki


> +	fault_info->esr_el2 &= ~ESR_ELx_ISV;
> +	fault_info->esr_el2 |= ESR_ELx_FSC_FAULT_L(KVM_PGTABLE_LAST_LEVEL);
> +
> +	fault_info->hpfar_el2 = HPFAR_EL2_NS |
> +		FIELD_PREP(HPFAR_EL2_FIPA, ipa >> 12);
> +
> +	if (kvm_slot_has_gmem(memslot)) {
> +		ret = gmem_abort(vcpu, ipa, NULL, memslot, false);
> +	} else {
> +		hva = gfn_to_hva_memslot_prot(memslot, gfn, NULL);
> +		if (kvm_is_error_hva(hva)) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +		ret = user_mem_abort(vcpu, ipa, NULL, memslot, &page_size, hva,
> +				     false);
> +	}
> +
> +	if (ret < 0)
> +		goto out;
> +
> +	end = (range->gpa & ~(page_size - 1)) + page_size;
> +	ret = min(range->size, end - range->gpa);
> +
> +out:
> +	*fault_info = stored_fault;
> +out_unlock:
> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +	return ret;
> +}