Message-ID: <aXov3WWozd2UIFXw@google.com>
Date: Wed, 28 Jan 2026 07:48:45 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: Kevin Cheng <chengkev@...gle.com>, pbonzini@...hat.com, kvm@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/3] KVM: SVM: Fix nested NPF injection to set PFERR_GUEST_{PAGE,FINAL}_MASK

On Thu, Jan 22, 2026, Yosry Ahmed wrote:
> On Wed, Jan 21, 2026 at 02:07:56PM -0800, Sean Christopherson wrote:
> > On Wed, Jan 21, 2026, Kevin Cheng wrote:
> > > When KVM emulates an instruction for L2 and encounters a nested page
> > > fault (e.g., during string I/O emulation), nested_svm_inject_npf_exit()
> > > injects an NPF to L1. However, the code incorrectly hardcodes
> > > (1ULL << 32) for exit_info_1's upper bits when the original exit was
> > > not an NPF. This always sets PFERR_GUEST_FINAL_MASK even when the fault
> > > occurred on a page table page, preventing L1 from correctly identifying
> > > the cause of the fault.
> > > 
> > > Set PFERR_GUEST_PAGE_MASK in the error code when a nested page fault
> > > occurs during a guest page table walk, and PFERR_GUEST_FINAL_MASK when
> > > the fault occurs on the final GPA-to-HPA translation.
> > > 
> > > Widen error_code in struct x86_exception from u16 to u64 to accommodate
> > > the PFERR_GUEST_* bits (bits 32 and 33).
> > 
> > Please do this in a separate patch.  Intel CPUs straight up don't support 32-bit
> > error codes, let alone 64-bit error codes, so this seemingly innocuous change
> > needs to be accompanied by a lengthy changelog that effectively audits all usage
> > to "prove" this change is ok.
> 
> Semi-jokingly, we can add error_code_hi to track the high bits and
> side-step the problem for Intel (dejavu?).

Technically, it would require three fields: u16 error_code, u16 error_code_hi,
and u32 error_code_ultra_hi.  :-D

Isolating the (ultra) hi flags is very tempting, but I worry that it would lead
to long term pain, e.g. because inevitably we'll forget to grab the hi flags at
some point.  I'd rather audit the current code and ensure that KVM truncates the
error code as needed.

VMX is probably a-ok, e.g. see commit eba9799b5a6e ("KVM: VMX: Drop bits 31:16
when shoving exception error code into VMCS").  I'd be more worred SVM, where
it's legal to shove a 32-bit value into the error code, i.e. where KVM might not
have existing explicit truncation.
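
As a rough sketch of what "explicit truncation" could look like on the SVM side (the
variable name below is purely illustrative, not the actual injection path):

	/*
	 * Illustrative sketch only: when the destination is an
	 * architecturally 32-bit (or 16-bit) field, truncate the 64-bit
	 * error code explicitly so the narrowing is obvious to readers
	 * and auditors, rather than relying on implicit conversion.
	 */
	u32 hw_error_code = lower_32_bits(fault->error_code);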

> > > Update nested_svm_inject_npf_exit() to use fault->error_code directly
> > > instead of hardcoding the upper bits. Also add a WARN_ON_ONCE if neither
> > > PFERR_GUEST_FINAL_MASK nor PFERR_GUEST_PAGE_MASK is set, as this would
> > > indicate a bug in the page fault handling code.
> > > 
> > > Signed-off-by: Kevin Cheng <chengkev@...gle.com>
> [..]
> > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > > index de90b104a0dd5..f8dfd5c333023 100644
> > > --- a/arch/x86/kvm/svm/nested.c
> > > +++ b/arch/x86/kvm/svm/nested.c
> > > @@ -40,18 +40,17 @@ static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
> > >  	struct vmcb *vmcb = svm->vmcb;
> > >  
> > >  	if (vmcb->control.exit_code != SVM_EXIT_NPF) {
> > > -		/*
> > > -		 * TODO: track the cause of the nested page fault, and
> > > -		 * correctly fill in the high bits of exit_info_1.
> > > -		 */
> > > -		vmcb->control.exit_code = SVM_EXIT_NPF;
> > > -		vmcb->control.exit_info_1 = (1ULL << 32);
> > > +		vmcb->control.exit_info_1 = fault->error_code;
> > >  		vmcb->control.exit_info_2 = fault->address;
> > >  	}
> > >  
> > > +	vmcb->control.exit_code = SVM_EXIT_NPF;
> > >  	vmcb->control.exit_info_1 &= ~0xffffffffULL;
> > >  	vmcb->control.exit_info_1 |= fault->error_code;
> > 
> > So... what happens when exit_info_1 already has PFERR_GUEST_PAGE_MASK, and then
> > @fault sets PFERR_GUEST_FINAL_MASK?  Presumably that can't/shouldn't happen,
> > but nothing in the changelog explains why such a scenario is
> > impossible, and nothing in the code hardens KVM against such goofs.
> 
> I guess we can update the WARN below to check for that as well, and
> fallback to the current behavior (set PFERR_GUEST_FINAL_MASK):
> 
> 	fault_stage = vmcb->control.exit_info_1 &
> 			(PFERR_GUEST_FINAL_MASK | PFERR_GUEST_PAGE_MASK);
> 	if (WARN_ON_ONCE(fault_stage != PFERR_GUEST_FINAL_MASK &&
> 			 fault_stage != PFERR_GUEST_PAGE_MASK))
> 		vmcb->control.exit_info_1 |= PFERR_GUEST_FINAL_MASK;

Except that doesn't do the right thing if both bits are set.  And we can use
hweight64(), which is a single POPCNT on modern CPUs.

Might be easiest to add something like PFERR_GUEST_FAULT_STAGE_MASK, then do:

	/*
	 * All nested page faults should be annotated as occurring on the final
	 * translation *OR* the page walk.  Arbitrarily choose "final" if KVM
	 * is buggy and enumerated both or none.
	 */
	if (WARN_ON_ONCE(hweight64(vmcb->control.exit_info_1 &
				   PFERR_GUEST_FAULT_STAGE_MASK) != 1)) {
		vmcb->control.exit_info_1 &= ~PFERR_GUEST_FAULT_STAGE_MASK;
		vmcb->control.exit_info_1 |= PFERR_GUEST_FINAL_MASK;
	}
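
For reference, the new mask would presumably just be the pair of existing bits, along
the lines of (untested sketch):

	/* Hypothetical definition, following the existing PFERR_* naming. */
	#define PFERR_GUEST_FAULT_STAGE_MASK \
		(PFERR_GUEST_FINAL_MASK | PFERR_GUEST_PAGE_MASK)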
