Message-ID: <CAE6NW_YexKSp19uATMQschZbbvon=Cdhv4EH6tRf-FNzgtL6ew@mail.gmail.com>
Date: Wed, 4 Feb 2026 11:22:17 -0500
From: Kevin Cheng <chengkev@...gle.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Yosry Ahmed <yosry.ahmed@...ux.dev>, pbonzini@...hat.com, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/3] KVM: SVM: Fix nested NPF injection to set PFERR_GUEST_{PAGE,FINAL}_MASK
On Wed, Jan 28, 2026 at 10:48 AM Sean Christopherson <seanjc@...gle.com> wrote:
>
> On Thu, Jan 22, 2026, Yosry Ahmed wrote:
> > On Wed, Jan 21, 2026 at 02:07:56PM -0800, Sean Christopherson wrote:
> > > On Wed, Jan 21, 2026, Kevin Cheng wrote:
> > > > When KVM emulates an instruction for L2 and encounters a nested page
> > > > fault (e.g., during string I/O emulation), nested_svm_inject_npf_exit()
> > > > injects an NPF to L1. However, the code incorrectly hardcodes
> > > > (1ULL << 32) for exit_info_1's upper bits when the original exit was
> > > > not an NPF. This always sets PFERR_GUEST_FINAL_MASK even when the fault
> > > > occurred on a page table page, preventing L1 from correctly identifying
> > > > the cause of the fault.
> > > >
> > > > Set PFERR_GUEST_PAGE_MASK in the error code when a nested page fault
> > > > occurs during a guest page table walk, and PFERR_GUEST_FINAL_MASK when
> > > > the fault occurs on the final GPA-to-HPA translation.
> > > >
> > > > Widen error_code in struct x86_exception from u16 to u64 to accommodate
> > > > the PFERR_GUEST_* bits (bits 32 and 33).
> > >
> > > Please do this in a separate patch. Intel CPUs straight up don't support 32-bit
> > > error codes, let alone 64-bit error codes, so this seemingly innocuous change
> > > needs to be accompanied by a lengthy changelog that effectively audits all usage
> > > to "prove" this change is ok.
> >
> > Semi-jokingly, we can add error_code_hi to track the high bits and
> > side-step the problem for Intel (dejavu?).
>
> Technically, it would require three fields: u16 error_code, u16 error_code_hi,
> and u32 error_code_ultra_hi. :-D
>
> Isolating the (ultra) hi flags is very tempting, but I worry that it would lead
> to long term pain, e.g. because inevitably we'll forget to grab the hi flags at
> some point. I'd rather audit the current code and ensure that KVM truncates the
> error code as needed.
>
> VMX is probably a-ok, e.g. see commit eba9799b5a6e ("KVM: VMX: Drop bits 31:16
> when shoving exception error code into VMCS"). I'd be more worried about SVM,
> where
> it's legal to shove a 32-bit value into the error code, i.e. where KVM might not
> have existing explicit truncation.

As I understand it, Intel CPUs don't allow setting bits 31:16 of the
error code, but AMD CPUs do allow bits 31:16 to be set. The
x86_exception error_code field is currently a u16, so it is always
truncated to 16 bits by default. In that case, after widening the
error code to 64 bits, do I need to ensure that every use of the
error code that isn't for NPF truncates it to 16 bits? Or do I just
need to verify that all SVM uses of the error_code for exceptions
truncate the 64 bits down to 32 bits, and all VMX uses truncate to
16 bits?

Just wanted to clarify, because the wording of that statement has me
thinking that maybe there is something wrong with 32-bit error codes
for SVM?

If the only user of the widened field is NPF, wouldn't it be better
to go with an additional field like Yosry suggested? (I see that VMX
already has the added exit_qualification field in the struct.)
>
> > > > Update nested_svm_inject_npf_exit() to use fault->error_code directly
> > > > instead of hardcoding the upper bits. Also add a WARN_ON_ONCE if neither
> > > > PFERR_GUEST_FINAL_MASK nor PFERR_GUEST_PAGE_MASK is set, as this would
> > > > indicate a bug in the page fault handling code.
> > > >
> > > > Signed-off-by: Kevin Cheng <chengkev@...gle.com>
> > [..]
> > > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > > > index de90b104a0dd5..f8dfd5c333023 100644
> > > > --- a/arch/x86/kvm/svm/nested.c
> > > > +++ b/arch/x86/kvm/svm/nested.c
> > > > @@ -40,18 +40,17 @@ static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
> > > >  	struct vmcb *vmcb = svm->vmcb;
> > > >
> > > >  	if (vmcb->control.exit_code != SVM_EXIT_NPF) {
> > > > -		/*
> > > > -		 * TODO: track the cause of the nested page fault, and
> > > > -		 * correctly fill in the high bits of exit_info_1.
> > > > -		 */
> > > > -		vmcb->control.exit_code = SVM_EXIT_NPF;
> > > > -		vmcb->control.exit_info_1 = (1ULL << 32);
> > > > +		vmcb->control.exit_info_1 = fault->error_code;
> > > >  		vmcb->control.exit_info_2 = fault->address;
> > > >  	}
> > > >
> > > > +	vmcb->control.exit_code = SVM_EXIT_NPF;
> > > >  	vmcb->control.exit_info_1 &= ~0xffffffffULL;
> > > >  	vmcb->control.exit_info_1 |= fault->error_code;
> > >
> > > So... what happens when exit_info_1 already has PFERR_GUEST_PAGE_MASK, and then
> > > @fault sets PFERR_GUEST_FINAL_MASK? Presumably that can't/shouldn't happen,
> > > but nothing in the changelog explains why such a scenario is
> > > impossible, and nothing in the code hardens KVM against such goofs.
> >
> > I guess we can update the WARN below to check for that as well, and
> > fallback to the current behavior (set PFERR_GUEST_FINAL_MASK):
> >
> > 	fault_stage = vmcb->control.exit_info_1 &
> > 		      (PFERR_GUEST_FINAL_MASK | PFERR_GUEST_PAGE_MASK);
> > 	if (WARN_ON_ONCE(fault_stage != PFERR_GUEST_FINAL_MASK &&
> > 			 fault_stage != PFERR_GUEST_PAGE_MASK))
> > 		vmcb->control.exit_info_1 |= PFERR_GUEST_FINAL_MASK;
>
> Except that doesn't do the right thing if both bits are set. And we can use
> hweight64(), which is a single POPCNT on modern CPUs.
>
> Might be easiest to add something like PFERR_GUEST_FAULT_STAGE_MASK, then do:
>
> 	/*
> 	 * All nested page faults should be annotated as occurring on the final
> 	 * translation *OR* the page walk.  Arbitrarily choose "final" if KVM
> 	 * is buggy and enumerated both or none.
> 	 */
> 	if (WARN_ON_ONCE(hweight64(vmcb->control.exit_info_1 &
> 				   PFERR_GUEST_FAULT_STAGE_MASK) != 1)) {
> 		vmcb->control.exit_info_1 &= ~PFERR_GUEST_FAULT_STAGE_MASK;
> 		vmcb->control.exit_info_1 |= PFERR_GUEST_FINAL_MASK;
> 	}
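
That check makes sense to me. Just to confirm my understanding, I
assume PFERR_GUEST_FAULT_STAGE_MASK would simply be the two stage
bits OR'd together, i.e. something like (name per your suggestion,
not an existing define):

	#define PFERR_GUEST_FAULT_STAGE_MASK \
		(PFERR_GUEST_PAGE_MASK | PFERR_GUEST_FINAL_MASK)

so that hweight64() of the masked exit_info_1 being anything other
than 1 catches both the "both bits set" and "neither bit set" cases.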