Message-ID: <0AA5A319-C4FC-4EEB-9317-BDC9E2E2E703@nutanix.com>
Date: Fri, 31 Oct 2025 17:58:29 +0000
From: Jon Kohler <jon@...anix.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Paolo Bonzini <pbonzini@...hat.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 4/4] KVM: x86: Load guest/host PKRU outside of the fastpath run loop



> On Oct 30, 2025, at 6:42 PM, Sean Christopherson <seanjc@...gle.com> wrote:
> 
> Move KVM's swapping of PKRU outside of the fastpath loop, as there is no
> KVM code anywhere in the fastpath that accesses guest/userspace memory,
> i.e. that can consume protection keys.
> 
> As documented by commit 1be0e61c1f25 ("KVM, pkeys: save/restore PKRU when
> guest/host switches"), KVM just needs to ensure the host's PKRU is loaded
> when KVM (or the kernel at-large) may access userspace memory.  And at the
> time of commit 1be0e61c1f25, KVM didn't have a fastpath, and PKU was
> strictly contained to VMX, i.e. there was no reason to swap PKRU outside
> of vmx_vcpu_run().
> 
> Over time, the "need" to swap PKRU close to VM-Enter was likely falsely
> solidified by the association with XFEATUREs in commit 37486135d3a7
> ("KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c"),
> and XFEATURE swapping was in turn moved close to VM-Enter/VM-Exit as a
> KVM hack-a-fix solution for an #MC handler bug by commit 1811d979c716
> ("x86/kvm: move kvm_load/put_guest_xcr0 into atomic context").
> 
> Deferring the PKRU loads shaves ~40 cycles off the fastpath for Intel,
> and ~60 cycles for AMD.  E.g. using INVD in KVM-Unit-Test's vmexit.c,
> with extra hacks to enable CR4.PKE and PKRU=(-1u & ~0x3), latency numbers
> for AMD Turin go from ~1560 => ~1500, and for Intel Emerald Rapids, go
> from ~810 => ~770.
> 
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> ---
> arch/x86/kvm/svm/svm.c |  2 --
> arch/x86/kvm/vmx/vmx.c |  4 ----
> arch/x86/kvm/x86.c     | 14 ++++++++++----
> arch/x86/kvm/x86.h     |  2 --
> 4 files changed, 10 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index e8b158f73c79..e1fb853c263c 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4260,7 +4260,6 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
> svm_set_dr6(vcpu, DR6_ACTIVE_LOW);
> 
> clgi();
> - kvm_load_guest_xsave_state(vcpu);
> 
> /*
> * Hardware only context switches DEBUGCTL if LBR virtualization is
> @@ -4303,7 +4302,6 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
>    vcpu->arch.host_debugctl != svm->vmcb->save.dbgctl)
> update_debugctlmsr(vcpu->arch.host_debugctl);
> 
> - kvm_load_host_xsave_state(vcpu);
> stgi();
> 
> /* Any pending NMI will happen here */
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 123dae8cf46b..55d637cea84a 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7465,8 +7465,6 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
> if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
> vmx_set_interrupt_shadow(vcpu, 0);
> 
> - kvm_load_guest_xsave_state(vcpu);
> -
> pt_guest_enter(vmx);
> 
> atomic_switch_perf_msrs(vmx);
> @@ -7510,8 +7508,6 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
> 
> pt_guest_exit(vmx);
> 
> - kvm_load_host_xsave_state(vcpu);
> -
> if (is_guest_mode(vcpu)) {
> /*
> * Track VMLAUNCH/VMRESUME that have made past guest state
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b5c2879e3330..6924006f0796 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1233,7 +1233,7 @@ static void kvm_load_host_xfeatures(struct kvm_vcpu *vcpu)
> }
> }
> 
> -void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
> +static void kvm_load_guest_pkru(struct kvm_vcpu *vcpu)
> {
> if (vcpu->arch.guest_state_protected)
> return;
> @@ -1244,9 +1244,8 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>     kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE)))
> wrpkru(vcpu->arch.pkru);
> }
> -EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_load_guest_xsave_state);
> 
> -void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
> +static void kvm_load_host_pkru(struct kvm_vcpu *vcpu)
> {
> if (vcpu->arch.guest_state_protected)
> return;
> @@ -1259,7 +1258,6 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
> wrpkru(vcpu->arch.host_pkru);
> }
> }
> -EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_load_host_xsave_state);
> 
> #ifdef CONFIG_X86_64
> static inline u64 kvm_guest_supported_xfd(struct kvm_vcpu *vcpu)
> @@ -11331,6 +11329,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> 
> guest_timing_enter_irqoff();
> 
> + /*
> + * Swap PKRU with hardware breakpoints disabled to minimize the number
> + * of flows where non-KVM code can run with guest state loaded.
> + */
> + kvm_load_guest_pkru(vcpu);
> +

I was mocking this up after PUCK and went down a similar-ish path, but was
thinking it might be interesting to have x86 ops called something to the effect of
“prepare_switch_to_guest_irqoff” and “prepare_switch_to_host_irqoff”, which
would give us a place to nestle any other “needs to be done in atomic context but
doesn’t need to be done in the fastpath” sort of work (if any); a rough sketch is below.
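
Purely as an untested strawman (hand-waving over the static_call/KVM_X86_OP
plumbing, and only to show where the hooks would sit relative to your new
kvm_load_guest_pkru()/kvm_load_host_pkru() calls), I was picturing roughly:

	/* New optional members in struct kvm_x86_ops: */
	void (*prepare_switch_to_guest_irqoff)(struct kvm_vcpu *vcpu);
	void (*prepare_switch_to_host_irqoff)(struct kvm_vcpu *vcpu);

	/* In vcpu_enter_guest(), bracketing the fastpath run loop: */
	guest_timing_enter_irqoff();

	kvm_load_guest_pkru(vcpu);
	if (kvm_x86_ops.prepare_switch_to_guest_irqoff)
		kvm_x86_ops.prepare_switch_to_guest_irqoff(vcpu);

	for (;;) {
		/* ... existing fastpath run loop, unchanged ... */
	}

	if (kvm_x86_ops.prepare_switch_to_host_irqoff)
		kvm_x86_ops.prepare_switch_to_host_irqoff(vcpu);
	kvm_load_host_pkru(vcpu);

The PKRU swap itself could arguably stay in common x86 code exactly as you have
it; the hooks would just be a landing spot for any vendor-specific “atomic but not
per-iteration” work that shows up later.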

One other candidate that caught my eye was the CR3 handling that was moved out a
while ago, but then moved back by commit 1a7158101.

I haven’t gone through absolutely everything else in that tight loop code (and didn’t
get a chance to do the same for SVM code), but figured I’d put the idea out there
to see what you think.

To be clear, I’m totally OK with the series as-is; I’m just thinking about possible ways
to incrementally optimize here in the future.

> for (;;) {
> /*
> * Assert that vCPU vs. VM APICv state is consistent.  An APICv
> @@ -11359,6 +11363,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> ++vcpu->stat.exits;
> }
> 
> + kvm_load_host_pkru(vcpu);
> +
> /*
> * Do this here before restoring debug registers on the host.  And
> * since we do this before handling the vmexit, a DR access vmexit
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index f3dc77f006f9..24c754b0db2e 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -622,8 +622,6 @@ static inline void kvm_machine_check(void)
> #endif
> }
> 
> -void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu);
> -void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu);
> int kvm_spec_ctrl_test_value(u64 value);
> int kvm_handle_memory_failure(struct kvm_vcpu *vcpu, int r,
>      struct x86_exception *e);
> -- 
> 2.51.1.930.gacf6e81ea2-goog
> 
