Message-ID: <aQUhkDOkQJJceH4N@google.com>
Date: Fri, 31 Oct 2025 20:52:32 +0000
From: Sean Christopherson <seanjc@...gle.com>
To: Jon Kohler <jon@...anix.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 4/4] KVM: x86: Load guest/host PKRU outside of the
 fastpath run loop

On Fri, Oct 31, 2025, Jon Kohler wrote:
> > On Oct 30, 2025, at 6:42 PM, Sean Christopherson <seanjc@...gle.com> wrote:
> > + /*
> > + * Swap PKRU with hardware breakpoints disabled to minimize the number
> > + * of flows where non-KVM code can run with guest state loaded.
> > + */
> > + kvm_load_guest_pkru(vcpu);
> > +
> 
> I was mocking this up after PUCK, and went down a similar-ish path, but was
> thinking it might be interesting to have an x86 op called something to the effect of
> “prepare_switch_to_guest_irqoff” and “prepare_switch_to_host_irqoff”, which
> might make for a place to nestle any other sort of “needs to be done in atomic
> context but doesn’t need to be done in the fast path” sort of stuff (if any).

Hmm, I would say I'm flat out opposed to generic hooks of that nature.  For
anything that _needs_ to be modified with IRQs disabled, the ordering will matter
greatly.  E.g. we already have kvm_x86_ops.sync_pir_to_irr(), and that _must_ run
before kvm_vcpu_exit_request() if it triggers a late request.
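
To make the ordering concrete, that dependency looks roughly like this (heavily
simplified sketch, not the actual vcpu_enter_guest() code):

	/*
	 * Sketch only: with IRQs already disabled, posted interrupts must be
	 * synced into the IRR *before* checking for exit requests, because
	 * the sync itself can raise a late request (e.g. KVM_REQ_EVENT) that
	 * needs to force a bail out to the slow path.
	 */
	kvm_x86_call(sync_pir_to_irr)(vcpu);

	if (kvm_vcpu_exit_request(vcpu)) {
		vcpu->mode = OUTSIDE_GUEST_MODE;
		r = 1;
		goto cancel_injection;	/* unwind and take the slow path */
	}

Any generic hook would have to pick a slot in that ordering, and the right slot
depends entirely on what the hook does.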

And I also want to push for as much stuff as possible to be handled in common x86,
i.e. I want to actively encourage landing things like PKU and CET support in
common x86 instead of implementing support in one vendor and then having to churn
a pile of code to later move it to common x86.

> One other one that caught my eye was the cr3 stuff that was moved out a while
> ago, but then moved back with 1a7158101.
> 
> I haven’t gone through absolutely everything else in that tight loop code
> (and didn’t get a chance to do the same for SVM code), but figured I’d put
> the idea out there to see what you think.
> 
> To be clear, I’m totally OK with the series as-is, just thinking about
> perhaps future ways to incrementally optimize here?

To some extent, we're going to hit diminishing returns.  E.g. one of the reasons
I did a straight revert in commit 1a71581012dd is that we're talking about a handful
of cycles difference.  E.g. as measured from the guest, eliding the CR3+CR4 checks
shaves 3-5 cycles.  From the host side it _looks_ like more (~20 cycles), but it's
hard to even measure accurately because just doing RDTSC affects the results.
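
To illustrate why host-side measurement at this scale is dicey: bracketing the
checks with something like the below (illustrative sketch only, not necessarily
what was actually used) adds LFENCE+RDTSC overhead on the same order as the
~20 cycles being measured:

	/* Illustrative sketch, not an actual measurement patch. */
	u64 t0 = rdtsc_ordered();

	cr3 = __get_current_cr3_fast();
	if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) {
		vmcs_writel(HOST_CR3, cr3);
		vmx->loaded_vmcs->host_state.cr3 = cr3;
	}

	cr4 = cr4_read_shadow();
	if (unlikely(cr4 != vmx->loaded_vmcs->host_state.cr4)) {
		vmcs_writel(HOST_CR4, cr4);
		vmx->loaded_vmcs->host_state.cr4 = cr4;
	}

	trace_printk("cr3/cr4 checks: %llu cycles\n", rdtsc_ordered() - t0);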

For SVM, I don't see any obvious candidates.  E.g. pre_sev_run() has some code that
only needs to be done on the first iteration, but checking a flag or doing a static
CALL+RET is going to be just as costly as what's already there.

In short, the only flows that will benefit are relatively slow flows and/or flows
that aren't easily predicted by the CPU.  E.g. __get_current_cr3_fast() and
cr4_read_shadow() require CALL+RET and might not be super predictable?  But even
they are on the cusp of "who cares".

And that needs to be balanced against the probability of introducing bugs.  E.g.
this code _could_ be done only on the first iteration:

	if (vmx->ple_window_dirty) {
		vmx->ple_window_dirty = false;
		vmcs_write32(PLE_WINDOW, vmx->ple_window);
	}

but (a) checking vmx->ple_window_dirty is going to be super predictable after the
first iteration, (b) handling PLE exits in the fastpath would break things, and
(c) _if_ we want to optimize that code, it can/should be simply moved to
vmx_prepare_switch_to_guest() (but outside of the guest_state_loaded check).
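
Concretely, (c) would look something like the below (sketch only; the rest of
the function is elided, and whether hoisting the write above the
guest_state_loaded early return is safe in all cases would need a closer look):

void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);

	/*
	 * Write the PLE window outside of the guest_state_loaded check so
	 * that updates made while guest state is already loaded aren't
	 * deferred until the next full state load.
	 */
	if (vmx->ple_window_dirty) {
		vmx->ple_window_dirty = false;
		vmcs_write32(PLE_WINDOW, vmx->ple_window);
	}

	if (vmx->guest_state_loaded)
		return;

	/* ... existing host/guest state handling ... */
}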

All that said, I'm not totally opposed to shaving cycles.  Now that @run_flags
is a thing, it's actually trivially easy to optimize the CR3/CR4 checks (famous
last words):

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 48598d017d6f..5cc1f0168b8a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1709,6 +1709,7 @@ enum kvm_x86_run_flags {
        KVM_RUN_FORCE_IMMEDIATE_EXIT    = BIT(0),
        KVM_RUN_LOAD_GUEST_DR6          = BIT(1),
        KVM_RUN_LOAD_DEBUGCTL           = BIT(2),
+       KVM_RUN_IS_FIRST_ITERATION      = BIT(3),
 };
 
 struct kvm_x86_ops {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 55d637cea84a..3deb20b8d0c5 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7439,22 +7439,28 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
                vmx_reload_guest_debugctl(vcpu);
 
        /*
-        * Refresh vmcs.HOST_CR3 if necessary.  This must be done immediately
-        * prior to VM-Enter, as the kernel may load a new ASID (PCID) any time
-        * it switches back to the current->mm, which can occur in KVM context
-        * when switching to a temporary mm to patch kernel code, e.g. if KVM
-        * toggles a static key while handling a VM-Exit.
+        * Refresh vmcs.HOST_CR3 if necessary.  This must be done after IRQs
+        * are disabled, i.e. not when preparing to switch to the guest, as
+        * the kernel may load a new ASID (PCID) any time it switches back to
+        * the current->mm, which can occur in KVM context when switching to a
+        * temporary mm to patch kernel code, e.g. if KVM toggles a static key
+        * while handling a VM-Exit.
+        *
+        * Refresh host CR3 and CR4 only on the first iteration of the inner
+        * loop, as modifying CR3 or CR4 from NMI context is not allowed.
         */
-       cr3 = __get_current_cr3_fast();
-       if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) {
-               vmcs_writel(HOST_CR3, cr3);
-               vmx->loaded_vmcs->host_state.cr3 = cr3;
-       }
+       if (run_flags & KVM_RUN_IS_FIRST_ITERATION) {
+               cr3 = __get_current_cr3_fast();
+               if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) {
+                       vmcs_writel(HOST_CR3, cr3);
+                       vmx->loaded_vmcs->host_state.cr3 = cr3;
+               }
 
-       cr4 = cr4_read_shadow();
-       if (unlikely(cr4 != vmx->loaded_vmcs->host_state.cr4)) {
-               vmcs_writel(HOST_CR4, cr4);
-               vmx->loaded_vmcs->host_state.cr4 = cr4;
+               cr4 = cr4_read_shadow();
+               if (unlikely(cr4 != vmx->loaded_vmcs->host_state.cr4)) {
+                       vmcs_writel(HOST_CR4, cr4);
+                       vmx->loaded_vmcs->host_state.cr4 = cr4;
+               }
        }
 
        /* When single-stepping over STI and MOV SS, we must clear the
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6924006f0796..bff08f58c29a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11286,7 +11286,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
                goto cancel_injection;
        }
 
-       run_flags = 0;
+       run_flags = KVM_RUN_IS_FIRST_ITERATION;
        if (req_immediate_exit) {
                run_flags |= KVM_RUN_FORCE_IMMEDIATE_EXIT;
                kvm_make_request(KVM_REQ_EVENT, vcpu);
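
One loose end the above doesn't show: the inner loop would presumably also need
to drop KVM_RUN_IS_FIRST_ITERATION before re-entering the guest on
EXIT_FASTPATH_REENTER_GUEST, otherwise every iteration still refreshes CR3/CR4.
Roughly (sketch of the loop, not a tested diff):

	for (;;) {
		exit_fastpath = kvm_x86_call(vcpu_run)(vcpu, run_flags);
		if (likely(exit_fastpath != EXIT_FASTPATH_REENTER_GUEST))
			break;

		/*
		 * Re-entry iterations run with IRQs still disabled, so host
		 * CR3/CR4 can't have changed out from under KVM (modifying
		 * them from NMI context is not allowed); drop the one-shot
		 * first-iteration flag before re-entering the guest.
		 */
		run_flags &= ~KVM_RUN_IS_FIRST_ITERATION;

		++vcpu->stat.exits;
	}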

