Message-ID: <aTL0ItlXp5gRi6Q8@google.com>
Date: Fri, 5 Dec 2025 07:02:58 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: "Jürgen Groß" <jgross@...e.com>
Cc: Vitaly Kuznetsov <vkuznets@...hat.com>, linux-kernel@...r.kernel.org, x86@...nel.org, 
	kvm@...r.kernel.org, Paolo Bonzini <pbonzini@...hat.com>, 
	Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH 05/10] KVM/x86: Add KVM_MSR_RET_* defines for values 0 and 1

On Fri, Dec 05, 2025, Jürgen Groß wrote:
> On 05.12.25 11:23, Vitaly Kuznetsov wrote:
> > Juergen Gross <jgross@...e.com> writes:
> > 
> > > For MSR emulation return values only 2 special cases have defines,
> > > while the most used values 0 and 1 don't.
> > > 
> > > Reason seems to be the maze of function calls of MSR emulation
> > > intertwined with the KVM guest exit handlers, which are using the
> > > values 0 and 1 for other purposes. This even led to the comment above
> > > the already existing defines, warning not to use the values 0 and 1
> > > (and negative errno values) in MSR emulation at all.
> > > 
> > > Fact is that MSR emulation and exit handlers are rather well
> > > separated, with only very few exceptions, which are handled in a
> > > sane way.
> > > 
> > > So add defines for the MSR emulation return values 0 and 1, and at
> > > the same time add comments where exit handlers call into MSR
> > > emulation.
> > > 
> > > The new defines will be used later.
> > > 
> > > No change of functionality intended.
> > > 
> > > Signed-off-by: Juergen Gross <jgross@...e.com>
> > > ---
> > >   arch/x86/kvm/x86.c |  2 ++
> > >   arch/x86/kvm/x86.h | 10 ++++++++--
> > >   2 files changed, 10 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index e733cb923312..e87963a47aa5 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -2130,6 +2130,7 @@ static int __kvm_emulate_rdmsr(struct kvm_vcpu *vcpu, u32 msr, int reg,
> > >   	u64 data;
> > >   	int r;
> > > +	/* Call MSR emulation. */

Why?  The function name makes it pretty clear it's doing MSR emulation.

> > >   	r = kvm_emulate_msr_read(vcpu, msr, &data);
> > >   	if (!r) {
> > > @@ -2171,6 +2172,7 @@ static int __kvm_emulate_wrmsr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> > >   {
> > >   	int r;
> > > +	/* Call MSR emulation. */
> > >   	r = kvm_emulate_msr_write(vcpu, msr, data);
> > >   	if (!r) {
> > >   		trace_kvm_msr_write(msr, data);
> > > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> > > index f3dc77f006f9..e44b6373b106 100644
> > > --- a/arch/x86/kvm/x86.h
> > > +++ b/arch/x86/kvm/x86.h
> > > @@ -639,15 +639,21 @@ enum kvm_msr_access {
> > >   /*
> > >    * Internal error codes that are used to indicate that MSR emulation encountered
> > >    * an error that should result in #GP in the guest, unless userspace handles it.
> > > - * Note, '1', '0', and negative numbers are off limits, as they are used by KVM
> > > - * as part of KVM's lightly documented internal KVM_RUN return codes.
> > > + * Note, negative errno values are possible for return values, too.
> > > + * In case MSR emulation is called from an exit handler, any return value other
> > > + * than KVM_MSR_RET_OK will normally result in a GP in the guest.
> > >    *
> > > + * OK		- Emulation succeeded. Must be 0, as in some cases return values
> > > + *		  of functions returning 0 or -errno will just be passed on.

This is (dangerously) misleading due to KVM's use of -errno/0/1 to communicate
whether to exit to userspace or return to the guest.  E.g. my first read of it
is that you meant "will just be passed on up the stack to vcpu_enter_guest()",
but that's flat out wrong because returning KVM_MSR_RET_OK would generate an
unexpected userspace exit.

IMO, the real reason '0' is special is because of the myriad code that treats
'0' as success and everything else as failure.  E.g. making KVM_MSR_RET_OK a
non-zero value would break code like this:

	r = kvm_emulate_msr_read(vcpu, msr, &data);
	if (!r) {
		<happy path>
	} else {
		<sad path>
	}

> > > + * ERR		- Some error occurred.

And this is partly why the conversion stalled out.  *All* of the non-zero values
report an error of some kind.  This really should be something like KVM_MSR_RET_INVALID
or KVM_MSR_RET_FAULTED, but using such precise names would result in "bad" code,
e.g. due to many flows returning '1' when they really mean KVM_MSR_RET_UNSUPPORTED.

> > >    * UNSUPPORTED	- The MSR isn't supported, either because it is completely
> > >    *		  unknown to KVM, or because the MSR should not exist according
> > >    *		  to the vCPU model.
> > >    *
> > >    * FILTERED	- Access to the MSR is denied by a userspace MSR filter.
> > >    */
> > > +#define  KVM_MSR_RET_OK			0
> > > +#define  KVM_MSR_RET_ERR		1
> > >   #define  KVM_MSR_RET_UNSUPPORTED	2
> > >   #define  KVM_MSR_RET_FILTERED		3
> > 
> > I like the general idea of the series as 1/0 can indeed be
> > confusing. What I'm wondering is whether we can do better by changing
> > the 'int' return type to something else. I.e. if the result of the
> > function can be 'passed on', and KVM_MSR_RET_OK/KVM_MSR_RET_ERR have
> > one meaning while KVM_MSR_RET_UNSUPPORTED/KVM_MSR_RET_FILTERED have
> > another, maybe we can change the return type and then, when the value
> > needs to be passed on, do proper explicit vetting of the result
> > (e.g. to make sure only 1/0 pass through)? Just a thought, I think the
> > series as-is makes things better and we can go with it for now.
> 
> The pass through case is always 0 or -errno, never the "1" (and of course
> KVM_MSR_RET_*/-errno).
> 
> Changing from int to something else would probably require some helpers
> for e.g. stuffing something like -EINVAL into it. An enum alone wouldn't
> work for this, so it would need to be a specific new type, like a union
> of an int (for the -errno) and an enum, but I believe this would make the
> code harder to read instead of improving it.

Hmm, for this specific case I probably agree it's not worth hardening.

But for the broader -errno/0/1 pattern, I think enforcing the return type
would be a huge net positive.  AFAICT, by far the biggest problem is the
sheer amount of code that needs to be updated, because truly hardening the returns
will disallow implicit casts and comparisons.  Which is a good thing (and kinda
the whole point), it just makes it hard to do incremental conversions.

We might still be able to do a somewhat incremental conversion by working "ground
up", i.e. by starting from the lowest helpers and "pausing" the conversion at
convenient choke points.  But that may or may not be better than going straight
to a full conversion.

And to help guard against goofs during the transition, we could add e.g.
CONFIG_KVM_PROVE_RUN to generate off-by-default WARNs on bad return values.

E.g. drawing nomenclature from fastpath_t, with a deliberately terse typedef name
(to keep function prototypes short) and a bit of macro magic:

---
 arch/x86/include/asm/kvm_host.h | 17 +++++++++
 arch/x86/kvm/vmx/vmx.c          | 65 +++++++++++++++------------------
 2 files changed, 47 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5a3bfa293e8b..0b9f47669db3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -222,6 +222,23 @@ enum exit_fastpath_completion {
 };
 typedef enum exit_fastpath_completion fastpath_t;
 
+typedef struct { int r; } run_t;
+#define KVM_RUN_ERR(err) ({ static_assert(err < 0); (run_t) { .r = err }; })
+#define KVM_RUN_EXIT_USERSPACE ({ (run_t) { .r = 0 }; })
+#define KVM_RUN_REENTER_GUEST ({ (run_t) { .r = 1 }; })
+
+#ifdef CONFIG_KVM_PROVE_RUN
+#define KVM_RUN_WARN_ON(x) WARN_ON_ONCE(x)
+#else
+#define KVM_RUN_WARN_ON(x) BUILD_BUG_ON_INVALID(x)
+#endif
+
+static __always_inline bool KVM_REENTER_GUEST(run_t ret)
+{
+	KVM_RUN_WARN_ON(ret.r && ret.r > 1);
+	return ret.r > 0;
+}
+
 struct x86_emulate_ctxt;
 struct x86_exception;
 union kvm_smram;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ef8d29c677b9..fdee70f6a0a8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5958,16 +5958,16 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
 	return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0);
 }
 
-static int handle_nmi_window(struct kvm_vcpu *vcpu)
+static run_t handle_nmi_window(struct kvm_vcpu *vcpu)
 {
 	if (KVM_BUG_ON(!enable_vnmi, vcpu->kvm))
-		return -EIO;
+		return KVM_RUN_ERR(-EIO);
 
 	exec_controls_clearbit(to_vmx(vcpu), CPU_BASED_NMI_WINDOW_EXITING);
 	++vcpu->stat.nmi_window_exits;
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 
-	return 1;
+	return KVM_RUN_REENTER_GUEST;
 }
 
 /*
@@ -6187,20 +6187,20 @@ static int handle_preemption_timer(struct kvm_vcpu *vcpu)
  * When nested=0, all VMX instruction VM Exits filter here.  The handlers
  * are overwritten by nested_vmx_hardware_setup() when nested=1.
  */
-static int handle_vmx_instruction(struct kvm_vcpu *vcpu)
+static run_t handle_vmx_instruction(struct kvm_vcpu *vcpu)
 {
 	kvm_queue_exception(vcpu, UD_VECTOR);
-	return 1;
+	return KVM_RUN_REENTER_GUEST;
 }
 
-static int handle_tdx_instruction(struct kvm_vcpu *vcpu)
+static run_t handle_tdx_instruction(struct kvm_vcpu *vcpu)
 {
 	kvm_queue_exception(vcpu, UD_VECTOR);
-	return 1;
+	return KVM_RUN_REENTER_GUEST;
 }
 
 #ifndef CONFIG_X86_SGX_KVM
-static int handle_encls(struct kvm_vcpu *vcpu)
+static run_t handle_encls(struct kvm_vcpu *vcpu)
 {
 	/*
 	 * SGX virtualization is disabled.  There is no software enable bit for
@@ -6208,11 +6208,11 @@ static int handle_encls(struct kvm_vcpu *vcpu)
 	 * the guest from executing ENCLS (when SGX is supported by hardware).
 	 */
 	kvm_queue_exception(vcpu, UD_VECTOR);
-	return 1;
+	return KVM_RUN_REENTER_GUEST;
 }
 #endif /* CONFIG_X86_SGX_KVM */
 
-static int handle_bus_lock_vmexit(struct kvm_vcpu *vcpu)
+static run_t handle_bus_lock_vmexit(struct kvm_vcpu *vcpu)
 {
 	/*
 	 * Hardware may or may not set the BUS_LOCK_DETECTED flag on BUS_LOCK
@@ -6220,10 +6220,10 @@ static int handle_bus_lock_vmexit(struct kvm_vcpu *vcpu)
 	 * vmx_handle_exit().
 	 */
 	to_vt(vcpu)->exit_reason.bus_lock_detected = true;
-	return 1;
+	return KVM_RUN_REENTER_GUEST;
 }
 
-static int handle_notify(struct kvm_vcpu *vcpu)
+static run_t handle_notify(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qual = vmx_get_exit_qual(vcpu);
 	bool context_invalid = exit_qual & NOTIFY_VM_CONTEXT_INVALID;
@@ -6243,10 +6243,10 @@ static int handle_notify(struct kvm_vcpu *vcpu)
 		vcpu->run->exit_reason = KVM_EXIT_NOTIFY;
 		vcpu->run->notify.flags = context_invalid ?
 					  KVM_NOTIFY_CONTEXT_INVALID : 0;
-		return 0;
+		return KVM_RUN_EXIT_USERSPACE;
 	}
 
-	return 1;
+	return KVM_RUN_REENTER_GUEST;
 }
 
 static int vmx_get_msr_imm_reg(struct kvm_vcpu *vcpu)
@@ -6254,24 +6254,19 @@ static int vmx_get_msr_imm_reg(struct kvm_vcpu *vcpu)
 	return vmx_get_instr_info_reg(vmcs_read32(VMX_INSTRUCTION_INFO));
 }
 
-static int handle_rdmsr_imm(struct kvm_vcpu *vcpu)
+static run_t handle_rdmsr_imm(struct kvm_vcpu *vcpu)
 {
 	return kvm_emulate_rdmsr_imm(vcpu, vmx_get_exit_qual(vcpu),
 				     vmx_get_msr_imm_reg(vcpu));
 }
 
-static int handle_wrmsr_imm(struct kvm_vcpu *vcpu)
+static run_t handle_wrmsr_imm(struct kvm_vcpu *vcpu)
 {
 	return kvm_emulate_wrmsr_imm(vcpu, vmx_get_exit_qual(vcpu),
 				     vmx_get_msr_imm_reg(vcpu));
 }
 
-/*
- * The exit handlers return 1 if the exit was handled fully and guest execution
- * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
- * to be done to userspace and return 0.
- */
-static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
+static run_t (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_EXCEPTION_NMI]           = handle_exception_nmi,
 	[EXIT_REASON_EXTERNAL_INTERRUPT]      = handle_external_interrupt,
 	[EXIT_REASON_TRIPLE_FAULT]            = handle_triple_fault,
@@ -6641,7 +6636,7 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
  * The guest has exited.  See if we can fix it or if we need userspace
  * assistance.
  */
-static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
+static run_t __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	union vmx_exit_reason exit_reason = vmx_get_exit_reason(vcpu);
@@ -6666,7 +6661,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 	 * allowed a nested VM-Enter with an invalid vmcs12.  More below.
 	 */
 	if (KVM_BUG_ON(vmx->nested.nested_run_pending, vcpu->kvm))
-		return -EIO;
+		return KVM_RUN_ERR(-EIO);
 
 	if (is_guest_mode(vcpu)) {
 		/*
@@ -6702,11 +6697,11 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 		 */
 		if (vmx->vt.emulation_required) {
 			nested_vmx_vmexit(vcpu, EXIT_REASON_TRIPLE_FAULT, 0, 0);
-			return 1;
+			return KVM_RUN_REENTER_GUEST;
 		}
 
 		if (nested_vmx_reflect_vmexit(vcpu))
-			return 1;
+			return KVM_RUN_REENTER_GUEST;
 	}
 
 	/* If guest state is invalid, start emulating.  L2 is handled above. */
@@ -6719,7 +6714,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 		vcpu->run->fail_entry.hardware_entry_failure_reason
 			= exit_reason.full;
 		vcpu->run->fail_entry.cpu = vcpu->arch.last_vmentry_cpu;
-		return 0;
+		return KVM_RUN_EXIT_USERSPACE;
 	}
 
 	if (unlikely(vmx->fail)) {
@@ -6728,7 +6723,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 		vcpu->run->fail_entry.hardware_entry_failure_reason
 			= vmcs_read32(VM_INSTRUCTION_ERROR);
 		vcpu->run->fail_entry.cpu = vcpu->arch.last_vmentry_cpu;
-		return 0;
+		return KVM_RUN_EXIT_USERSPACE;
 	}
 
 	if ((vectoring_info & VECTORING_INFO_VALID_MASK) &&
@@ -6740,7 +6735,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 	     exit_reason.basic != EXIT_REASON_NOTIFY &&
 	     exit_reason.basic != EXIT_REASON_EPT_MISCONFIG)) {
 		kvm_prepare_event_vectoring_exit(vcpu, INVALID_GPA);
-		return 0;
+		return KVM_RUN_EXIT_USERSPACE;
 	}
 
 	if (unlikely(!enable_vnmi &&
@@ -6763,7 +6758,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 	}
 
 	if (exit_fastpath != EXIT_FASTPATH_NONE)
-		return 1;
+		return KVM_RUN_REENTER_GUEST;
 
 	if (exit_reason.basic >= kvm_vmx_max_exit_handlers)
 		goto unexpected_vmexit;
@@ -6794,25 +6789,25 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 unexpected_vmexit:
 	dump_vmcs(vcpu);
 	kvm_prepare_unexpected_reason_exit(vcpu, exit_reason.full);
-	return 0;
+	return KVM_RUN_EXIT_USERSPACE;
 }
 
 int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 {
-	int ret = __vmx_handle_exit(vcpu, exit_fastpath);
+	run_t ret = __vmx_handle_exit(vcpu, exit_fastpath);
 
 	/*
 	 * Exit to user space when bus lock detected to inform that there is
 	 * a bus lock in guest.
 	 */
 	if (vmx_get_exit_reason(vcpu).bus_lock_detected) {
-		if (ret > 0)
+		if (KVM_REENTER_GUEST(ret))
 			vcpu->run->exit_reason = KVM_EXIT_X86_BUS_LOCK;
 
 		vcpu->run->flags |= KVM_RUN_X86_BUS_LOCK;
-		return 0;
+		return KVM_RUN_EXIT_USERSPACE.r;
 	}
-	return ret;
+	return ret.r;
 }
 
 void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)

base-commit: 99e111dd57b5e5d4c673164f9026ea96eedc9adf
-- 
