Message-ID: <281ee562-d5fd-4f2b-b409-f7b273933141@suse.com>
Date: Fri, 5 Dec 2025 16:59:33 +0100
From: Jürgen Groß <jgross@...e.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Vitaly Kuznetsov <vkuznets@...hat.com>, linux-kernel@...r.kernel.org,
 x86@...nel.org, kvm@...r.kernel.org, Paolo Bonzini <pbonzini@...hat.com>,
 Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
 Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
 "H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH 05/10] KVM/x86: Add KVM_MSR_RET_* defines for values 0 and
 1

On 05.12.25 16:02, Sean Christopherson wrote:
> On Fri, Dec 05, 2025, Jürgen Groß wrote:
>> On 05.12.25 11:23, Vitaly Kuznetsov wrote:
>>> Juergen Gross <jgross@...e.com> writes:
>>>
>>>> Of the MSR emulation return values, only 2 special cases have defines,
>>>> while the most frequently used values, 0 and 1, don't.
>>>>
>>>> The reason seems to be the maze of function calls of MSR emulation
>>>> intertwined with the KVM guest exit handlers, which use the values
>>>> 0 and 1 for other purposes. This even led to the comment above the
>>>> already existing defines, warning not to use the values 0 and 1 (and
>>>> negative errno values) in the MSR emulation at all.
>>>>
>>>> In fact, MSR emulation and the exit handlers are rather well
>>>> separated, with only very few exceptions, which are handled in a
>>>> sane way.
>>>>
>>>> So add defines for the values 0 and 1 of MSR emulation, and at the
>>>> same time add comments where exit handlers call into MSR emulation.
>>>>
>>>> The new defines will be used later.
>>>>
>>>> No change of functionality intended.
>>>>
>>>> Signed-off-by: Juergen Gross <jgross@...e.com>
>>>> ---
>>>>    arch/x86/kvm/x86.c |  2 ++
>>>>    arch/x86/kvm/x86.h | 10 ++++++++--
>>>>    2 files changed, 10 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>> index e733cb923312..e87963a47aa5 100644
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -2130,6 +2130,7 @@ static int __kvm_emulate_rdmsr(struct kvm_vcpu *vcpu, u32 msr, int reg,
>>>>    	u64 data;
>>>>    	int r;
>>>> +	/* Call MSR emulation. */
> 
> Why?  The function name makes it pretty clear it's doing MSR emulation.

Hmm, true. I just wanted to make clear that the return values are NOT the
ones indicating "return to guest/user". Maybe I should just rename "r" to
"ret_msr" in the related functions?
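A hedged sketch of what such a rename could look like, in standalone userspace C with kvm_emulate_msr_read() stubbed out (the stub's MSR number and data value are made up for illustration):

```c
#include <assert.h>

/* MSR emulation return codes, per the patch under discussion. */
#define KVM_MSR_RET_OK  0
#define KVM_MSR_RET_ERR 1

/* Userspace stub standing in for the real kvm_emulate_msr_read(). */
static int kvm_emulate_msr_read(unsigned int msr, unsigned long long *data)
{
	if (msr == 0x10) {		/* arbitrary "known" MSR */
		*data = 42;
		return KVM_MSR_RET_OK;
	}
	return KVM_MSR_RET_ERR;
}

/*
 * Sketch of __kvm_emulate_rdmsr() with "r" renamed to "ret_msr", making
 * it obvious the value is a KVM_MSR_RET_* code, not one of KVM's
 * "return to guest/user" KVM_RUN codes.
 */
static int emulate_rdmsr(unsigned int msr, unsigned long long *out)
{
	unsigned long long data;
	int ret_msr;

	ret_msr = kvm_emulate_msr_read(msr, &data);
	if (ret_msr == KVM_MSR_RET_OK)
		*out = data;
	return ret_msr;
}
```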

> 
>>>>    	r = kvm_emulate_msr_read(vcpu, msr, &data);
>>>>    	if (!r) {
>>>> @@ -2171,6 +2172,7 @@ static int __kvm_emulate_wrmsr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>>>>    {
>>>>    	int r;
>>>> +	/* Call MSR emulation. */
>>>>    	r = kvm_emulate_msr_write(vcpu, msr, data);
>>>>    	if (!r) {
>>>>    		trace_kvm_msr_write(msr, data);
>>>> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
>>>> index f3dc77f006f9..e44b6373b106 100644
>>>> --- a/arch/x86/kvm/x86.h
>>>> +++ b/arch/x86/kvm/x86.h
>>>> @@ -639,15 +639,21 @@ enum kvm_msr_access {
>>>>    /*
>>>>     * Internal error codes that are used to indicate that MSR emulation encountered
>>>>     * an error that should result in #GP in the guest, unless userspace handles it.
>>>> - * Note, '1', '0', and negative numbers are off limits, as they are used by KVM
>>>> - * as part of KVM's lightly documented internal KVM_RUN return codes.
>>>> + * Note, negative errno values are possible for return values, too.
>>>> + * In case MSR emulation is called from an exit handler, any return value other
>>>> + * than KVM_MSR_RET_OK will normally result in a GP in the guest.
>>>>     *
>>>> + * OK		- Emulation succeeded. Must be 0, as in some cases return values
>>>> + *		  of functions returning 0 or -errno will just be passed on.
> 
> This is (dangerously) misleading due to KVM's use of -errno/0/1 to communicate
> whether to exit to userspace or return to the guest.  E.g. my first read of it
> is that you meant "will just be passed on up the stack to vcpu_enter_guest()",
> but that's flat out wrong because returning KVM_MSR_RET_OK would generate an
> unexpected userspace exit.
> 
> IMO, the real reason '0' is special is because of the myriad code that treats
> '0' as success and everything else as failure.  E.g. making KVM_MSR_RET_OK a
> non-zero value would break code like this:
> 
> 	r = kvm_emulate_msr_read(vcpu, msr, &data);
> 	if (!r) {
> 		<happy path>
> 	} else {
> 		<sad path>
> 	}
> 

Yes, I can rephrase the comment.

>>>> + * ERR		- Some error occurred.
> 
> And this is partly why the conversion stalled out.  *All* of the non-zero values
> report an error of some kind.  This really should be something like KVM_MSR_RET_INVALID
> or KVM_MSR_RET_FAULTED, but using such precise names would result in "bad" code,
> e.g. due to many flows returning '1' when they really mean KVM_MSR_RET_UNSUPPORTED.
> 
>>>>     * UNSUPPORTED	- The MSR isn't supported, either because it is completely
>>>>     *		  unknown to KVM, or because the MSR should not exist according
>>>>     *		  to the vCPU model.
>>>>     *
>>>>     * FILTERED	- Access to the MSR is denied by a userspace MSR filter.
>>>>     */
>>>> +#define  KVM_MSR_RET_OK			0
>>>> +#define  KVM_MSR_RET_ERR		1
>>>>    #define  KVM_MSR_RET_UNSUPPORTED	2
>>>>    #define  KVM_MSR_RET_FILTERED		3
>>>
>>> I like the general idea of the series, as 1/0 can indeed be
>>> confusing. What I'm wondering is whether we can do better by changing
>>> the 'int' return type to something else. I.e. if the result of the
>>> function can be 'passed on', and KVM_MSR_RET_OK/KVM_MSR_RET_ERR have
>>> one meaning while KVM_MSR_RET_UNSUPPORTED/KVM_MSR_RET_FILTERED have
>>> another, maybe we can do better by changing the return type and then,
>>> when the value needs to be passed on, doing proper explicit vetting of
>>> the result (e.g. to make sure only 1/0 pass through)? Just a thought;
>>> I think the series as-is makes things better and we can go with it
>>> for now.
>>
>> The pass-through case is always 0 or -errno, never "1" (and of course
>> KVM_MSR_RET_*/-errno).
>>
>> Changing from int to something else would probably require some helpers,
>> e.g. for stuffing something like -EINVAL into it. An enum alone wouldn't
>> work for this, so it would need to be a specific new type, like a union
>> of an int (for the -errno) and an enum, but I believe this would make
>> the code harder to read instead of improving it.
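For illustration, a minimal userspace sketch of what such a combined type might look like (all names here are hypothetical, nothing is kernel API); the constructor and accessor boilerplate it forces on every call site is exactly the readability cost described above:

```c
#include <assert.h>

/* Hypothetical combined return type: either a -errno or an enum code. */
enum msr_ret_code { MSR_RET_OK, MSR_RET_ERR, MSR_RET_UNSUPPORTED, MSR_RET_FILTERED };

typedef struct {
	int is_errno;			/* discriminator */
	int err;			/* valid if is_errno is set */
	enum msr_ret_code code;		/* valid otherwise */
} msr_result_t;

/* Helpers needed just to construct values... */
static msr_result_t msr_errno(int err)
{
	return (msr_result_t){ .is_errno = 1, .err = err };
}

static msr_result_t msr_code(enum msr_ret_code code)
{
	return (msr_result_t){ .code = code };
}

/* ...and another helper just to test for success. */
static int msr_ok(msr_result_t r)
{
	return !r.is_errno && r.code == MSR_RET_OK;
}
```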
> 
> Hmm, for this specific case I probably agree it's not worth hardening.
> 
> But for the broader -errno/0/1 pattern, I think enforcing the return type
> would be a huge net positive.  AFAICT, by far the biggest problem is the
> sheer amount of code that needs to be updated, because truly hardening the
> returns will disallow implicit casts and comparisons.  Which is a good
> thing (and kinda the whole point), it just makes it hard to do incremental
> conversions.
> 
> We might still be able to do a somewhat incremental conversion by working
> "ground up", i.e. by starting from the lowest helpers and "pausing" the
> conversion at convenient choke points.  But that may or may not be better
> than going straight to a full conversion.

I like the general approach of doing it in several chunks. Those chunks can
be in one or multiple series; either way it will make reviewing easier.

> And to help guard against goofs during the transition, we could add e.g.
> CONFIG_KVM_PROVE_RUN to generate off-by-default WARNs on bad return values.
> 
> E.g. drawing nomenclature from fastpath_t, with a deliberately terse typedef name
> (to keep function prototypes short) and a bit of macro magic:

Yes, here I agree it will be an improvement, as there are so many use cases.

BUT I'd like to have something like your idea below in include/linux/kvm_host.h,
as it would be usable by other archs, too.


Juergen

> 
> ---
>   arch/x86/include/asm/kvm_host.h | 17 +++++++++
>   arch/x86/kvm/vmx/vmx.c          | 65 +++++++++++++++------------------
>   2 files changed, 47 insertions(+), 35 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 5a3bfa293e8b..0b9f47669db3 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -222,6 +222,23 @@ enum exit_fastpath_completion {
>   };
>   typedef enum exit_fastpath_completion fastpath_t;
>   
> +typedef struct { int r; } run_t;
> +#define KVM_RUN_ERR(err) ({ static_assert(err < 0); (run_t) { .r = err }; })
> +#define KVM_RUN_EXIT_USERSPACE ({ (run_t) { .r = 0 }; })
> +#define KVM_RUN_REENTER_GUEST ({ (run_t) { .r = 1 }; })
> +
> +#ifdef CONFIG_KVM_PROVE_RUN
> +#define KVM_RUN_WARN_ON(x) WARN_ON_ONCE(x)
> +#else
> +#define KVM_RUN_WARN_ON(x) BUILD_BUG_ON_INVALID(x)
> +#endif
> +
> +static __always_inline bool KVM_REENTER_GUEST(run_t ret)
> +{
> +	KVM_RUN_WARN_ON(ret.r && ret.r > 1);
> +	return ret.r > 0;
> +}
> +
>   struct x86_emulate_ctxt;
>   struct x86_exception;
>   union kvm_smram;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index ef8d29c677b9..fdee70f6a0a8 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5958,16 +5958,16 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
>   	return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0);
>   }
>   
> -static int handle_nmi_window(struct kvm_vcpu *vcpu)
> +static run_t handle_nmi_window(struct kvm_vcpu *vcpu)
>   {
>   	if (KVM_BUG_ON(!enable_vnmi, vcpu->kvm))
> -		return -EIO;
> +		return KVM_RUN_ERR(-EIO);
>   
>   	exec_controls_clearbit(to_vmx(vcpu), CPU_BASED_NMI_WINDOW_EXITING);
>   	++vcpu->stat.nmi_window_exits;
>   	kvm_make_request(KVM_REQ_EVENT, vcpu);
>   
> -	return 1;
> +	return KVM_RUN_REENTER_GUEST;
>   }
>   
>   /*
> @@ -6187,20 +6187,20 @@ static int handle_preemption_timer(struct kvm_vcpu *vcpu)
>    * When nested=0, all VMX instruction VM Exits filter here.  The handlers
>    * are overwritten by nested_vmx_hardware_setup() when nested=1.
>    */
> -static int handle_vmx_instruction(struct kvm_vcpu *vcpu)
> +static run_t handle_vmx_instruction(struct kvm_vcpu *vcpu)
>   {
>   	kvm_queue_exception(vcpu, UD_VECTOR);
> -	return 1;
> +	return KVM_RUN_REENTER_GUEST;
>   }
>   
> -static int handle_tdx_instruction(struct kvm_vcpu *vcpu)
> +static run_t handle_tdx_instruction(struct kvm_vcpu *vcpu)
>   {
>   	kvm_queue_exception(vcpu, UD_VECTOR);
> -	return 1;
> +	return KVM_RUN_REENTER_GUEST;
>   }
>   
>   #ifndef CONFIG_X86_SGX_KVM
> -static int handle_encls(struct kvm_vcpu *vcpu)
> +static run_t handle_encls(struct kvm_vcpu *vcpu)
>   {
>   	/*
>   	 * SGX virtualization is disabled.  There is no software enable bit for
> @@ -6208,11 +6208,11 @@ static int handle_encls(struct kvm_vcpu *vcpu)
>   	 * the guest from executing ENCLS (when SGX is supported by hardware).
>   	 */
>   	kvm_queue_exception(vcpu, UD_VECTOR);
> -	return 1;
> +	return KVM_RUN_REENTER_GUEST;
>   }
>   #endif /* CONFIG_X86_SGX_KVM */
>   
> -static int handle_bus_lock_vmexit(struct kvm_vcpu *vcpu)
> +static run_t handle_bus_lock_vmexit(struct kvm_vcpu *vcpu)
>   {
>   	/*
>   	 * Hardware may or may not set the BUS_LOCK_DETECTED flag on BUS_LOCK
> @@ -6220,10 +6220,10 @@ static int handle_bus_lock_vmexit(struct kvm_vcpu *vcpu)
>   	 * vmx_handle_exit().
>   	 */
>   	to_vt(vcpu)->exit_reason.bus_lock_detected = true;
> -	return 1;
> +	return KVM_RUN_REENTER_GUEST;
>   }
>   
> -static int handle_notify(struct kvm_vcpu *vcpu)
> +static run_t handle_notify(struct kvm_vcpu *vcpu)
>   {
>   	unsigned long exit_qual = vmx_get_exit_qual(vcpu);
>   	bool context_invalid = exit_qual & NOTIFY_VM_CONTEXT_INVALID;
> @@ -6243,10 +6243,10 @@ static int handle_notify(struct kvm_vcpu *vcpu)
>   		vcpu->run->exit_reason = KVM_EXIT_NOTIFY;
>   		vcpu->run->notify.flags = context_invalid ?
>   					  KVM_NOTIFY_CONTEXT_INVALID : 0;
> -		return 0;
> +		return KVM_RUN_EXIT_USERSPACE;
>   	}
>   
> -	return 1;
> +	return KVM_RUN_REENTER_GUEST;
>   }
>   
>   static int vmx_get_msr_imm_reg(struct kvm_vcpu *vcpu)
> @@ -6254,24 +6254,19 @@ static int vmx_get_msr_imm_reg(struct kvm_vcpu *vcpu)
>   	return vmx_get_instr_info_reg(vmcs_read32(VMX_INSTRUCTION_INFO));
>   }
>   
> -static int handle_rdmsr_imm(struct kvm_vcpu *vcpu)
> +static run_t handle_rdmsr_imm(struct kvm_vcpu *vcpu)
>   {
>   	return kvm_emulate_rdmsr_imm(vcpu, vmx_get_exit_qual(vcpu),
>   				     vmx_get_msr_imm_reg(vcpu));
>   }
>   
> -static int handle_wrmsr_imm(struct kvm_vcpu *vcpu)
> +static run_t handle_wrmsr_imm(struct kvm_vcpu *vcpu)
>   {
>   	return kvm_emulate_wrmsr_imm(vcpu, vmx_get_exit_qual(vcpu),
>   				     vmx_get_msr_imm_reg(vcpu));
>   }
>   
> -/*
> - * The exit handlers return 1 if the exit was handled fully and guest execution
> - * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
> - * to be done to userspace and return 0.
> - */
> -static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
> +static run_t (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>   	[EXIT_REASON_EXCEPTION_NMI]           = handle_exception_nmi,
>   	[EXIT_REASON_EXTERNAL_INTERRUPT]      = handle_external_interrupt,
>   	[EXIT_REASON_TRIPLE_FAULT]            = handle_triple_fault,
> @@ -6641,7 +6636,7 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
>    * The guest has exited.  See if we can fix it or if we need userspace
>    * assistance.
>    */
> -static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
> +static run_t __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   	union vmx_exit_reason exit_reason = vmx_get_exit_reason(vcpu);
> @@ -6666,7 +6661,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
>   	 * allowed a nested VM-Enter with an invalid vmcs12.  More below.
>   	 */
>   	if (KVM_BUG_ON(vmx->nested.nested_run_pending, vcpu->kvm))
> -		return -EIO;
> +		return KVM_RUN_ERR(-EIO);
>   
>   	if (is_guest_mode(vcpu)) {
>   		/*
> @@ -6702,11 +6697,11 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
>   		 */
>   		if (vmx->vt.emulation_required) {
>   			nested_vmx_vmexit(vcpu, EXIT_REASON_TRIPLE_FAULT, 0, 0);
> -			return 1;
> +			return KVM_RUN_REENTER_GUEST;
>   		}
>   
>   		if (nested_vmx_reflect_vmexit(vcpu))
> -			return 1;
> +			return KVM_RUN_REENTER_GUEST;
>   	}
>   
>   	/* If guest state is invalid, start emulating.  L2 is handled above. */
> @@ -6719,7 +6714,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
>   		vcpu->run->fail_entry.hardware_entry_failure_reason
>   			= exit_reason.full;
>   		vcpu->run->fail_entry.cpu = vcpu->arch.last_vmentry_cpu;
> -		return 0;
> +		return KVM_RUN_EXIT_USERSPACE;
>   	}
>   
>   	if (unlikely(vmx->fail)) {
> @@ -6728,7 +6723,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
>   		vcpu->run->fail_entry.hardware_entry_failure_reason
>   			= vmcs_read32(VM_INSTRUCTION_ERROR);
>   		vcpu->run->fail_entry.cpu = vcpu->arch.last_vmentry_cpu;
> -		return 0;
> +		return KVM_RUN_EXIT_USERSPACE;
>   	}
>   
>   	if ((vectoring_info & VECTORING_INFO_VALID_MASK) &&
> @@ -6740,7 +6735,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
>   	     exit_reason.basic != EXIT_REASON_NOTIFY &&
>   	     exit_reason.basic != EXIT_REASON_EPT_MISCONFIG)) {
>   		kvm_prepare_event_vectoring_exit(vcpu, INVALID_GPA);
> -		return 0;
> +		return KVM_RUN_EXIT_USERSPACE;
>   	}
>   
>   	if (unlikely(!enable_vnmi &&
> @@ -6763,7 +6758,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
>   	}
>   
>   	if (exit_fastpath != EXIT_FASTPATH_NONE)
> -		return 1;
> +		return KVM_RUN_REENTER_GUEST;
>   
>   	if (exit_reason.basic >= kvm_vmx_max_exit_handlers)
>   		goto unexpected_vmexit;
> @@ -6794,25 +6789,25 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
>   unexpected_vmexit:
>   	dump_vmcs(vcpu);
>   	kvm_prepare_unexpected_reason_exit(vcpu, exit_reason.full);
> -	return 0;
> +	return KVM_RUN_EXIT_USERSPACE;
>   }
>   
>   int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
>   {
> -	int ret = __vmx_handle_exit(vcpu, exit_fastpath);
> +	run_t ret = __vmx_handle_exit(vcpu, exit_fastpath);
>   
>   	/*
>   	 * Exit to user space when bus lock detected to inform that there is
>   	 * a bus lock in guest.
>   	 */
>   	if (vmx_get_exit_reason(vcpu).bus_lock_detected) {
> -		if (ret > 0)
> +		if (KVM_REENTER_GUEST(ret))
>   			vcpu->run->exit_reason = KVM_EXIT_X86_BUS_LOCK;
>   
>   		vcpu->run->flags |= KVM_RUN_X86_BUS_LOCK;
> -		return 0;
> +		return KVM_RUN_EXIT_USERSPACE.r;
>   	}
> -	return ret;
> +	return ret.r;
>   }
>   
>   void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
> 
> base-commit: 99e111dd57b5e5d4c673164f9026ea96eedc9adf
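The run_t semantics from the sketch above can be exercised in plain userspace C. This hedged replica drops the kernel-only WARN/BUILD_BUG machinery and the GCC statement-expression macros (including the static_assert in KVM_RUN_ERR) in favor of portable compound literals, keeping only the behavior:

```c
#include <assert.h>
#include <errno.h>

/* Userspace replica of the proposed run_t wrapper. */
typedef struct { int r; } run_t;

#define KVM_RUN_ERR(err)	((run_t){ .r = (err) })	/* caller must pass err < 0 */
#define KVM_RUN_EXIT_USERSPACE	((run_t){ .r = 0 })
#define KVM_RUN_REENTER_GUEST	((run_t){ .r = 1 })

static inline int KVM_REENTER_GUEST(run_t ret)
{
	assert(ret.r <= 1);	/* stand-in for KVM_RUN_WARN_ON() */
	return ret.r > 0;	/* 1: resume the guest; 0: -errno or userspace exit */
}
```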

