Message-ID: <7c7a5a75-a786-4a05-a836-4368582ca4c2@redhat.com>
Date: Tue, 23 Sep 2025 17:58:03 +0200
From: Paolo Bonzini <pbonzini@...hat.com>
To: Maxim Levitsky <mlevitsk@...hat.com>, kvm@...r.kernel.org
Cc: Sean Christopherson <seanjc@...gle.com>,
Dave Hansen <dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>,
Ingo Molnar <mingo@...hat.com>, Thomas Gleixner <tglx@...utronix.de>,
x86@...nel.org, Borislav Petkov <bp@...en8.de>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/3] KVM: x86: Fix a semi theoretical bug in
kvm_arch_async_page_present_queued
On 8/13/25 21:23, Maxim Levitsky wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9018d56b4b0a..3d45a4cd08a4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
>
>  void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
>  {
> -        kvm_make_request(KVM_REQ_APF_READY, vcpu);
> -        if (!vcpu->arch.apf.pageready_pending)
> +        /* Pairs with smp_store_release in vcpu_enter_guest. */
> +        bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
> +        bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
> +
> +        if (!in_guest_mode || !page_ready_pending) {
> +                kvm_make_request(KVM_REQ_APF_READY, vcpu);
>                  kvm_vcpu_kick(vcpu);
> +        }
Unlike Sean, I think the race exists in the abstract and is not benign, but
there are already enough memory barriers in place to tame it.
That said, in_guest_mode is a red herring.  The way I look at it is
through the very common wake/sleep (or kick/check) pattern that has an
smp_mb() on both sides.
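As a reminder, the abstract pattern is roughly the following (made-up
names, not actual KVM code):

        /* kick side */
        WRITE_ONCE(work, true);
        smp_mb();
        if (READ_ONCE(sleeping))
                wake_up(waiter);

        /* check side */
        WRITE_ONCE(sleeping, false);
        smp_mb();
        if (READ_ONCE(work))
                do_work();

The barriers guarantee that at least one of the two loads observes the
other side's store, so either the kick is sent or the work is noticed
on the check side.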
The pair you are considering consists of this function (the "kick
side"), and the wrmsr handler for MSR_KVM_ASYNC_PF_ACK (the "check
side"). Let's see how the synchronization between the two sides
happens:
- here, you need to check whether to inject the interrupt. It looks
like this:
        kvm_make_request(KVM_REQ_APF_READY, vcpu);
        smp_mb();
        if (!READ_ONCE(page_ready_pending))
                kvm_vcpu_kick(vcpu);
- on the other side, after clearing page_ready_pending, there will be a
check for a wakeup:
        WRITE_ONCE(page_ready_pending, false);
        smp_mb();
        if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
                kvm_check_async_pf_completion(vcpu);
except that the "if" is not in kvm_set_msr_common(); it will happen
naturally as part of the first re-entry.
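(For reference, that check sits in the request-processing block of
vcpu_enter_guest(), roughly:

        if (kvm_request_pending(vcpu)) {
                ...
                if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
                        kvm_check_async_pf_completion(vcpu);
                ...
        }

so it runs before the next guest entry.)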
So let's look at the changes you need in order to make the code look
like the above.
- using READ_ONCE/WRITE_ONCE for pageready_pending never hurts
- here in kvm_arch_async_page_present_queued(), a smp_mb__after_atomic()
(compiler barrier on x86) is missing after kvm_make_request():
        kvm_make_request(KVM_REQ_APF_READY, vcpu);
        /*
         * Tell the vCPU to wake up before checking whether it
         * needs an interrupt.  Pairs with any memory barrier
         * between the clearing of pageready_pending and vCPU entry.
         */
        smp_mb__after_atomic();
        if (!READ_ONCE(vcpu->arch.apf.pageready_pending))
                kvm_vcpu_kick(vcpu);
- in kvm_set_msr_common(), there are two possibilities.
The easy one is to just use smp_store_mb() to clear
vcpu->arch.apf.pageready_pending. The other would be a comment
like this:
        WRITE_ONCE(vcpu->arch.apf.pageready_pending, false);
        /*
         * Ensure they know to wake this vCPU up, before the vCPU
         * next checks KVM_REQ_APF_READY.  Use an existing memory
         * barrier between here and the next kvm_request_pending(),
         * for example in vcpu_run().
         */
        /* smp_mb(); */
plus a memory barrier in common code like this:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 706b6fd56d3c..e302c617e4b2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11236,6 +11236,11 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
                 if (r <= 0)
                         break;
 
+                /*
+                 * Provide a memory barrier between handle_exit and the
+                 * kvm_request_pending() read in vcpu_enter_guest().  It
+                 * pairs with any barrier after kvm_make_request(), for
+                 * example in kvm_arch_async_page_present_queued().
+                 */
+                smp_mb__before_atomic();
                 kvm_clear_request(KVM_REQ_UNBLOCK, vcpu);
                 if (kvm_xen_has_pending_events(vcpu))
                         kvm_xen_inject_pending_events(vcpu);
The only advantage of this second, more complex approach is that
it shows *why* the race was not happening. The 50 clock cycles
saved on an MSR write are not worth the extra complication, and
on a quick grep I could not find other cases which rely on the same
implicit barriers. So I'd say use smp_store_mb(), with a comment
about the pairing with kvm_arch_async_page_present_queued(); and write
in the commit message that the race wasn't happening thanks to unrelated
memory barriers between handle_exit and the kvm_request_pending()
read in vcpu_enter_guest.
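Something along these lines, as a sketch only (I'm paraphrasing the
existing MSR_KVM_ASYNC_PF_ACK handling, and the exact comment wording is
up to you):

        if (data & 0x1) {
                /*
                 * Pairs with the smp_mb__after_atomic() after
                 * kvm_make_request() in kvm_arch_async_page_present_queued().
                 */
                smp_store_mb(vcpu->arch.apf.pageready_pending, false);
                kvm_check_async_pf_completion(vcpu);
        }

smp_store_mb() does the WRITE_ONCE() and the full barrier in one go, so
nothing else is needed on the check side.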
Thanks,
Paolo