Message-ID: <7c7a5a75-a786-4a05-a836-4368582ca4c2@redhat.com>
Date: Tue, 23 Sep 2025 17:58:03 +0200
From: Paolo Bonzini <pbonzini@...hat.com>
To: Maxim Levitsky <mlevitsk@...hat.com>, kvm@...r.kernel.org
Cc: Sean Christopherson <seanjc@...gle.com>,
Dave Hansen <dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>,
Ingo Molnar <mingo@...hat.com>, Thomas Gleixner <tglx@...utronix.de>,
x86@...nel.org, Borislav Petkov <bp@...en8.de>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/3] KVM: x86: Fix a semi theoretical bug in
kvm_arch_async_page_present_queued
On 8/13/25 21:23, Maxim Levitsky wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9018d56b4b0a..3d45a4cd08a4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
>
>  void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
>  {
> -        kvm_make_request(KVM_REQ_APF_READY, vcpu);
> -        if (!vcpu->arch.apf.pageready_pending)
> +        /* Pairs with smp_store_release in vcpu_enter_guest. */
> +        bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
> +        bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
> +
> +        if (!in_guest_mode || !page_ready_pending) {
> +                kvm_make_request(KVM_REQ_APF_READY, vcpu);
>                  kvm_vcpu_kick(vcpu);
> +        }
Unlike Sean, I think the race exists in the abstract and is not benign, but
there are already enough memory barriers in place to tame it.
That said, in_guest_mode is a red herring.  The way I look at it is
through the very common wake/sleep (or kick/check) pattern that has an
smp_mb() on both sides.
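As a reminder, the abstract pattern is roughly the following (made-up
names, not actual KVM code):

        /* kick side */
        WRITE_ONCE(work, true);
        smp_mb();
        if (READ_ONCE(sleeping))
                wake_up(waiter);

        /* check side */
        WRITE_ONCE(sleeping, false);
        smp_mb();
        if (READ_ONCE(work))
                do_work();

The barriers guarantee that at least one of the two loads observes the
other side's store, so either the kick is sent or the work is noticed
on the check side.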
The pair you are considering consists of this function (the "kick
side"), and the wrmsr handler for MSR_KVM_ASYNC_PF_ACK (the "check
side"). Let's see how the synchronization between the two sides
happens:
- here, you need to check whether to inject the interrupt. It looks
like this:
        kvm_make_request(KVM_REQ_APF_READY, vcpu);
        smp_mb();
        if (!READ_ONCE(page_ready_pending))
                kvm_vcpu_kick(vcpu);
- on the other side, after clearing page_ready_pending, there will be a
check for a wakeup:
        WRITE_ONCE(page_ready_pending, false);
        smp_mb();
        if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
                kvm_check_async_pf_completion(vcpu);
except that the "if" is not in kvm_set_msr_common(); it will happen
naturally as part of the first re-entry.
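(For reference, that check sits in the request-processing block of
vcpu_enter_guest(), roughly:

        if (kvm_request_pending(vcpu)) {
                ...
                if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
                        kvm_check_async_pf_completion(vcpu);
                ...
        }

so it runs before the next guest entry.)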
So let's look at the changes you need in order to make the code look
like the above.
- using READ_ONCE/WRITE_ONCE for pageready_pending never hurts
- here in kvm_arch_async_page_present_queued(), a smp_mb__after_atomic()
(compiler barrier on x86) is missing after kvm_make_request():
        kvm_make_request(KVM_REQ_APF_READY, vcpu);
        /*
         * Tell the vCPU to wake up before checking whether it
         * needs an interrupt.  Pairs with any memory barrier
         * between the clearing of pageready_pending and vCPU entry.
         */
        smp_mb__after_atomic();
        if (!READ_ONCE(vcpu->arch.apf.pageready_pending))
                kvm_vcpu_kick(vcpu);
- in kvm_set_msr_common(), there are two possibilities.
The easy one is to just use smp_store_mb() to clear
vcpu->arch.apf.pageready_pending. The other would be a comment
like this:
        WRITE_ONCE(vcpu->arch.apf.pageready_pending, false);
        /*
         * Ensure they know to wake this vCPU up, before the vCPU
         * next checks KVM_REQ_APF_READY.  Use an existing memory
         * barrier between here and the next kvm_request_pending(),
         * for example in vcpu_run().
         */
        /* smp_mb(); */
plus a memory barrier in common code like this:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 706b6fd56d3c..e302c617e4b2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11236,6 +11236,11 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
                 if (r <= 0)
                         break;
 
+                /*
+                 * Provide a memory barrier between handle_exit and the
+                 * kvm_request_pending() read in vcpu_enter_guest().  It
+                 * pairs with any barrier after kvm_make_request(), for
+                 * example in kvm_arch_async_page_present_queued().
+                 */
+                smp_mb__before_atomic();
                 kvm_clear_request(KVM_REQ_UNBLOCK, vcpu);
                 if (kvm_xen_has_pending_events(vcpu))
                         kvm_xen_inject_pending_events(vcpu);
The only advantage of this second, more complex approach is that
it shows *why* the race was not happening. The 50 clock cycles
saved on an MSR write are not worth the extra complication, and
on a quick grep I could not find other cases which rely on the same
implicit barriers. So I'd say use smp_store_mb(), with a comment
about the pairing with kvm_arch_async_page_present_queued(); and write
in the commit message that the race wasn't happening thanks to unrelated
memory barriers between handle_exit and the kvm_request_pending()
read in vcpu_enter_guest.
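Something along these lines, as a sketch only (I'm paraphrasing the
existing MSR_KVM_ASYNC_PF_ACK handling, and the exact comment wording is
up to you):

        if (data & 0x1) {
                /*
                 * Pairs with the smp_mb__after_atomic() after
                 * kvm_make_request() in kvm_arch_async_page_present_queued().
                 */
                smp_store_mb(vcpu->arch.apf.pageready_pending, false);
                kvm_check_async_pf_completion(vcpu);
        }

smp_store_mb() does the WRITE_ONCE() and the full barrier in one go, so
nothing else is needed on the check side.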
Thanks,
Paolo