Message-ID: <aV1EF5DU5e66NTK0@google.com>
Date: Tue, 6 Jan 2026 09:19:19 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Alessandro Ratti <alessandro@...5c.net>
Cc: pbonzini@...hat.com, kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
syzbot+1522459a74d26b0ac33a@...kaller.appspotmail.com
Subject: Re: [PATCH] KVM: x86: Retry guest entry on -EBUSY from kvm_check_nested_events()

On Sun, Jan 04, 2026, Alessandro Ratti wrote:
> When a vCPU running in nested guest mode attempts to block (e.g., due
> to HLT), kvm_check_nested_events() may return -EBUSY to indicate that a
> nested event is pending but cannot be injected immediately, such as
> when event delivery is temporarily blocked in the guest.
>
> Currently, vcpu_block() logs a WARN_ON_ONCE() and then treats -EBUSY
> like any other error, returning 0 to exit to userspace. This can cause
> the vCPU to repeatedly block without making forward progress, delaying
> event injection and potentially leading to guest hangs under rare timing
> conditions.
>
> Remove the WARN_ON_ONCE() and handle -EBUSY explicitly by returning 1
> to retry guest entry instead of exiting to userspace. This allows the
> nested event to be injected once the temporary blocking condition
> clears, ensuring forward progress.
>
> This issue was triggered by syzkaller while exercising nested
> virtualization.

Syzkaller always ruins the fun :-(

> Fixes: 45405155d876 ("KVM: x86: WARN if a vCPU gets a valid wakeup that KVM can't yet inject")
> Reported-by: syzbot+1522459a74d26b0ac33a@...kaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=1522459a74d26b0ac33a
> Tested-by: syzbot+1522459a74d26b0ac33a@...kaller.appspotmail.com
> Signed-off-by: Alessandro Ratti <alessandro@...5c.net>
> ---
> arch/x86/kvm/x86.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ff8812f3a129..d5cf9a7ff8c5 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11596,7 +11596,15 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
>  	if (is_guest_mode(vcpu)) {
>  		int r = kvm_check_nested_events(vcpu);
>
> -		WARN_ON_ONCE(r == -EBUSY);
> +		/*
> +		 * -EBUSY indicates a nested event is pending but cannot be
> +		 * injected immediately (e.g., event delivery is temporarily
> +		 * blocked). Return to the vCPU run loop to retry guest entry
> +		 * instead of blocking, which would lose the pending event.
> +		 */
> +		if (r == -EBUSY)
> +			return 1;

The code and the comment are both wrong. Returning immediately will incorrectly
leave vcpu->arch.mp_state in a non-RUNNABLE state, and _that_ will put the vCPU
into an infinite loop. The for-loop in vcpu_run() will always see the vCPU as
!running and so will call back into vcpu_block(). vcpu_block() will see the vCPU
as _runnable_ (but still not fully running!) because of the pending (and injected)
event, check nested events again, hit -EBUSY again, and repeat until the VMM kills
the VM.
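
For anyone following along at home, the interplay is roughly the following
(a heavily trimmed sketch of vcpu_run(), not a verbatim copy of
arch/x86/kvm/x86.c, so treat the details as approximate):

	vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;

	for (;;) {
		if (kvm_vcpu_running(vcpu))	/* mp_state RUNNABLE && !halted */
			r = vcpu_enter_guest(vcpu);
		else
			r = vcpu_block(vcpu);	/* '1' == go around again */

		if (r <= 0)			/* '0' == exit to userspace */
			break;
		...
	}

I.e. if vcpu_block() bails with '1' without moving mp_state to RUNNABLE, the
vCPU never becomes "running" and the loop just keeps calling vcpu_block().
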
And returning '0' doesn't block the vCPU, it triggers an exit to userspace. In
most cases, the spurious exit will be KVM_EXIT_UNKNOWN, but it could be something
else entirely if KVM filled vcpu->run->exit_reason but didn't complete the exit
to userspace.

And as above, the pending event isn't lost, it'll still be pending if userspace
invokes KVM_RUN again. Of course, unless userspace stuffs MP_STATE, the infinite
loop will still occur, just with userspace's KVM_RUN loop being the outermost loop
(assuming userspace doesn't simply kill the VM).
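
To be explicit about where that outermost loop lives, a bare-bones VMM run loop
looks something like this (illustrative sketch only, setup and error handling
omitted; vcpu_fd and mmap_size come from the usual KVM_CREATE_VCPU /
KVM_GET_VCPU_MMAP_SIZE dance):

	struct kvm_run *run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
				   MAP_SHARED, vcpu_fd, 0);

	for (;;) {
		ioctl(vcpu_fd, KVM_RUN, 0);

		if (run->exit_reason == KVM_EXIT_UNKNOWN)
			continue;	/* or die; either way, no forward progress */

		/* handle other exit reasons... */
	}
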
I said above that syzkaller ruins the fun because, as noted by the changelog in
the Fixes commit, this scenario _should_ be impossible. And AFAICT, within KVM
itself, that still holds true. I finally found one of syzbot's reproducers that
is straightforward, i.e. doesn't require hitting a timing window with threading.
In that reproducer (see Link below), userspace stuffs MP_STATE and an "injected"
event, thus forcing the vCPU into what is effectively an impossible state.

All of the other reproducers get into HALTED naturally by executing HLT in L2,
and then stuff an injected event. I've never been able to repro those, because
hitting the WARN requires forcing the vCPU to exit to userspace (e.g. with a
signal) just after HLT is executed so that userspace can stuff event state. But
in principle it's the same scenario: userspace stuffs impossible vCPU state.
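
In case it's not obvious what "stuffs impossible vCPU state" means in practice,
it boils down to something along these lines (paraphrased, not an exact copy of
the reproducer; the vector and exact ordering are illustrative):

	struct kvm_mp_state mp = {
		.mp_state = KVM_MP_STATE_HALTED,
	};
	struct kvm_vcpu_events events = {
		.interrupt = {
			.injected = 1,	/* "already injected" IRQ... */
			.nr = 0x20,	/* ...with an arbitrary vector */
		},
	};

	/* L2 is already active at this point (nested state set up separately). */
	ioctl(vcpu_fd, KVM_SET_MP_STATE, &mp);
	ioctl(vcpu_fd, KVM_SET_VCPU_EVENTS, &events);
	ioctl(vcpu_fd, KVM_RUN, 0);

I.e. a HALTED vCPU with an event KVM believes it has already injected, which is
exactly the "impossible" combination the WARN was guarding against.
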
For now, I'm pretty sure the least awful "fix" is to drop the WARN and continue
with waking the vCPU. In all likelihood, the garbage event stuffed by userspace
will generate a failed VM-Entry, which KVM will reflect to L1. So L2 might die,
but L1 should live on, which is more than good enough when userspace is being stupid,
and is about as good as we can do if KVM itself is buggy, i.e. if there's a
legitimate KVM bug that generates impossible state.

I'll post the below as part of a series, as there is at least one cleanup that
can be done on top to consolidate handling of EBUSY, and I'm hopeful that the
spirit of the WARN can be preserved, e.g. by adding/extending WARNs in paths where
KVM (re)injects events.

--
From: Sean Christopherson <seanjc@...gle.com>
Date: Tue, 6 Jan 2026 07:46:38 -0800
Subject: [PATCH] KVM: x86: Ignore -EBUSY when checking nested events from
 vcpu_block()

Ignore -EBUSY when checking nested events after exiting a blocking state
while L2 is active, as exiting to userspace will generate a spurious
userspace exit, usually with KVM_EXIT_UNKNOWN, and likely lead to the VM's
demise. Continuing with the wakeup isn't perfect either, as *something*
has gone sideways if a vCPU is awakened in L2 with an injected event (or
worse, a nested run pending), but continuing on gives the VM a decent
chance of surviving without any major side effects.

As explained in the Fixes commits, it _should_ be impossible for a vCPU to
be put into a blocking state with an already-injected event (exception,
IRQ, or NMI). Unfortunately, userspace can stuff MP_STATE and/or injected
events, and thus put the vCPU into what should be an impossible state.

Don't bother trying to preserve the WARN, e.g. with an anti-syzkaller
Kconfig, as WARNs can (hopefully) be added in paths where _KVM_ would be
violating x86 architecture, e.g. by WARNing if KVM attempts to inject an
exception or interrupt while the vCPU isn't running.

Cc: Alessandro Ratti <alessandro@...5c.net>
Cc: stable@...r.kernel.org
Fixes: 26844fee6ade ("KVM: x86: never write to memory from kvm_vcpu_check_block()")
Fixes: 45405155d876 ("KVM: x86: WARN if a vCPU gets a valid wakeup that KVM can't yet inject")
Link: https://syzkaller.appspot.com/text?tag=ReproC&x=10d4261a580000
Reported-by: syzbot+1522459a74d26b0ac33a@...kaller.appspotmail.com
Closes: https://lore.kernel.org/all/671bc7a7.050a0220.455e8.022a.GAE@google.com
Signed-off-by: Sean Christopherson <seanjc@...gle.com>
---
arch/x86/kvm/x86.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ff8812f3a129..4bf9be1e17a7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11596,8 +11596,7 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
 	if (is_guest_mode(vcpu)) {
 		int r = kvm_check_nested_events(vcpu);

-		WARN_ON_ONCE(r == -EBUSY);
-		if (r < 0)
+		if (r < 0 && r != -EBUSY)
 			return 0;
 	}

base-commit: 9448598b22c50c8a5bb77a9103e2d49f134c9578
--