Message-ID: <20251022150055.2531-3-fuqiang.wng@gmail.com>
Date: Wed, 22 Oct 2025 23:00:55 +0800
From: fuqiang wang <fuqiang.wng@...il.com>
To: Sean Christopherson <seanjc@...gle.com>,
Paolo Bonzini <pbonzini@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>,
Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>,
x86@...nel.org,
"H . Peter Anvin" <hpa@...or.com>,
Maxim Levitsky <mlevitsk@...hat.com>,
kvm@...r.kernel.org,
linux-kernel@...r.kernel.org
Cc: fuqiang wang <fuqiang.wng@...il.com>,
yu chen <33988979@....com>,
dongxu zhang <xu910121@...a.com>
Subject: [PATCH v3 2/2] fix hardlockup when waking VM after long suspend
When a virtual machine uses the HV timer during suspend, the KVM timer does
not advance. Upon waking after a long period, there may be a significant
gap between target_expiration and the current time. Since each timer
expiration only advances target_expiration by one period, the expiration
handler can be invoked repeatedly to catch up.
Prior to the previous patch, if the advanced target_expiration remained
less than the current time, tscdeadline could be set to a negative value.
This would cause HV timer setup to fail and fall back to the SW timer. After
switching to the SW timer, apic_timer_fn could be executed repeatedly within
a single clock interrupt handler, resulting in a hard lockup:
NMI watchdog: Watchdog detected hard LOCKUP on cpu 45
...
RIP: 0010:advance_periodic_target_expiration+0x4d/0x80 [kvm]
...
RSP: 0018:ff4f88f5d98d8ef0 EFLAGS: 00000046
RAX: fff0103f91be678e RBX: fff0103f91be678e RCX: 00843a7d9e127bcc
RDX: 0000000000000002 RSI: 0052ca4003697505 RDI: ff440d5bfbdbd500
RBP: ff440d5956f99200 R08: ff2ff2a42deb6a84 R09: 000000000002a6c0
R10: 0122d794016332b3 R11: 0000000000000000 R12: ff440db1af39cfc0
R13: ff440db1af39cfc0 R14: ffffffffc0d4a560 R15: ff440db1af39d0f8
FS: 00007f04a6ffd700(0000) GS:ff440db1af380000(0000) knlGS:000000e38a3b8000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000d5651feff8 CR3: 000000684e038002 CR4: 0000000000773ee0
PKRU: 55555554
Call Trace:
<IRQ>
apic_timer_fn+0x31/0x50 [kvm]
__hrtimer_run_queues+0x100/0x280
hrtimer_interrupt+0x100/0x210
? ttwu_do_wakeup+0x19/0x160
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
With the previous patch applied, the HV timer no longer falls back to the
SW timer. Additionally, while target_expiration is catching up to the
current time, the VMX-preemption timer is set to 0 before each VM entry.
According to Intel SDM 27.7.4 ("VMX-Preemption Timer"), if the timer has
already expired at VM entry, a VM exit occurs before any guest instruction
executes. As a result, the guest cannot run instructions during this period
and cannot reach vcpu_block() to switch to the SW timer, which prevents the
hard lockup.
However, unnecessary repeated catch-ups should still be avoided. Therefore,
if the advanced target_expiration is still less than the current time, set
target_expiration to the current time in the handler so that it catches up
immediately.
Signed-off-by: fuqiang wang <fuqiang.wng@...il.com>
---
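The catch-up behaviour described above can be illustrated with a small
userspace sketch. It is not kernel code: struct sim_timer and both helper
functions are hypothetical names made up for this note, and time is modelled
as plain nanosecond counters rather than ktime_t. The first helper advances
one period per expiration, the second advances once and then clamps to the
current time.

```c
#include <stdint.h>

/* Hypothetical stand-in for the periodic timer state (not struct kvm_timer). */
struct sim_timer {
	int64_t target_expiration;	/* ns */
	int64_t period;			/* ns, assumed > 0 */
};

/*
 * One period per expiration: returns how many handler invocations are
 * needed before target_expiration is no longer before 'now'.
 */
static uint64_t catch_up_by_period(struct sim_timer *t, int64_t now)
{
	uint64_t calls = 0;

	do {
		t->target_expiration += t->period;
		calls++;
	} while (t->target_expiration < now);

	return calls;
}

/*
 * Clamped variant: advance one period, then jump straight to 'now' if the
 * result is still in the past. A single invocation is enough.
 */
static uint64_t catch_up_clamped(struct sim_timer *t, int64_t now)
{
	t->target_expiration += t->period;
	if (t->period > 0 && t->target_expiration < now)
		t->target_expiration = now;

	return 1;
}
```

With a 1 ms period and a one-hour gap, the first helper reports about 3.6
million invocations, while the second always reports one.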
arch/x86/kvm/lapic.c | 24 ++++++++++++++++--------
1 file changed, 16 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index fa07a303767c..307e2d6c3450 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2140,17 +2140,25 @@ static void advance_periodic_target_expiration(struct kvm_lapic *apic)
apic->lapic_timer.target_expiration =
ktime_add_ns(apic->lapic_timer.target_expiration,
apic->lapic_timer.period);
- delta = ktime_sub(apic->lapic_timer.target_expiration, now);
/*
- * Don't adjust the tscdeadline if the next period has already expired,
- * e.g. due to software overhead resulting in delays larger than the
- * period. Blindly adding a negative delta could cause the deadline to
- * become excessively large due to the deadline being an unsigned value.
+ * When the VM is suspended, the HV timer also stops advancing. After
+ * it is resumed, this may result in a large delta. If
+ * target_expiration only advances by one period each time, KVM will
+ * handle timer expirations repeatedly until it catches up.
*/
+ if (apic->lapic_timer.period > 0 &&
+ ktime_before(apic->lapic_timer.target_expiration, now))
+ apic->lapic_timer.target_expiration = now;
+
+ delta = ktime_sub(apic->lapic_timer.target_expiration, now);
apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl);
- if (delta > 0)
- apic->lapic_timer.tscdeadline += nsec_to_cycles(apic->vcpu, delta);
+ /*
+ * Note: delta must not be negative. Otherwise, blindly adding a
+ * negative delta could cause the deadline to become excessively large
+ * due to the deadline being an unsigned value.
+ */
+ apic->lapic_timer.tscdeadline += nsec_to_cycles(apic->vcpu, delta);
}
static void start_sw_period(struct kvm_lapic *apic)
@@ -2980,7 +2988,7 @@ static enum hrtimer_restart apic_timer_fn(struct hrtimer *data)
if (lapic_is_periodic(apic)) {
advance_periodic_target_expiration(apic);
- hrtimer_add_expires_ns(&ktimer->timer, ktimer->period);
+ hrtimer_set_expires(&ktimer->timer, ktimer->target_expiration);
return HRTIMER_RESTART;
} else
return HRTIMER_NORESTART;
--
2.47.0