Message-ID: <20251107034802.39763-2-fuqiang.wng@gmail.com>
Date: Fri,  7 Nov 2025 11:48:00 +0800
From: fuqiang wang <fuqiang.wng@...il.com>
To: Sean Christopherson <seanjc@...gle.com>,
	Paolo Bonzini <pbonzini@...hat.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>,
	Borislav Petkov <bp@...en8.de>,
	Dave Hansen <dave.hansen@...ux.intel.com>,
	x86@...nel.org,
	Marcelo Tosatti <mtosatti@...hat.com>,
	"H . Peter Anvin" <hpa@...or.com>,
	Maxim Levitsky <mlevitsk@...hat.com>,
	kvm@...r.kernel.org,
	linux-kernel@...r.kernel.org
Cc: fuqiang wang <fuqiang.wng@...il.com>,
	yu chen <33988979@....com>,
	dongxu zhang <xu910121@...a.com>
Subject: [PATCH v5 1/1] KVM: x86: Fix VM hard lockup after prolonged suspend with periodic HV timer

When a VM is suspended while using the periodic HV timer, the KVM timer
also ceases to advance. After the VM resumes from a prolonged suspend,
there will be a huge gap between target_expiration and the current time.
Because target_expiration is incremented by only one period on each KVM
timer expiration, this leads to a series of KVM timer expirations occurring
rapidly after the VM resumes.
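
To give a sense of scale, the number of back-to-back expirations can be
estimated with a small sketch.  The helper name catchup_expirations() and
the plain nanosecond integers are assumptions made for illustration; they
are not part of KVM.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not KVM code): how many times target_expiration
 * must be advanced by one period before it is later than "now".
 */
static uint64_t catchup_expirations(int64_t target_ns, int64_t now_ns,
				    int64_t period_ns)
{
	assert(period_ns > 0);

	/* Target already in the future: nothing to catch up. */
	if (target_ns > now_ns)
		return 0;

	/* Smallest k such that target_ns + k * period_ns > now_ns. */
	return (uint64_t)((now_ns - target_ns) / period_ns) + 1;
}
```

Under these assumptions, a 1 ms periodic timer whose target stops advancing
for one hour needs about 3.6 million expirations to catch up.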

More critically, when the VM first triggers a periodic HV timer expiration
after resuming, advance_periodic_target_expiration() advances
target_expiration by one period, but the result is still earlier than the
current time (now).  As a result, delta may be calculated as a negative
value.  nsec_to_cycles() then converts this delta into an absolute value
larger than guest_l1_tsc, resulting in a negative tscdeadline.  Since the
hv timer supports a maximum bit width of cpu_preemption_timer_multi + 32,
this causes the hv timer setup to fail and KVM to switch to the sw timer.
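
The wraparound can be reproduced in isolation.  The sketch below is a toy
model, not the KVM implementation: it assumes one guest TSC cycle per
nanosecond, uses a made-up multiplier value, and all toy_* names are
hypothetical.  The width check is only loosely modeled on the bit-width
limit described above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value for illustration only; the real one comes from hardware. */
#define TOY_PREEMPTION_TIMER_MULTI 5

/* Toy conversion: assumes 1 guest TSC cycle per nanosecond. */
static uint64_t toy_nsec_to_cycles(uint64_t nsec)
{
	return nsec;
}

/* Pre-patch style arithmetic: a signed delta is converted as unsigned. */
static uint64_t toy_tscdeadline(uint64_t l1_tsc, int64_t delta_ns)
{
	return l1_tsc + toy_nsec_to_cycles((uint64_t)delta_ns);
}

/* Returns 1 if the distance to the deadline fits the assumed timer width. */
static int toy_fits_hv_timer(uint64_t l1_tsc, uint64_t tscdeadline)
{
	uint64_t later = tscdeadline > l1_tsc ? tscdeadline : l1_tsc;
	uint64_t delta_tsc = later - l1_tsc;

	return (delta_tsc >> (TOY_PREEMPTION_TIMER_MULTI + 32)) == 0;
}
```

In this model, a delta of -5 seconds combined with an L1 TSC value of 10^9
cycles yields a deadline with the top bit set, which does not fit the
assumed timer width, while a small positive delta does.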

After switching to the software timer, periodic timer expiration callbacks
may be executed consecutively within a single clock interrupt handler, with
interrupts disabled until target_expiration is advanced to now. If this
situation persists for an extended period, it could result in a hard
lockup.

Here is a stack trace from a Windows VM that encountered a hard lockup
after resuming from a long suspend.

  NMI watchdog: Watchdog detected hard LOCKUP on cpu 45
  ...
  RIP: 0010:advance_periodic_target_expiration+0x4d/0x80 [kvm]
  ...
  RSP: 0018:ff4f88f5d98d8ef0 EFLAGS: 00000046
  RAX: fff0103f91be678e RBX: fff0103f91be678e RCX: 00843a7d9e127bcc
  RDX: 0000000000000002 RSI: 0052ca4003697505 RDI: ff440d5bfbdbd500
  RBP: ff440d5956f99200 R08: ff2ff2a42deb6a84 R09: 000000000002a6c0
  R10: 0122d794016332b3 R11: 0000000000000000 R12: ff440db1af39cfc0
  R13: ff440db1af39cfc0 R14: ffffffffc0d4a560 R15: ff440db1af39d0f8
  FS:  00007f04a6ffd700(0000) GS:ff440db1af380000(0000) knlGS:000000e38a3b8000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000000d5651feff8 CR3: 000000684e038002 CR4: 0000000000773ee0
  PKRU: 55555554
  Call Trace:
   <IRQ>
   apic_timer_fn+0x31/0x50 [kvm]
   __hrtimer_run_queues+0x100/0x280
   hrtimer_interrupt+0x100/0x210
   ? ttwu_do_wakeup+0x19/0x160
   smp_apic_timer_interrupt+0x6a/0x130
   apic_timer_interrupt+0xf/0x20
   </IRQ>

Moreover, even if the suspend is not long enough to trigger a hard lockup
in this scenario, there is a second problem.  Because of commit
98c25ead5eda ("KVM: VMX: Move preemption timer <=> hrtimer dance to common
x86"), a guest that was using the sw timer before blocking continues to use
the sw timer after being woken up, and does not switch back to the hv timer
until the relevant APIC timer register is reprogrammed.  Since the periodic
timer does not require frequent APIC timer register programming, the guest
may continue to use the software timer for an extended period.

Fix this as follows: when handling a KVM periodic timer expiration, if the
advanced target_expiration is still earlier than now, set
target_expiration directly to now (just like update_target_expiration()
handles the remaining time).
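
The clamping rule can be sketched outside the kernel as follows.  This is a
minimal illustration using plain nanosecond integers; toy_advance_target()
is a hypothetical name, and the actual change is the diff below.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the patched logic: advance the target by one
 * period, then clamp it to "now" if it is still in the past.
 */
static int64_t toy_advance_target(int64_t target_ns, int64_t now_ns,
				  int64_t period_ns)
{
	assert(period_ns >= 0);

	target_ns += period_ns;

	/* After a long stall, skip the missed periods in a single step. */
	if (period_ns > 0 && target_ns < now_ns)
		target_ns = now_ns;

	return target_ns;
}
```

With the clamp in place, the difference between the returned target and now
is never negative for a non-zero period, so a single expiration is enough to
catch up after a long stall.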

Fixes: d8f2f498d9ed ("x86/kvm: fix LAPIC timer drift when guest uses periodic mode")
Signed-off-by: fuqiang wang <fuqiang.wng@...il.com>
---
 arch/x86/kvm/lapic.c | 32 ++++++++++++++++++++++++--------
 1 file changed, 24 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 0ae7f913d782..bc082271c81c 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2131,18 +2131,34 @@ static void advance_periodic_target_expiration(struct kvm_lapic *apic)
 	ktime_t delta;
 
 	/*
-	 * Synchronize both deadlines to the same time source or
-	 * differences in the periods (caused by differences in the
-	 * underlying clocks or numerical approximation errors) will
-	 * cause the two to drift apart over time as the errors
-	 * accumulate.
+	 * Use kernel time as the time source for both deadlines so that they
+	 * stay synchronized.  Computing each deadline independently will cause
+	 * the two deadlines to drift apart over time as differences in the
+	 * periods accumulate, e.g. due to differences in the underlying clocks
+	 * or numerical approximation errors.
 	 */
 	apic->lapic_timer.target_expiration =
 		ktime_add_ns(apic->lapic_timer.target_expiration,
 				apic->lapic_timer.period);
+
+	/*
+	 * When the VM is suspended, the hv timer also stops advancing. After it
+	 * is resumed, this may result in a large delta. If the
+	 * target_expiration only advances by one period each time, it will
+	 * cause KVM to frequently handle timer expirations.
+	 */
+	if (apic->lapic_timer.period > 0 &&
+	    ktime_before(apic->lapic_timer.target_expiration, now))
+		apic->lapic_timer.target_expiration = now;
+
 	delta = ktime_sub(apic->lapic_timer.target_expiration, now);
-	apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) +
-		nsec_to_cycles(apic->vcpu, delta);
+	apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl);
+	/*
+	 * Note: delta must not be negative. Otherwise, blindly adding a
+	 * negative delta could cause the deadline to become excessively large
+	 * due to the tscdeadline being an unsigned value.
+	 */
+	apic->lapic_timer.tscdeadline += nsec_to_cycles(apic->vcpu, delta);
 }
 
 static void start_sw_period(struct kvm_lapic *apic)
@@ -2972,7 +2988,7 @@ static enum hrtimer_restart apic_timer_fn(struct hrtimer *data)
 
 	if (lapic_is_periodic(apic)) {
 		advance_periodic_target_expiration(apic);
-		hrtimer_add_expires_ns(&ktimer->timer, ktimer->period);
+		hrtimer_set_expires(&ktimer->timer, ktimer->target_expiration);
 		return HRTIMER_RESTART;
 	} else
 		return HRTIMER_NORESTART;
-- 
2.47.0

