Message-ID: <20251107034802.39763-1-fuqiang.wng@gmail.com>
Date: Fri,  7 Nov 2025 11:47:59 +0800
From: fuqiang wang <fuqiang.wng@...il.com>
To: Sean Christopherson <seanjc@...gle.com>,
	Paolo Bonzini <pbonzini@...hat.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>,
	Borislav Petkov <bp@...en8.de>,
	Dave Hansen <dave.hansen@...ux.intel.com>,
	x86@...nel.org,
	Marcelo Tosatti <mtosatti@...hat.com>,
	"H . Peter Anvin" <hpa@...or.com>,
	Maxim Levitsky <mlevitsk@...hat.com>,
	kvm@...r.kernel.org,
	linux-kernel@...r.kernel.org
Cc: fuqiang wang <fuqiang.wng@...il.com>,
	yu chen <33988979@....com>,
	dongxu zhang <xu910121@...a.com>
Subject: [PATCH v5 0/1] KVM: x86: fix some kvm period timer BUG

This patch fixes two issues with the periodic timer:

====================================================================
issue 1: avoid hv timer fallback to sw timer if delay exceeds period 
====================================================================

When the guest uses the APIC periodic timer, if the next period has already
expired, e.g. due to the period being smaller than the delay in processing
the timer, the delta will be negative. nsec_to_cycles() may then convert
this delta into an absolute value larger than guest_l1_tsc, resulting in a
negative tscdeadline. Since the hv timer supports a maximum bit width of
cpu_preemption_timer_multi + 32, this causes the hv timer setup to fail and
switch to the sw timer.
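
To make the arithmetic concrete, here is a minimal userspace sketch of this
failure mode. It is not the kernel code: sketch_scale_delta() is only loosely
modeled on the fixed-point scaling behind nsec_to_cycles(), and the shift and
mul_frac values (2 cycles per ns), the timer multiplier of 5 used in the
example, and all sketch_* names are assumptions chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for nsec_to_cycles(): shift the unsigned delta,
 * then multiply by a 32-bit fraction and drop the low 32 bits.
 * ASSUMPTION: shift = 2 and mul_frac = 0.5, i.e. 2 cycles per ns.
 */
static uint64_t sketch_scale_delta(uint64_t delta_ns)
{
	const uint32_t mul_frac = 0x80000000u;	/* 0.5 in 0.32 fixed point */
	const int shift = 2;
	uint64_t delta = delta_ns << shift;
	uint64_t hi = delta >> 32;
	uint64_t lo = delta & 0xffffffffu;

	/* (delta * mul_frac) >> 32 without needing a 128-bit type. */
	return hi * mul_frac + ((lo * mul_frac) >> 32);
}

/*
 * Hypothetical stand-in for the hv timer range check: the TSC delta must
 * fit in (timer_multi + 32) bits.
 */
static bool sketch_hv_timer_fits(uint64_t delta_tsc, unsigned int timer_multi)
{
	return (delta_tsc >> (timer_multi + 32)) == 0;
}

/* The signed ns delta is reinterpreted as unsigned before scaling. */
static uint64_t sketch_tscdeadline(uint64_t guest_l1_tsc, int64_t delta_ns)
{
	return guest_l1_tsc + sketch_scale_delta((uint64_t)delta_ns);
}
```

With these assumed parameters, a delta of +1000 ns scales to 2000 cycles and
passes the range check. A delta of -1000 ns scales to 0x7ffffffffffff830
cycles, which does not fit in 37 bits, so the range check fails.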

Moreover, due to commit 98c25ead5eda ("KVM: VMX: Move preemption timer
<=> hrtimer dance to common x86"), if the guest is using the sw timer
before blocking, it will continue to use the sw timer after being woken up,
and will not switch back to the hv timer until the relevant APIC timer
register is reprogrammed.  Since the periodic timer does not require
frequent APIC timer register programming, the guest may continue to use the
software timer for an extended period.

Link [1] reproduces this issue by injecting a kernel module. This module
creates a periodic hrtimer and adds a certain delay in its callback, making
the delay longer than the KVM periodic timer period.

======================================================================
issue 2: VM hard lockup after prolonged suspend with periodic HV timer
======================================================================

Resuming a virtual machine after it has been suspended for a long time may
trigger a hard lockup. 

The main reason is that the KVM periodic HV timer only advances on the
“VMX-preemption timer expired” VM-exit. When the vCPU is suspended or
returns to user space for other reasons, the KVM timer stops advancing.
Since the periodic timer expiration callback advances the timer by one
period per invocation, the callback must then run many times before the
expiration catches up with the current time.

Due to issue 1, the KVM periodic HV timer will switch to the software
timer, and these catch-up invocations will all be executed within a single
clock interrupt. If this process lasts long enough, it can easily lead to a
hard lockup.
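
The cost of this catch-up can be estimated with a small userspace sketch. It
is an illustration only: sketch_count_catch_up() and its parameters are
hypothetical, and the real callback is re-run by the hrtimer core rather than
by a plain loop. The function counts how many one-period advances are needed
before the expiration passes the current time.

```c
#include <stdint.h>

/*
 * Count how many times a callback that advances the expiration by exactly
 * one period must run before the expiration is later than "now".
 */
static uint64_t sketch_count_catch_up(int64_t target_ns, int64_t now_ns,
				      int64_t period_ns)
{
	uint64_t calls = 0;

	while (target_ns <= now_ns) {
		target_ns += period_ns;
		calls++;
	}
	return calls;
}
```

For example, with an assumed 1 ms period and an expiration that is one hour
behind, the loop body runs about 3.6 million times back to back.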

One of our Windows virtual machines in the production environment triggered
this case:
  NMI watchdog: Watchdog detected hard LOCKUP on cpu 45
  ...
  RIP: 0010:advance_periodic_target_expiration+0x4d/0x80 [kvm]
  ...
  RSP: 0018:ff4f88f5d98d8ef0 EFLAGS: 00000046
  RAX: fff0103f91be678e RBX: fff0103f91be678e RCX: 00843a7d9e127bcc
  RDX: 0000000000000002 RSI: 0052ca4003697505 RDI: ff440d5bfbdbd500
  RBP: ff440d5956f99200 R08: ff2ff2a42deb6a84 R09: 000000000002a6c0
  R10: 0122d794016332b3 R11: 0000000000000000 R12: ff440db1af39cfc0
  R13: ff440db1af39cfc0 R14: ffffffffc0d4a560 R15: ff440db1af39d0f8
  FS:  00007f04a6ffd700(0000) GS:ff440db1af380000(0000) knlGS:000000e38a3b8000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000000d5651feff8 CR3: 000000684e038002 CR4: 0000000000773ee0
  PKRU: 55555554
  Call Trace:
   <IRQ>
   apic_timer_fn+0x31/0x50 [kvm]
   __hrtimer_run_queues+0x100/0x280
   hrtimer_interrupt+0x100/0x210
   ? ttwu_do_wakeup+0x19/0x160
   smp_apic_timer_interrupt+0x6a/0x130
   apic_timer_interrupt+0xf/0x20
   </IRQ>

Marcelo also reported this issue in link [2], but I don't think the case
described there can reproduce it. Because of commit [3], as long as the KVM
timer is running, target_expiration will keep catching up to now (unless
every single delay from timer virtualization is longer than the period,
which is a pretty extreme case). This patch is based on the patch in link
[2], but with some differences: in link [2], target_expiration is updated to
"now - period" (I'm not sure why it doesn't just catch up to now; maybe I'm
missing something?). In this patch, I set target_expiration to catch up to
now, just like how update_target_expiration() handles the remaining time.
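
The difference between the two approaches can be sketched as follows. This is
a userspace illustration of one reading of the description above, not the
code from either patch. The function name, the floor_ns parameter, and the
order of operations (advance first, then clamp) are assumptions.

```c
#include <stdint.h>

/*
 * Advance the expiration by one period, then clamp it so that it is never
 * earlier than floor_ns. Passing floor_ns = now models "catch up to now";
 * passing floor_ns = now - period models the "now - period" variant.
 */
static int64_t sketch_advance_clamped(int64_t target_ns, int64_t period_ns,
				      int64_t floor_ns)
{
	target_ns += period_ns;
	if (target_ns < floor_ns)
		target_ns = floor_ns;
	return target_ns;
}
```

In this sketch, a single call brings a far-behind expiration to within one
period of now with either floor, so the catch-up work no longer grows with
how long the timer was stopped.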

Link [4] provides details of the hard lockup, as well as how to reproduce
the KVM timer stopping by pausing the virtual machine.

=================================
Fix both issues in a single patch
=================================

In v2 and v3, I split the fixes for these two issues into two separate
patches. However, this caused patch 2 to revert some of the changes made by
patch 1.

In v4, I attempted to merge the two patches into one and tried to
describe both issues in the commit message, but I did not do it well. In
this version, I have included more details in the commit message and the
cover letter.

Changes in v5:
- Add more details to the commit message and cover letter.
- link to v4: https://lore.kernel.org/all/20251105135340.33335-1-fuqiang.wng@gmail.com/

Changes in v4:
- Merge the two patches into one.
- link to v3: https://lore.kernel.org/all/20251022150055.2531-1-fuqiang.wng@gmail.com/

Changes in v3:
- Fix: advanced SW timer (hrtimer) expiration does not catch up to current
  time.
- Improve the commit message of patch 2.
- link to v2: https://lore.kernel.org/all/20251021154052.17132-1-fuqiang.wng@gmail.com/

Changes in v2:
- Add a bugfix for the hard lockup.
- link to v1: https://lore.kernel.org/all/20251013125117.87739-1-fuqiang.wng@gmail.com/

[1]: https://github.com/cai-fuqiang/kernel_test/tree/master/period_timer_test
[2]: https://lore.kernel.org/kvm/YgahsSubOgFtyorl@fuller.cnet/
[3]: commit d8f2f498d9ed ("x86/kvm: fix LAPIC timer drift when guest uses periodic mode")
[4]: https://github.com/cai-fuqiang/md/tree/master/case/intel_kvm_period_timer

fuqiang wang (1):
  KVM: x86: Fix VM hard lockup after prolonged suspend with periodic HV
    timer

 arch/x86/kvm/lapic.c | 32 ++++++++++++++++++++++++--------
 1 file changed, 24 insertions(+), 8 deletions(-)

-- 
2.47.0

