lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240708055559.rl4w5xfhj3uru6j2@yy-desk-7060>
Date: Mon, 8 Jul 2024 13:55:59 +0800
From: Yuan Yao <yuan.yao@...ux.intel.com>
To: Reinette Chatre <reinette.chatre@...el.com>
Cc: Sean Christopherson <seanjc@...gle.com>, isaku.yamahata@...el.com,
	pbonzini@...hat.com, erdemaktas@...gle.com, vkuznets@...hat.com,
	vannapurve@...gle.com, jmattson@...gle.com, mlevitsk@...hat.com,
	xiaoyao.li@...el.com, chao.gao@...el.com,
	rick.p.edgecombe@...el.com, yuan.yao@...el.com, kvm@...r.kernel.org,
	linux-kernel@...r.kernel.org, "Hao, Xudong" <xudong.hao@...el.com>
Subject: Re: VMX Preemption Timer appears to be buggy on SKX, CLX, and ICX

On Wed, Jul 03, 2024 at 01:14:09PM -0700, Reinette Chatre wrote:
> Hi Sean,
>
> On 6/28/24 5:39 PM, Sean Christopherson wrote:
> > Forking this off to try and avoid confusion...
> >
> > On Wed, Jun 12, 2024, Reinette Chatre wrote:
>
> ...
>
> > > +
> > > +		freq = (tmict - tmcct) * tdcrs[i].divide_count * tsc_hz / (tsc1 - tsc0);
> > > +		/* Check if measured frequency is within 1% of configured frequency. */
> > > +		GUEST_ASSERT(freq < apic_hz * 101 / 100);
> > > +		GUEST_ASSERT(freq > apic_hz * 99 / 100);
> > > +	}
> >
> > This test fails on our SKX, CLX, and ICX systems due to what appears to be a CPU
> > bug.  It looks like something APICv related is clobbering internal VMX timer state?
> > Or maybe there's a tearing or truncation issue?
>
> It has been a few days. Just a note to let you know that we are investigating this.
> On my side I have not yet been able to reproduce this issue. I tested
> kvm-x86-next-2024.06.28 on an ICX and an CLX system by running 100 iterations of
> apic_bus_clock_test and they all passed. Since I have lack of experience here there are
> some Intel virtualization experts helping out with this investigation and I hope that
> they will be some insights from the analysis and testing that you already provided.
>
> Reinette
>

I can reproduce this on my side, even with apicv disabled by
'insmod $kernel_path/arch/x86/kvm/kvm-intel.ko enable_apicv=N'.
@Sean I think we're observing same issue, please see the details
below:

apic_bus_clock_test can't reproduce this may because the
preemption timer value calculation relies on apic bus
frequency/TMICT/Divide count/host TSC frequency and ratio
between preemption timer and host TSC frequency, too many
factors to generate the 'magic' value there. So I changed
KVM and added a small KVM kselftest tool to set the
preemption timer value directly from guest, this makes the
reproducing easily. The changes are attached at end of this
comment.

The trace I captured below came form host with 1.7GHz TSC,
the VM_EXIT_SAVE_VMX_PREEMPTION_TIMER is enabled to get the
cpu saved vmcs.VMX_PREEMPTION_TIMER_VALUE after VMEXIT. I
set the vmcs.VMX_PREEMPTION_TIMER_VALUE to 0x880042ad which
is the 'magic' number on this 1.7Ghz TSC machine:

preempt_test  20677.199521: kvm:kvm_vmx_debug: kvm_vmx_debug: 3, a0:0x77fd5554 a1:0x880042ad a2:0x880042ad a3:0x20462e98d9b9
  a0: The previous vmcs.VMX_PREEMPTION_TIMER_VALUE value
      saved by CPU when VMEXIT.
  a1: The new preemption timer value wrote to
      vmcs.VMX_PREEMPTION_TIMER_VALUE.
  a2: The value read back from
      vmcs.VMX_PREEMPTION_TIMER_VALUE, for double confirmation.
  a3: The host tsc at the time point, debug only.

preempt_test  20677.199579:      kvm:kvm_exit: reason PREEMPTION_TIMER rip 0x40274d info 0 0 intr 0

preempt_test  20677.199579: kvm:kvm_vmx_debug: kvm_vmx_debug: 2, a0:0x34 a1:0x0 a2:0x87fea9b0 a3:0x20462e9a5749
  a0: The VMEXIT reason, 0x34 is preemption timer VMEXIT.
  a1: The read back vmcs.VMX_PREEMPTION_TIMER_VALUE value, here 0.
  a2: The next preemption timer value should be written to
      vmcs, calculates from the (target tsc - current tsc) >>
      7. Now the preemption timer vmexit happend only after
      ~58 microseconds elapsed, it should happen after
      ~171.79 seconds but not such soon, the issue is
      reproduced.

Another more easy way to observe this symptom w/o care the
'magic' preemption timer vlaue is use the maximum preemption
timer value 0xffffffff, below log w/ 0xffffffff is captured
from same machine:

preempt_test 20530.456589: kvm:kvm_vmx_debug: kvm_vmx_debug: 3, a0:0x77fd5551 a1:0xffffffff a2:0xffffffff a3:0x200c1971ca5d

  a0: The previous vmcs.VMX_PREEMPTION_TIMER_VALUE value
      saved by CPU when VMEXIT.
  a1: The new preemption timer value wrote to
      vmcs.VMX_PREEMPTION_TIMER_VALUE.
  a2: The read back value from
      vmcs.VMX_PREEMPTION_TIMER_VALUE, double confirmation.
  a3: the host tsc at the time point, debug only.

preempt_test 20530.456690:      kvm:kvm_exit: reason VMCALL rip 0x4131a0 info 0 0 intr 0
  The preempt_test checks preemption timer state every
  100us, this VMEXIT is expected behavior.

preempt_test 20530.456691:     kvm:kvm_entry: vcpu 0, rip 0x4131a3

preempt_test 20530.456691: kvm:kvm_vmx_debug: kvm_vmx_debug: 3, a0:0x77ff82cc a1:0xfffe900b a2:0xfffe900b a3:0x200c19746d45
  a0: The previous vmcs.VMX_PREEMPTION_TIMER_VALUE value
      saved by CPU when VMEXIT. The difference value
      shouldn't be such huge number 0xffffffff - 0x77ff82cc
      = 0x88007D33 when just ~100us elapsed from previous
      VMENTRY, the issue is reproduced.

Use 0x88000000 as preemption timer value to verify this
preempt_test tool, the preemption timer VMEXIT happend after
~171.45 seconds which is expected behavior on the host
1.7Ghz TSC system:

The preemption timer VMEXIT should happen after:
2281701376 × 128 / 1700000000 = 171.79 seconds.

Attached my changes in KVM and kselftest tool for
reproducing here, based on:
https://github.com/kvm-x86/linux.git
tag:kvm-x86-next-2024.06.28

Patch 01:

>From a977bf12a8cd1bbe401e68d3702c0b5aa3bf66e4 Mon Sep 17 00:00:00 2001
From: Yao Yuan <yuan.yao@...el.com>
Date: Thu, 4 Jul 2024 09:59:55 +0800
Subject: [PATCH 1/2] KVM: x86: Introudce trace_kvm_vmx_debug()

debug only common trace.

Signed-off-by: Yao Yuan <yuan.yao@...el.com>
---
 arch/x86/kvm/trace.h | 28 ++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c   |  1 +
 2 files changed, 29 insertions(+)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index d3aeffd6ae75..7b9eb23d71d3 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -34,6 +34,34 @@ TRACE_EVENT(kvm_entry,
 		  __entry->immediate_exit ? "[immediate exit]" : "")
 );

+TRACE_EVENT(kvm_vmx_debug,
+	    TP_PROTO(unsigned long n, unsigned long a0,
+		     unsigned long a1,
+		     unsigned long a2,
+		     unsigned long a3),
+	    TP_ARGS(n, a0, a1, a2, a3),
+
+	TP_STRUCT__entry(
+		 __field(	unsigned long,	n		)
+		__field(	unsigned long,	a0		)
+		__field(	unsigned long,	a1		)
+		__field(	unsigned long,	a2		)
+		__field(	unsigned long,	a3		)
+	),
+
+	TP_fast_assign(
+		__entry->n      = n;
+		__entry->a0		= a0;
+		__entry->a1		= a1;
+		__entry->a2		= a2;
+		__entry->a3		= a3;
+	),
+
+	TP_printk("kvm_vmx_debug: %ld, a0:0x%lx a1:0x%lx a2:0x%lx a3:0x%lx",
+		  __entry->n, __entry->a0, __entry->a1, __entry->a2, __entry->a3)
+);
+
+
 /*
  * Tracepoint for hypercall.
  */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 994743266480..6d1972d6c988 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -14036,6 +14036,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_enter);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_rmp_fault);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmx_debug);

 static int __init kvm_x86_init(void)
 {
--
2.27.0

Patch 02:

>From a3bdce4f2f93810372f6068396776ac702946d16 Mon Sep 17 00:00:00 2001
From: Yao Yuan <yuan.yao@...el.com>
Date: Wed, 3 Jul 2024 14:08:02 +0800
Subject: [PATCH 2/2] [DEBUG] preempt timer debug test

A specific kselftesting based program to allow set the VMX
preempt timer value from VM directly.

Introduce 2 hypercall 0x56780001/2, 01 to set the preempt
timer value, 02 to wait for the preemption time expired.

Usage:
Reload kvm applied this change, then:
$KRNEL_SRC_ROOT/tools/testing/selftests/kvm/x86_64/preempt_test -p 'preempt_timer_vale'

'preempt_timer_vale' is the preempt timer value in DEC format, HEX is not supported.

For example:

perf record -e "kvm:*" tools/testing/selftests/kvm/x86_64/preempt_test -p 2281718445

Above set the preempt value to 2281718445(0x880042AD) and
capture the trace, then check the kvm_vmx_debug in the trace
to know the preempt timer behavior.

Signed-off-by: Yao Yuan <yuan.yao@...el.com>
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 arch/x86/kvm/vmx/vmx.h                        |   5 +
 arch/x86/kvm/vmx/vmx.c                        | 113 +++++++++++++++++-
 .../selftests/kvm/x86_64/preempt_test.c       |  82 +++++++++++++
 4 files changed, 198 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/preempt_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index ad8b5d15f2bd..957509957f80 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -129,6 +129,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/amx_test
 TEST_GEN_PROGS_x86_64 += x86_64/max_vcpuid_cap_test
 TEST_GEN_PROGS_x86_64 += x86_64/triple_fault_event_test
 TEST_GEN_PROGS_x86_64 += x86_64/recalc_apic_map_test
+TEST_GEN_PROGS_x86_64 += x86_64/preempt_test
 TEST_GEN_PROGS_x86_64 += access_tracking_perf_test
 TEST_GEN_PROGS_x86_64 += demand_paging_test
 TEST_GEN_PROGS_x86_64 += dirty_log_test
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 42498fa63abb..82ea0ccc7a63 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -368,6 +368,11 @@ struct vcpu_vmx {

 	/* ve_info must be page aligned. */
 	struct vmx_ve_information *ve_info;
+
+	volatile bool debug_timer;
+	bool debug_timer_set_to_hardware;
+	u32 debug_timer_val;
+	u64 debug_timer_deadline_tsc;
 };

 struct kvm_vmx {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f18c2d8c7476..73f084c29f9a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4431,8 +4431,9 @@ static u32 vmx_vmexit_ctrl(void)
 	 * Not used by KVM and never set in vmcs01 or vmcs02, but emulated for
 	 * nested virtualization and thus allowed to be set in vmcs12.
 	 */
-	vmexit_ctrl &= ~(VM_EXIT_SAVE_IA32_PAT | VM_EXIT_SAVE_IA32_EFER |
-			 VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
+	vmexit_ctrl &= ~(VM_EXIT_SAVE_IA32_PAT | VM_EXIT_SAVE_IA32_EFER);
+	pr_info("Set VM_EXIT_SAVE_VMX_PREEMPTION_TIMER forcedly for preempt timer debug\n");
+

 	if (vmx_pt_mode_is_system())
 		vmexit_ctrl &= ~(VM_EXIT_PT_CONCEAL_PIP |
@@ -5993,11 +5994,41 @@ static int handle_pml_full(struct kvm_vcpu *vcpu)
 	return 1;
 }

+static fastpath_t handle_fastpath_debug_timer(struct kvm_vcpu *vcpu,
+					      bool force_immediate_exit)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	u64 tscl;
+	u32 delta;
+
+	tscl = rdtsc();
+
+	if (vmx->debug_timer_deadline_tsc > tscl)
+		delta = (u32)((vmx->debug_timer_deadline_tsc - tscl) >>
+			      cpu_preemption_timer_multi);
+	else
+		delta = 0;
+
+	trace_kvm_vmx_debug(2UL,
+			    (unsigned long)vmcs_read32(VM_EXIT_REASON),
+			    (unsigned long)vmcs_read32(VMX_PREEMPTION_TIMER_VALUE),
+			    (unsigned long)delta, tscl);
+
+	vmx->debug_timer = false;
+
+	return EXIT_FASTPATH_REENTER_GUEST;
+}
+
 static fastpath_t handle_fastpath_preemption_timer(struct kvm_vcpu *vcpu,
 						   bool force_immediate_exit)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);

+	WARN_ON(vmx->debug_timer && force_immediate_exit);
+	if (vmx->debug_timer)
+		return handle_fastpath_debug_timer(vcpu,
+						   force_immediate_exit);
+
 	/*
 	 * In the *extremely* unlikely scenario that this is a spurious VM-Exit
 	 * due to the timer expiring while it was "soft" disabled, just eat the
@@ -6096,6 +6127,60 @@ static int handle_notify(struct kvm_vcpu *vcpu)
 	return 1;
 }

+static unsigned long vmx_debug_set_preempt_timer(struct kvm_vcpu *vcpu,
+						 unsigned long a0,
+						 unsigned long a1,
+						 unsigned long a2,
+						 unsigned long a3)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmx->debug_timer = true;
+	vmx->debug_timer_set_to_hardware = false;
+	vmx->debug_timer_val = a0;
+	vmx->debug_timer_deadline_tsc = rdtsc() + (a0 << cpu_preemption_timer_multi);
+	pr_info("debug_timer = %u\n", (u32)a0);
+
+	return 0;
+}
+
+
+static unsigned long vmx_debug_get_preempt_timer_result(struct kvm_vcpu *vcpu,
+							unsigned long a0,
+							unsigned long a1,
+							unsigned long a2,
+							unsigned long a3)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (vmx->debug_timer)
+		return 1;
+	return 0;
+}
+
+static int vmx_emulate_hypercall(struct kvm_vcpu *vcpu)
+{
+	unsigned long nr, a0, a1, a2, a3;
+	unsigned long ret;
+
+	nr = kvm_rax_read(vcpu);
+	if (nr != 0x87650001 && nr != 0x87650002)
+		return kvm_emulate_hypercall(vcpu);
+
+	a0 = kvm_rbx_read(vcpu);
+	a1 = kvm_rcx_read(vcpu);
+	a2 = kvm_rdx_read(vcpu);
+	a3 = kvm_rsi_read(vcpu);
+
+	if (nr == 0x87650001)
+		ret = vmx_debug_set_preempt_timer(vcpu, a0, a1, a2, a3);
+	else
+		ret = vmx_debug_get_preempt_timer_result(vcpu, a0, a1, a2, a3);
+
+	kvm_rax_write(vcpu, ret);
+	return kvm_skip_emulated_instruction(vcpu);
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -6117,7 +6202,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_INVD]		      = kvm_emulate_invd,
 	[EXIT_REASON_INVLPG]		      = handle_invlpg,
 	[EXIT_REASON_RDPMC]                   = kvm_emulate_rdpmc,
-	[EXIT_REASON_VMCALL]                  = kvm_emulate_hypercall,
+	[EXIT_REASON_VMCALL]                  = vmx_emulate_hypercall,
 	[EXIT_REASON_VMCLEAR]		      = handle_vmx_instruction,
 	[EXIT_REASON_VMLAUNCH]		      = handle_vmx_instruction,
 	[EXIT_REASON_VMPTRLD]		      = handle_vmx_instruction,
@@ -7199,6 +7284,28 @@ static void vmx_update_hv_timer(struct kvm_vcpu *vcpu, bool force_immediate_exit
 	if (force_immediate_exit) {
 		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0);
 		vmx->loaded_vmcs->hv_timer_soft_disabled = false;
+	} else if (vmx->debug_timer) {
+		u32 old;
+
+		tscl = rdtsc();
+
+		if (!vmx->debug_timer_set_to_hardware) {
+			delta_tsc = vmx->debug_timer_val;
+			vmx->debug_timer_set_to_hardware = true;
+		} else {
+			if (vmx->debug_timer_deadline_tsc > tscl)
+				delta_tsc = (u32)((vmx->debug_timer_deadline_tsc - tscl)
+						  >> cpu_preemption_timer_multi);
+			else
+				delta_tsc = 0;
+		}
+
+		old = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE);
+		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, delta_tsc);
+		trace_kvm_vmx_debug(3UL, old,
+				    vmcs_read32(VMX_PREEMPTION_TIMER_VALUE),
+				    delta_tsc, tscl);
+		vmx->loaded_vmcs->hv_timer_soft_disabled = false;
 	} else if (vmx->hv_deadline_tsc != -1) {
 		tscl = rdtsc();
 		if (vmx->hv_deadline_tsc > tscl)
diff --git a/tools/testing/selftests/kvm/x86_64/preempt_test.c b/tools/testing/selftests/kvm/x86_64/preempt_test.c
new file mode 100644
index 000000000000..2e58cfee61d0
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/preempt_test.c
@@ -0,0 +1,82 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2024 Intel Corporation
+ *
+ * Debug the preemption timer behavior
+ */
+
+#include "test_util.h"
+#include "processor.h"
+#include "ucall_common.h"
+
+uint32_t preempt_timer_val = 0x1000000;
+static void guest_code(uint64_t apic_hz, uint64_t delay_ms)
+{
+	volatile unsigned long r;
+
+	kvm_hypercall(0x87650001, preempt_timer_val, 0, 0, 0);
+	do {
+		udelay(100);
+		r = kvm_hypercall(0x87650002, 0, 0, 0, 0);
+	} while(r != 0);
+
+	GUEST_DONE();
+}
+
+static void do_test(struct kvm_vcpu *vcpu)
+{
+	bool done = false;
+	struct ucall uc;
+
+	while (!done) {
+		vcpu_run(vcpu);
+
+		switch (get_ucall(vcpu, &uc)) {
+		case UCALL_DONE:
+			done = true;
+			break;
+		case UCALL_ABORT:
+			REPORT_GUEST_ASSERT(uc);
+			break;
+		default:
+			break;
+		}
+	}
+}
+
+static void run_test(void)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+
+	vm = vm_create(1);
+
+	sync_global_to_guest(vm, preempt_timer_val);
+
+	vcpu = vm_vcpu_add(vm, 0, guest_code);
+
+	do_test(vcpu);
+
+	kvm_vm_free(vm);
+}
+
+
+int main(int argc, char *argv[])
+{
+	int opt;
+
+	while ((opt = getopt(argc, argv, "p:h")) != -1) {
+		switch (opt) {
+		case 'p':
+			preempt_timer_val = atoi(optarg);
+			break;
+		default:
+			exit(KSFT_SKIP);
+		}
+	}
+
+	printf("preempt timer value:%u(0x%x)\n",
+	       preempt_timer_val, preempt_timer_val);
+
+	run_test();
+}
--
2.27.0

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ