Message-ID: <aKZrrW77QzB6ee10@google.com>
Date: Wed, 20 Aug 2025 17:43:25 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Suleiman Souhlal <suleiman@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, Thomas Gleixner <tglx@...utronix.de>, 
	Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org, 
	"H. Peter Anvin" <hpa@...or.com>, Chao Gao <chao.gao@...el.com>, 
	David Woodhouse <dwmw2@...radead.org>, Sergey Senozhatsky <senozhatsky@...omium.org>, 
	Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>, Tzung-Bi Shih <tzungbi@...nel.org>, 
	John Stultz <jstultz@...gle.com>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org, 
	ssouhlal@...ebsd.org
Subject: Re: [PATCH v8 2/3] KVM: x86: Include host suspended duration in steal time

On Tue, Jul 22, 2025, Suleiman Souhlal wrote:
> Introduce MSR_KVM_SUSPEND_STEAL which controls whether or not a guest
> wants the duration of host suspend to be included in steal time.
>
> This lets guests subtract the duration during which the host was
> suspended from the runtime of tasks that were running over the suspend,
> in order to prevent cases where host suspend causes long runtimes in
> guest tasks, even though their effective runtime was much shorter.

This is far too terse.  There are a _lot_ of subtleties and wrinkles in this
code that need to be captured.  A persistent reader can find them in the lore
links, but I want to cover the main points in the changelog.

I actually don't mind putting that together; it's a good way to remind myself
of what this is doing, so feel free to ignore this and I can fix it up when
applying.

> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index a1efa7907a0b..678ebc3d7eeb 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -36,6 +36,7 @@
>  #define KVM_FEATURE_MSI_EXT_DEST_ID	15
>  #define KVM_FEATURE_HC_MAP_GPA_RANGE	16
>  #define KVM_FEATURE_MIGRATION_CONTROL	17
> +#define KVM_FEATURE_SUSPEND_STEAL	18
>  
>  #define KVM_HINTS_REALTIME      0
>  
> @@ -58,6 +59,7 @@
>  #define MSR_KVM_ASYNC_PF_INT	0x4b564d06
>  #define MSR_KVM_ASYNC_PF_ACK	0x4b564d07
>  #define MSR_KVM_MIGRATION_CONTROL	0x4b564d08
> +#define MSR_KVM_SUSPEND_STEAL	0x4b564d09

I'm pretty sure we don't need a new MSR.  There are 4 bits (I think; KVM's
#defines are horrific) in MSR_KVM_STEAL_TIME that are currently reserved; we
can simply grab one of those.

Using MSR_KVM_STEAL_TIME also makes it more obvious that SUSPEND_STEAL has a
hard dependency on STEAL_TIME being enabled.
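
Purely as an illustration (the guest side is not part of this patch, and the
exact bit is TBD), enabling would then be a one-bit tweak to the existing
steal-time registration sequence, where "st" is the per-CPU steal_time area:

	u64 val = slow_virt_to_phys(st) | KVM_MSR_ENABLED;

	/* Opt in to suspend accounting via a flag in the same MSR. */
	if (kvm_para_has_feature(KVM_FEATURE_SUSPEND_STEAL))
		val |= KVM_STEAL_SUSPEND_TIME;

	wrmsrl(MSR_KVM_STEAL_TIME, val);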

> @@ -7027,13 +7045,52 @@ static int kvm_arch_suspend_notifier(struct kvm *kvm)
>  {
>  	struct kvm_vcpu *vcpu;
>  	unsigned long i;
> +	bool kick_vcpus = false;
>  
> -	/*
> -	 * Ignore the return, marking the guest paused only "fails" if the vCPU
> -	 * isn't using kvmclock; continuing on is correct and desirable.
> -	 */
> -	kvm_for_each_vcpu(i, vcpu, kvm)
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		if (vcpu->arch.msr_kvm_suspend_steal & KVM_MSR_ENABLED) {
> +			kick_vcpus = true;
> +			WRITE_ONCE(vcpu->arch.st.suspend_ts,
> +				   ktime_get_boottime_ns());
> +		}
> +		/*
> +		 * Ignore the return, marking the guest paused only "fails" if
> +		 * the vCPU isn't using kvmclock; continuing on is correct and
> +		 * desirable.
> +		 */
>  		(void)kvm_set_guest_paused(vcpu);
> +	}
> +

This should have a comment.  I'm pretty sure I suggested this idea, and it still
took me a good 5 minutes to figure out what this code was doing.

	/*
	 * Force vCPUs to exit from KVM's inner run loop to ensure they see
	 * suspend_ts and block until KVM's resume notifier runs.
	 */


> +	if (kick_vcpus)
> +		kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
> +
> +	return NOTIFY_DONE;
> +}
> +
> +static int

Do not wrap before the function name.  Linus has a nice explanation/rant on this:

https://lore.kernel.org/all/CAHk-=wjoLAYG446ZNHfg=GhjSY6nFmuB_wA8fYd5iLBNXjo9Bw@mail.gmail.com
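
The tl;dr, as a declaration-only illustration:

	/* Bad: grepping for the name doesn't show the return type. */
	static int
	kvm_arch_resume_notifier(struct kvm *kvm);

	/* Good: keep the type and name on one line, even if it runs long. */
	static int kvm_arch_resume_notifier(struct kvm *kvm);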

> +kvm_arch_resume_notifier(struct kvm *kvm)
> +{
> +	struct kvm_vcpu *vcpu;
> +	unsigned long i;
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		u64 suspend_ns = ktime_get_boottime_ns() -
> +				 vcpu->arch.st.suspend_ts;
> +
> +		WRITE_ONCE(vcpu->arch.st.suspend_ts, 0);
> +
> +		/*
> +		 * Only accumulate the suspend time if suspend steal-time is
> +		 * enabled, but always clear suspend_ts and kick the vCPU as
> +		 * the vCPU could have disabled suspend steal-time after the
> +		 * suspend notifier grabbed suspend_ts.
> +		 */
> +		if (vcpu->arch.msr_kvm_suspend_steal & KVM_MSR_ENABLED)

This should arguably also check that STEAL_TIME is enabled.  And regardless of
which MSR is used, this needs to be a READ_ONCE() since the vCPU could modify
the value while KVM is running this code.

The other option would be to only check the MSR from within KVM_RUN, but getting
that right would probably be even harder, and in practice a false negative/positive
is a non-issue.

Related to toggling the MSR, arguably arch.st.suspend_ns should be zeroed if
SUSPEND_STEAL is disabled, but I don't expect a guest to ever disable
SUSPEND_STEAL and then re-enable it at a later time, _and_ AFAICT STEAL_TIME
doesn't work that way either, i.e. it keeps accumulating stolen time and
throws it all at the guest if/when STEAL_TIME is enabled.
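
If we did want to discard the time, it would be a small addition to the
MSR_KVM_STEAL_TIME write path.  Purely illustrative (this is _not_ in the
patch below), something like:

	/*
	 * Hypothetical: throw away any accumulated suspend time when the
	 * guest turns off SUSPEND_STEAL (or steal time entirely), so a
	 * later re-enable starts from a clean slate.
	 */
	if (!(data & KVM_MSR_ENABLED) || !(data & KVM_STEAL_SUSPEND_TIME))
		atomic64_set(&vcpu->arch.st.suspend_ns, 0);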

> +			atomic64_add(suspend_ns, &vcpu->arch.st.suspend_ns);
> +
> +		kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
> +		kvm_vcpu_kick(vcpu);
> +	}

Compile tested only, but this?

---
 Documentation/virt/kvm/x86/cpuid.rst |  3 +
 Documentation/virt/kvm/x86/msr.rst   |  6 ++
 arch/x86/include/asm/kvm_host.h      |  2 +
 arch/x86/include/uapi/asm/kvm_para.h |  2 +
 arch/x86/kvm/cpuid.c                 |  4 +-
 arch/x86/kvm/x86.c                   | 83 +++++++++++++++++++++++++---
 6 files changed, 92 insertions(+), 8 deletions(-)

diff --git a/Documentation/virt/kvm/x86/cpuid.rst b/Documentation/virt/kvm/x86/cpuid.rst
index bda3e3e737d7..0fa96a134718 100644
--- a/Documentation/virt/kvm/x86/cpuid.rst
+++ b/Documentation/virt/kvm/x86/cpuid.rst
@@ -103,6 +103,9 @@ KVM_FEATURE_HC_MAP_GPA_RANGE       16          guest checks this feature bit bef
 KVM_FEATURE_MIGRATION_CONTROL      17          guest checks this feature bit before
                                                using MSR_KVM_MIGRATION_CONTROL
 
+KVM_FEATURE_SUSPEND_STEAL          18          guest checks this feature bit
+                                               before enabling KVM_STEAL_SUSPEND_TIME
+
 KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24          host will warn if no guest-side
                                                per-cpu warps are expected in
                                                kvmclock
diff --git a/Documentation/virt/kvm/x86/msr.rst b/Documentation/virt/kvm/x86/msr.rst
index 3aecf2a70e7b..367842311ed3 100644
--- a/Documentation/virt/kvm/x86/msr.rst
+++ b/Documentation/virt/kvm/x86/msr.rst
@@ -296,6 +296,12 @@ data:
 		the amount of time in which this vCPU did not run, in
 		nanoseconds. Time during which the vcpu is idle, will not be
 		reported as steal time.
+		If the guest sets KVM_STEAL_SUSPEND_TIME, steal time includes
+		the duration during which the host is suspended.  The case
+		where the host suspends during a VM migration might not be
+		accounted if VCPUs don't enter the guest post-resume.  A
+		workaround would be for the VMM to ensure that the guest is
+		entered with KVM_RUN after resuming from suspend.
 
 	preempted:
 		indicate the vCPU who owns this struct is running or
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0d3cc0fc27af..8e11d08dc701 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -944,6 +944,8 @@ struct kvm_vcpu_arch {
 		u8 preempted;
 		u64 msr_val;
 		u64 last_steal;
+		u64 suspend_ts;
+		atomic64_t suspend_ns;
 		struct gfn_to_hva_cache cache;
 	} st;
 
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index a1efa7907a0b..1bbe2db93890 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -36,6 +36,7 @@
 #define KVM_FEATURE_MSI_EXT_DEST_ID	15
 #define KVM_FEATURE_HC_MAP_GPA_RANGE	16
 #define KVM_FEATURE_MIGRATION_CONTROL	17
+#define KVM_FEATURE_SUSPEND_STEAL	18
 
 #define KVM_HINTS_REALTIME      0
 
@@ -80,6 +81,7 @@ struct kvm_clock_pairing {
 	__u32 pad[9];
 };
 
+#define KVM_STEAL_SUSPEND_TIME			(1 << 1)
 #define KVM_STEAL_ALIGNMENT_BITS 5
 #define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
 #define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index ad6cadf09930..9dbb3899bee7 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1626,8 +1626,10 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 			     (1 << KVM_FEATURE_PV_SCHED_YIELD) |
 			     (1 << KVM_FEATURE_ASYNC_PF_INT);
 
-		if (sched_info_on())
+		if (sched_info_on()) {
 			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
+			entry->eax |= (1 << KVM_FEATURE_SUSPEND_STEAL);
+		}
 
 		entry->ebx = 0;
 		entry->ecx = 0;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7ba2cdfdac44..cae1c5a8eb99 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3784,6 +3784,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 	steal += current->sched_info.run_delay -
 		vcpu->arch.st.last_steal;
 	vcpu->arch.st.last_steal = current->sched_info.run_delay;
+	if (unlikely(atomic64_read(&vcpu->arch.st.suspend_ns)))
+		steal += atomic64_xchg(&vcpu->arch.st.suspend_ns, 0);
 	unsafe_put_user(steal, &st->steal, out);
 
 	version += 1;
@@ -4052,14 +4054,19 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			kvm_check_async_pf_completion(vcpu);
 		}
 		break;
-	case MSR_KVM_STEAL_TIME:
+	case MSR_KVM_STEAL_TIME: {
+		u64 reserved = KVM_STEAL_RESERVED_MASK;
+
 		if (!guest_pv_has(vcpu, KVM_FEATURE_STEAL_TIME))
 			return 1;
 
 		if (unlikely(!sched_info_on()))
 			return 1;
 
-		if (data & KVM_STEAL_RESERVED_MASK)
+		if (guest_pv_has(vcpu, KVM_FEATURE_SUSPEND_STEAL))
+			reserved &= ~KVM_STEAL_SUSPEND_TIME;
+
+		if (data & reserved)
 			return 1;
 
 		vcpu->arch.st.msr_val = data;
@@ -4070,6 +4077,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
 
 		break;
+	}
 	case MSR_KVM_PV_EOI_EN:
 		if (!guest_pv_has(vcpu, KVM_FEATURE_PV_EOI))
 			return 1;
@@ -6858,18 +6866,66 @@ long kvm_arch_vm_compat_ioctl(struct file *filp, unsigned int ioctl,
 }
 #endif
 
+static bool kvm_is_suspend_steal_enabled(struct kvm_vcpu *vcpu)
+{
+	u64 val = READ_ONCE(vcpu->arch.st.msr_val);
+
+	return (val & KVM_MSR_ENABLED) && (val & KVM_STEAL_SUSPEND_TIME);
+}
+
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 static int kvm_arch_suspend_notifier(struct kvm *kvm)
 {
 	struct kvm_vcpu *vcpu;
 	unsigned long i;
+	bool kick_vcpus = false;
 
-	/*
-	 * Ignore the return, marking the guest paused only "fails" if the vCPU
-	 * isn't using kvmclock; continuing on is correct and desirable.
-	 */
-	kvm_for_each_vcpu(i, vcpu, kvm)
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		if (kvm_is_suspend_steal_enabled(vcpu)) {
+			kick_vcpus = true;
+			WRITE_ONCE(vcpu->arch.st.suspend_ts,
+				   ktime_get_boottime_ns());
+		}
+		/*
+		 * Ignore the return, marking the guest paused only "fails" if
+		 * the vCPU isn't using kvmclock; continuing on is correct and
+		 * desirable.
+		 */
 		(void)kvm_set_guest_paused(vcpu);
+	}
+
+	/*
+	 * Force vCPUs to exit from KVM's inner run loop to ensure they see
+	 * suspend_ts and block until KVM's resume notifier runs.
+	 */
+	if (kick_vcpus)
+		kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
+
+	return NOTIFY_DONE;
+}
+
+static int kvm_arch_resume_notifier(struct kvm *kvm)
+{
+	struct kvm_vcpu *vcpu;
+	unsigned long i;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		u64 suspend_ns = ktime_get_boottime_ns() -
+				 vcpu->arch.st.suspend_ts;
+		WRITE_ONCE(vcpu->arch.st.suspend_ts, 0);
+
+		/*
+		 * Only accumulate the suspend time if suspend steal-time is
+		 * enabled, but always clear suspend_ts and kick the vCPU as
+		 * the vCPU could have disabled suspend steal-time after the
+		 * suspend notifier grabbed suspend_ts.
+		 */
+		if (kvm_is_suspend_steal_enabled(vcpu))
+			atomic64_add(suspend_ns, &vcpu->arch.st.suspend_ns);
+
+		kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
+		kvm_vcpu_kick(vcpu);
+	}
 
 	return NOTIFY_DONE;
 }
@@ -6880,6 +6936,9 @@ int kvm_arch_pm_notifier(struct kvm *kvm, unsigned long state)
 	case PM_HIBERNATION_PREPARE:
 	case PM_SUSPEND_PREPARE:
 		return kvm_arch_suspend_notifier(kvm);
+	case PM_POST_HIBERNATION:
+	case PM_POST_SUSPEND:
+		return kvm_arch_resume_notifier(kvm);
 	}
 
 	return NOTIFY_DONE;
@@ -11104,6 +11163,16 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 static bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
 {
+	/*
+	 * During host SUSPEND/RESUME tasks get frozen after SUSPEND notifiers
+	 * run, and thawed before RESUME notifiers, i.e. vCPUs can be actively
+	 * running when KVM sees the system as suspended.  Block the vCPU if
+	 * KVM sees the vCPU as suspended to ensure the suspend steal time is
+	 * accounted before the guest can run, and to the correct guest task.
+	 */
+	if (READ_ONCE(vcpu->arch.st.suspend_ts))
+		return false;
+
 	return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
 		!vcpu->arch.apf.halted);
 }

base-commit: 91b392ada892a2e8b1c621b9493c50f6fb49880f
--
