Message-ID: <aBQdKyCI0fC5T8U-@google.com>
Date: Thu, 1 May 2025 18:17:31 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Suleiman Souhlal <suleiman@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, Thomas Gleixner <tglx@...utronix.de>, 
	Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org, 
	"H. Peter Anvin" <hpa@...or.com>, Chao Gao <chao.gao@...el.com>, 
	David Woodhouse <dwmw2@...radead.org>, Sergey Senozhatsky <senozhatsky@...omium.org>, 
	Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org, 
	ssouhlal@...ebsd.org
Subject: Re: [PATCH v5 2/2] KVM: x86: Include host suspended time in steal time

On Tue, Mar 25, 2025, Suleiman Souhlal wrote:
> When the host resumes from a suspend, the guest thinks any task
> that was running during the suspend ran for a long time, even though
> the effective run time was much shorter, which can end up having
> negative effects with scheduling.
> 
> To mitigate this issue, the time that the host was suspended is included
> in steal time, which lets the guest subtract the duration from the
> tasks' runtime.
> 
> In order to implement this behavior, once the suspend notifier fires,
> vCPUs trying to run will block until the resume notifier finishes. This is
> because the freezing of userspace tasks happens between these two points.
> It means that vCPUs could otherwise run and get their suspend steal
> time misaccounted, particularly if a vCPU would run after resume before
> the resume notifier fires.
> Incidentally, doing this also addresses a potential race with the
> suspend notifier setting PVCLOCK_GUEST_STOPPED, which could then get
> cleared before the suspend actually happened.
> 
> One potential caveat is that in the case of a suspend happening during
> a VM migration, the suspend time might not be accounted for.
> A workaround would be for the VMM to ensure that the guest is entered
> with KVM_RUN after resuming from suspend.

Please rewrite this to state what changes are being made, in imperative mood, as
commands.  Describing the _effects_ of a change makes it extremely difficult to
understand whether the behavior is pre-patch or post-patch.

E.g. for this

  vCPUs trying to run will block until the resume notifier finishes

I had to look at the code to understand what this was saying, which largely
defeats the purpose of the changelog.

> Signed-off-by: Suleiman Souhlal <suleiman@...gle.com>
> ---
>  Documentation/virt/kvm/x86/msr.rst | 10 ++++--
>  arch/x86/include/asm/kvm_host.h    |  6 ++++
>  arch/x86/kvm/x86.c                 | 51 ++++++++++++++++++++++++++++++
>  3 files changed, 65 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/x86/msr.rst b/Documentation/virt/kvm/x86/msr.rst
> index 3aecf2a70e7b43..48f2a8ca519548 100644
> --- a/Documentation/virt/kvm/x86/msr.rst
> +++ b/Documentation/virt/kvm/x86/msr.rst
> @@ -294,8 +294,14 @@ data:
>  
>  	steal:
>  		the amount of time in which this vCPU did not run, in
> -		nanoseconds. Time during which the vcpu is idle, will not be
> -		reported as steal time.
> +		nanoseconds. This includes the time during which the host is
> +		suspended. Time during which the vcpu is idle, might not be
> +		reported as steal time. The case where the host suspends
> +		during a VM migration might not be accounted if VCPUs aren't
> +		entered post-resume, because KVM does not currently support
> +		suspend/resuming the associated metadata. A workaround would
> +		be for the VMM to ensure that the guest is entered with
> +		KVM_RUN after resuming from suspend.

Coming back to this with fresh eyes, I kinda feel like this needs an opt-in
somewhere.  E.g. a KVM capability, or maybe a guest-side steal-time feature?  Or
maybe we can squeak by with a module param based on your use case?

IIRC, there is a guest-side fix that is needed to not go completely off the rails
for large steal-time values.  I.e. enabling this blindly could negatively affect
existing guests.

The forced wait behavior introduced in v4 also gives me pause, but that should
really just be about getting the code right, i.e. shouldn't go sideways as long
as the host kernel is bug free.

Ugh, actually, yeah, that part needs a guard.  At the very least, it needs to be
conditional on steal-time being enabled.  KVM most definitely should not block
vCPUs that aren't using steal-time, as that's a complete waste and will only make
the effects of suspend worse for the guest.  At that point, having the guest
opt in to the behavior is a pretty minor change, and it gives users a way to
opt out if this is causing pain.
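
If we do end up going the module param route, I'm thinking something along these
lines, where both the suspend accounting and any forced wait are keyed off the
vCPU actually having steal-time enabled.  To be clear, this is just an untested
sketch; the param name and helper are made up, not existing knobs:

	/*
	 * Illustrative only: hypothetical opt-in param, plus a helper that
	 * gates suspend-time accounting (and any blocking of vCPUs across
	 * suspend/resume) on steal-time actually being used by the vCPU.
	 */
	static bool report_suspend_time_as_steal;
	module_param(report_suspend_time_as_steal, bool, 0444);

	static bool kvm_account_suspend_as_steal(struct kvm_vcpu *vcpu)
	{
		return report_suspend_time_as_steal &&
		       (vcpu->arch.st.msr_val & KVM_MSR_ENABLED);
	}

A guest-side opt-in would look much the same, just keyed off a new steal-time
feature bit instead of (or in addition to) the module param.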

>  	preempted:
>  		indicate the vCPU who owns this struct is running or
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f5ce2c2782142b..10634bbf2f5d21 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -124,6 +124,7 @@
>  #define KVM_REQ_HV_TLB_FLUSH \
>  	KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE	KVM_ARCH_REQ(34)
> +#define KVM_REQ_WAIT_FOR_RESUME		KVM_ARCH_REQ(35)
>  
>  #define CR0_RESERVED_BITS                                               \
>  	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> @@ -917,8 +918,13 @@ struct kvm_vcpu_arch {
>  
>  	struct {
>  		u8 preempted;
> +		bool host_suspended;
>  		u64 msr_val;
>  		u64 last_steal;
> +		u64 last_suspend;
> +		u64 suspend_ns;
> +		u64 last_suspend_ns;
> +		wait_queue_head_t resume_waitq;
>  		struct gfn_to_hva_cache cache;
>  	} st;
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6b4ea3be66e814..327d1831dc0746 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3717,6 +3717,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
>  	steal += current->sched_info.run_delay -
>  		vcpu->arch.st.last_steal;
>  	vcpu->arch.st.last_steal = current->sched_info.run_delay;
> +	steal += vcpu->arch.st.suspend_ns - vcpu->arch.st.last_suspend_ns;
> +	vcpu->arch.st.last_suspend_ns = vcpu->arch.st.suspend_ns;

Isn't this just:

	steal += vcpu->arch.st.suspend_ns;
	vcpu->arch.st.suspend_ns = 0;

or am I missing something?  I suspect you implemented the code this way to avoid
writing vcpu->arch.st.suspend_ns in this context, because you discovered that
record_steal_time() can run concurrently with kvm_arch_suspend_notifier(), i.e.
because vcpu->arch.st.suspend_ns was getting corrupted.

The above doesn't fully solve the problem; it just makes the badness less bad
and/or much less likely to be hit.  E.g. if vcpu->arch.st.suspend_ns is advanced
between the first and second loads, KVM would fail to account the delta between
the two loads.

Unless I'm missing something, the obvious/easy thing is to make arch.st.suspend_ns
an atomic64_t, e.g.

	if (unlikely(atomic64_read(&vcpu->arch.st.suspend_ns)))
		steal += atomic64_xchg(&vcpu->arch.st.suspend_ns, 0);

and then on the resume side:

	atomic64_add(suspend_ns, &vcpu->arch.st.suspend_ns);
	kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);


>  	unsafe_put_user(steal, &st->steal, out);
>  
>  	version += 1;
> @@ -6930,6 +6932,19 @@ long kvm_arch_vm_compat_ioctl(struct file *filp, unsigned int ioctl,
>  }
>  #endif
>  
> +static void wait_for_resume(struct kvm_vcpu *vcpu)
> +{
> +	wait_event_interruptible(vcpu->arch.st.resume_waitq,
> +	    vcpu->arch.st.host_suspended == 0);
> +
> +	/*
> +	 * This might happen if we blocked here before the freezing of tasks
> +	 * and we get woken up by the freezer.
> +	 */
> +	if (vcpu->arch.st.host_suspended)
> +		kvm_make_request(KVM_REQ_WAIT_FOR_RESUME, vcpu);

I most definitely don't want to add custom waiting behavior for this.  As this
code shows, ensuring a wakeup doesn't race with blocking isn't the easiest thing
in the world.

Off the top of my head, I can't think of any reason why we can't simply send the
vCPU into kvm_vcpu_block(), by treating the vCPU as completely non-runnable while
it is suspended.

>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>  static int kvm_arch_suspend_notifier(struct kvm *kvm)
>  {
> @@ -6939,6 +6954,19 @@ static int kvm_arch_suspend_notifier(struct kvm *kvm)
>  
>  	mutex_lock(&kvm->lock);
>  	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		vcpu->arch.st.last_suspend = ktime_get_boottime_ns();
> +		/*
> +		 * Tasks get thawed before the resume notifier has been called
> +		 * so we need to block vCPUs until the resume notifier has run.
> +		 * Otherwise, suspend steal time might get applied too late,
> +		 * and get accounted to the wrong guest task.
> +		 * This also ensures that the guest paused bit set below
> +		 * doesn't get checked and cleared before the host actually
> +		 * suspends.
> +		 */
> +		vcpu->arch.st.host_suspended = 1;

We can definitely avoid this flag, e.g. by zeroing last_suspend in the resume
notifier, and using that to detect "host suspended".

> +		kvm_make_request(KVM_REQ_WAIT_FOR_RESUME, vcpu);
> +
>  		if (!vcpu->arch.pv_time.active)
>  			continue;
>  
> @@ -6954,12 +6982,32 @@ static int kvm_arch_suspend_notifier(struct kvm *kvm)
>  	return ret ? NOTIFY_BAD : NOTIFY_DONE;
>  }
>  
> +static int kvm_arch_resume_notifier(struct kvm *kvm)
> +{
> +	struct kvm_vcpu *vcpu;
> +	unsigned long i;
> +
> +	mutex_lock(&kvm->lock);

No need for this; it provides zero protection and can (very, very theoretically)
trigger deadlock.  The lock has already been dropped from the suspend notifier.

> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		vcpu->arch.st.host_suspended = 0;
> +		vcpu->arch.st.suspend_ns += ktime_get_boottime_ns() -
> +		    vcpu->arch.st.last_suspend;
> +		wake_up_interruptible(&vcpu->arch.st.resume_waitq);

This needs a

	kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);

to ensure the suspend_ns time is accounted.  kvm_arch_vcpu_load() probably
guarantees KVM_REQ_STEAL_UPDATE is set, but KVM shouldn't rely on that.

Completely untested, and I didn't add any new ABI, but something like this?

---
 Documentation/virt/kvm/x86/msr.rst | 10 ++++--
 arch/x86/include/asm/kvm_host.h    |  2 ++
 arch/x86/kvm/x86.c                 | 56 +++++++++++++++++++++++++++++-
 3 files changed, 65 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/x86/msr.rst b/Documentation/virt/kvm/x86/msr.rst
index 3aecf2a70e7b..48f2a8ca5195 100644
--- a/Documentation/virt/kvm/x86/msr.rst
+++ b/Documentation/virt/kvm/x86/msr.rst
@@ -294,8 +294,14 @@ data:
 
 	steal:
 		the amount of time in which this vCPU did not run, in
-		nanoseconds. Time during which the vcpu is idle, will not be
-		reported as steal time.
+		nanoseconds. This includes the time during which the host is
+		suspended. Time during which the vcpu is idle, might not be
+		reported as steal time. The case where the host suspends
+		during a VM migration might not be accounted if VCPUs aren't
+		entered post-resume, because KVM does not currently support
+		suspend/resuming the associated metadata. A workaround would
+		be for the VMM to ensure that the guest is entered with
+		KVM_RUN after resuming from suspend.
 
 	preempted:
 		indicate the vCPU who owns this struct is running or
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8becf50d9ade..8a5ff888037a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -931,6 +931,8 @@ struct kvm_vcpu_arch {
 		u8 preempted;
 		u64 msr_val;
 		u64 last_steal;
+		atomic64_t suspend_ns;
+		u64 suspend_ts;
 		struct gfn_to_hva_cache cache;
 	} st;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 73f4a85c72aa..b6120ebbb8fa 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3751,6 +3751,10 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 	steal += current->sched_info.run_delay -
 		vcpu->arch.st.last_steal;
 	vcpu->arch.st.last_steal = current->sched_info.run_delay;
+
+	if (unlikely(atomic64_read(&vcpu->arch.st.suspend_ns)))
+		steal += atomic64_xchg(&vcpu->arch.st.suspend_ns, 0);
+
 	unsafe_put_user(steal, &st->steal, out);
 
 	version += 1;
@@ -6992,6 +6996,7 @@ long kvm_arch_vm_compat_ioctl(struct file *filp, unsigned int ioctl,
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 static int kvm_arch_suspend_notifier(struct kvm *kvm)
 {
+	bool kick_vcpus = false;
 	struct kvm_vcpu *vcpu;
 	unsigned long i;
 
@@ -6999,9 +7004,45 @@ static int kvm_arch_suspend_notifier(struct kvm *kvm)
 	 * Ignore the return, marking the guest paused only "fails" if the vCPU
 	 * isn't using kvmclock; continuing on is correct and desirable.
 	 */
-	kvm_for_each_vcpu(i, vcpu, kvm)
+	kvm_for_each_vcpu(i, vcpu, kvm) {
 		(void)kvm_set_guest_paused(vcpu);
 
+		if (vcpu->arch.st.msr_val & KVM_MSR_ENABLED) {
+			kick_vcpus = true;
+			WRITE_ONCE(vcpu->arch.st.suspend_ts,
+				   ktime_get_boottime_ns());
+		}
+	}
+
+	if (kick_vcpus)
+		kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
+
+	return NOTIFY_DONE;
+}
+
+static int kvm_arch_resume_notifier(struct kvm *kvm)
+{
+	struct kvm_vcpu *vcpu;
+	unsigned long i;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		u64 suspend_ns  = ktime_get_boottime_ns() -
+				  vcpu->arch.st.suspend_ts;
+
+		WRITE_ONCE(vcpu->arch.st.suspend_ts, 0);
+
+		/*
+		 * Only accumulate the suspend time if steal-time is enabled,
+		 * but always clear suspend_ts and kick the vCPU as the vCPU
+		 * could have disabled steal-time after the suspend notifier
+		 * grabbed suspend_ts.
+		 */
+		if (vcpu->arch.st.msr_val & KVM_MSR_ENABLED)
+			atomic64_add(suspend_ns, &vcpu->arch.st.suspend_ns);
+
+		kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
+	}
+
 	return NOTIFY_DONE;
 }
 
@@ -7011,6 +7052,9 @@ int kvm_arch_pm_notifier(struct kvm *kvm, unsigned long state)
 	case PM_HIBERNATION_PREPARE:
 	case PM_SUSPEND_PREPARE:
 		return kvm_arch_suspend_notifier(kvm);
+	case PM_POST_HIBERNATION:
+	case PM_POST_SUSPEND:
+		return kvm_arch_resume_notifier(kvm);
 	}
 
 	return NOTIFY_DONE;
@@ -11251,6 +11295,16 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_has_events);
 
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
 {
+	/*
+	 * During host SUSPEND/RESUME tasks get frozen after SUSPEND notifiers
+	 * run, and thawed before RESUME notifiers, i.e. vCPUs can be actively
+	 * running when KVM sees the system as suspended.  Block the vCPU if
+	 * KVM sees the vCPU as suspended to ensure the suspend steal time is
+	 * accounted before the guest can run, and to the correct guest task.
+	 */
+	if (READ_ONCE(vcpu->arch.st.suspend_ts))
+		return false;
+
 	return kvm_vcpu_running(vcpu) || vcpu->arch.pv.pv_unhalted ||
 	       kvm_vcpu_has_events(vcpu);
 }

base-commit: 17cfb61855eafd72fd6a22d713a39be0d74660e1
-- 
