Message-ID: <ZqhPVnmD7XwFPHtW@chao-email>
Date: Tue, 30 Jul 2024 10:26:30 +0800
From: Chao Gao <chao.gao@...el.com>
To: Suleiman Souhlal <suleiman@...gle.com>
CC: Sean Christopherson <seanjc@...gle.com>, Paolo Bonzini
	<pbonzini@...hat.com>, <ssouhlal@...ebsd.org>, Thomas Gleixner
	<tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, Borislav Petkov
	<bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>, <x86@...nel.org>,
	"H. Peter Anvin" <hpa@...or.com>, <kvm@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] KVM: x86: Include host suspended time in steal time.

Hi,

On Wed, Jul 10, 2024 at 04:44:10PM +0900, Suleiman Souhlal wrote:
>When the host resumes from a suspend, the guest thinks any task
>that was running during the suspend ran for a long time, even though
>the effective run time was much shorter, which can end up having
>negative effects with scheduling. This can be particularly noticeable
>if the guest task was RT, as it can end up getting throttled for a
>long time.
>
>To mitigate this issue, we include the time that the host was
>suspended in steal time, which lets the guest subtract the
>duration from the tasks' runtime.
>
>Signed-off-by: Suleiman Souhlal <suleiman@...gle.com>
>---
> arch/x86/kvm/x86.c       | 23 ++++++++++++++++++++++-
> include/linux/kvm_host.h |  4 ++++
> 2 files changed, 26 insertions(+), 1 deletion(-)
>
>diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>index 0763a0f72a067f..94bbdeef843863 100644
>--- a/arch/x86/kvm/x86.c
>+++ b/arch/x86/kvm/x86.c
>@@ -3669,7 +3669,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
> 	struct kvm_steal_time __user *st;
> 	struct kvm_memslots *slots;
> 	gpa_t gpa = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
>-	u64 steal;
>+	u64 steal, suspend_duration;
> 	u32 version;
> 
> 	if (kvm_xen_msr_enabled(vcpu->kvm)) {
>@@ -3696,6 +3696,12 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
> 			return;
> 	}
> 
>+	suspend_duration = 0;
>+	if (READ_ONCE(vcpu->suspended)) {
>+		suspend_duration = vcpu->kvm->last_suspend_duration;
>+		vcpu->suspended = 0;

Can you explain why READ_ONCE() is necessary here, but WRITE_ONCE() isn't used
for clearing vcpu->suspended?

>+	}
>+
> 	st = (struct kvm_steal_time __user *)ghc->hva;
> 	/*
> 	 * Doing a TLB flush here, on the guest's behalf, can avoid
>@@ -3749,6 +3755,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
> 	unsafe_get_user(steal, &st->steal, out);
> 	steal += current->sched_info.run_delay -
> 		vcpu->arch.st.last_steal;
>+	steal += suspend_duration;
> 	vcpu->arch.st.last_steal = current->sched_info.run_delay;
> 	unsafe_put_user(steal, &st->steal, out);
> 
>@@ -6920,6 +6927,7 @@ static int kvm_arch_suspend_notifier(struct kvm *kvm)
> 
> 	mutex_lock(&kvm->lock);
> 	kvm_for_each_vcpu(i, vcpu, kvm) {
>+		WRITE_ONCE(vcpu->suspended, 1);
> 		if (!vcpu->arch.pv_time.active)
> 			continue;
> 
>@@ -6932,15 +6940,28 @@ static int kvm_arch_suspend_notifier(struct kvm *kvm)
> 	}
> 	mutex_unlock(&kvm->lock);
> 
>+	kvm->suspended_time = ktime_get_boottime_ns();
>+
> 	return ret ? NOTIFY_BAD : NOTIFY_DONE;
> }
> 
>+static int
>+kvm_arch_resume_notifier(struct kvm *kvm)
>+{
>+	kvm->last_suspend_duration = ktime_get_boottime_ns() -
>+	    kvm->suspended_time;

Is it possible that a vCPU doesn't get any chance to run (i.e., update steal
time) between two suspends? In this case, only the second suspend would be
recorded.

Maybe we need infrastructure in the PM subsystem to record the accumulated
suspended time. When updating steal time, KVM can add the suspended time
accrued since the last update into steal_time (the same way KVM handles
current->sched_info.run_delay). This way, the scenario I mentioned above won't
be a problem and KVM needn't calculate the suspend duration for each guest.
This approach could also benefit RISC-V and ARM, since they have the same
steal_time logic as x86.

Additionally, it seems that if a guest migrates to another host after a suspend
but before updating steal time, the suspend duration is lost. I'm not sure
whether this is a practical issue.
