linux-kernel - Re: [PATCH] KVM: x86: Include host suspended time in steal time.

Message-ID: <Zqi2RJKp8JxSedOI@freefall.freebsd.org>
Date: Tue, 30 Jul 2024 09:45:40 +0000
From: Suleiman Souhlal <ssouhlal@...ebsd.org>
To: Chao Gao <chao.gao@...el.com>
Cc: Suleiman Souhlal <suleiman@...gle.com>,
	Sean Christopherson <seanjc@...gle.com>,
	Paolo Bonzini <pbonzini@...hat.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
	Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
	"H. Peter Anvin" <hpa@...or.com>, kvm@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] KVM: x86: Include host suspended time in steal time.

On Tue, Jul 30, 2024 at 10:26:30AM +0800, Chao Gao wrote:
> Hi,
> 
> On Wed, Jul 10, 2024 at 04:44:10PM +0900, Suleiman Souhlal wrote:
> >When the host resumes from a suspend, the guest thinks any task
> >that was running during the suspend ran for a long time, even though
> >the effective run time was much shorter, which can end up having
> >negative effects with scheduling. This can be particularly noticeable
> >if the guest task was RT, as it can end up getting throttled for a
> >long time.
> >
> >To mitigate this issue, we include the time that the host was
> >suspended in steal time, which lets the guest subtract the
> >duration from the tasks' runtime.
> >
> >Signed-off-by: Suleiman Souhlal <suleiman@...gle.com>
> >---
> > arch/x86/kvm/x86.c       | 23 ++++++++++++++++++++++-
> > include/linux/kvm_host.h |  4 ++++
> > 2 files changed, 26 insertions(+), 1 deletion(-)
> >
> >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >index 0763a0f72a067f..94bbdeef843863 100644
> >--- a/arch/x86/kvm/x86.c
> >+++ b/arch/x86/kvm/x86.c
> >@@ -3669,7 +3669,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
> > 	struct kvm_steal_time __user *st;
> > 	struct kvm_memslots *slots;
> > 	gpa_t gpa = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
> >-	u64 steal;
> >+	u64 steal, suspend_duration;
> > 	u32 version;
> > 
> > 	if (kvm_xen_msr_enabled(vcpu->kvm)) {
> >@@ -3696,6 +3696,12 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
> > 			return;
> > 	}
> > 
> >+	suspend_duration = 0;
> >+	if (READ_ONCE(vcpu->suspended)) {
> >+		suspend_duration = vcpu->kvm->last_suspend_duration;
> >+		vcpu->suspended = 0;
> 
> Can you explain why READ_ONCE() is necessary here, but WRITE_ONCE() isn't used
> for clearing vcpu->suspended?

Because this part of the code is essentially single-threaded, it didn't seem
like WRITE_ONCE() was needed. I can add it in a future version of
the patch if it makes things less confusing (if this code still exists).
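For illustration, a minimal userspace sketch of what pairing the two accesses might look like. The macros are simplified stand-ins for the kernel's READ_ONCE()/WRITE_ONCE() (plain volatile accesses on an int), and struct vcpu_sketch and consume_suspend() are hypothetical names, not part of the patch. Note that marking both accesses does not make the test-and-clear atomic; it only keeps the compiler from tearing or duplicating them.

```c
#include <stdint.h>

/* Simplified stand-ins for the kernel macros: force a single volatile access. */
#define FLAG_READ_ONCE(x)     (*(const volatile int *)&(x))
#define FLAG_WRITE_ONCE(x, v) (*(volatile int *)&(x) = (v))

/* Hypothetical reduction of the vCPU state touched by the patch. */
struct vcpu_sketch {
	int suspended;	/* set by the suspend notifier, cleared by the vCPU */
};

/*
 * Return the suspend duration to fold into steal time, or 0 if this vCPU
 * has not been flagged since its last update. Both the read and the clear
 * are marked accesses, mirroring the WRITE_ONCE() on the notifier side.
 */
static uint64_t consume_suspend(struct vcpu_sketch *vcpu,
				uint64_t last_suspend_duration)
{
	uint64_t duration = 0;

	if (FLAG_READ_ONCE(vcpu->suspended)) {
		duration = last_suspend_duration;
		FLAG_WRITE_ONCE(vcpu->suspended, 0);
	}
	return duration;
}
```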

> 
> >+	}
> >+
> > 	st = (struct kvm_steal_time __user *)ghc->hva;
> > 	/*
> > 	 * Doing a TLB flush here, on the guest's behalf, can avoid
> >@@ -3749,6 +3755,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
> > 	unsafe_get_user(steal, &st->steal, out);
> > 	steal += current->sched_info.run_delay -
> > 		vcpu->arch.st.last_steal;
> >+	steal += suspend_duration;
> > 	vcpu->arch.st.last_steal = current->sched_info.run_delay;
> > 	unsafe_put_user(steal, &st->steal, out);
> > 
> >@@ -6920,6 +6927,7 @@ static int kvm_arch_suspend_notifier(struct kvm *kvm)
> > 
> > 	mutex_lock(&kvm->lock);
> > 	kvm_for_each_vcpu(i, vcpu, kvm) {
> >+		WRITE_ONCE(vcpu->suspended, 1);
> > 		if (!vcpu->arch.pv_time.active)
> > 			continue;
> > 
> >@@ -6932,15 +6940,28 @@ static int kvm_arch_suspend_notifier(struct kvm *kvm)
> > 	}
> > 	mutex_unlock(&kvm->lock);
> > 
> >+	kvm->suspended_time = ktime_get_boottime_ns();
> >+
> > 	return ret ? NOTIFY_BAD : NOTIFY_DONE;
> > }
> > 
> >+static int
> >+kvm_arch_resume_notifier(struct kvm *kvm)
> >+{
> >+	kvm->last_suspend_duration = ktime_get_boottime_ns() -
> >+	    kvm->suspended_time;
> 
> Is it possible that a vCPU doesn't get any chance to run (i.e., update steal
> time) between two suspends? In this case, only the second suspend would be
> recorded.

Good point. I'll address this.

> 
> Maybe we need an infrastructure in the PM subsystem to record accumulated
> suspended time. When updating steal time, KVM can add the additional suspended
> time since the last update into steal_time (the same way KVM handles
> current->sched_info.run_delay). This way, the scenario I mentioned above won't
> be a problem and KVM needn't calculate the suspend duration for each guest. And
> this approach can potentially benefit RISC-V and ARM as well, since they have
> the same logic as x86 regarding steal_time.

Thanks for the suggestion.
I'm a bit wary of making a whole PM subsystem addition for such a counter, but
maybe I can make an architecture-independent KVM change for it, with a PM
notifier in kvm_main.c.
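
As a rough userspace sketch of the accumulated-counter idea discussed above (not the eventual kernel change): the VM keeps a running total of suspended time, and each vCPU remembers the total it last saw, adding only the difference to steal time. This is the same delta pattern record_steal_time() uses for run_delay. All struct and function names here are hypothetical, and timestamps are passed in explicitly instead of calling ktime_get_boottime_ns().

```c
#include <stdint.h>

/* Hypothetical per-VM state: a counter that only grows. */
struct vm_suspend_acct {
	uint64_t total_suspend_ns;	/* sum of all suspend durations so far */
	uint64_t suspend_start_ns;	/* boottime when the current suspend began */
};

/* Hypothetical per-vCPU state. */
struct vcpu_steal_acct {
	uint64_t last_suspend_ns;	/* total_suspend_ns at the last update */
	uint64_t steal_ns;		/* steal time reported to the guest */
};

/* Called from the suspend side of the PM notifier. */
static void on_suspend(struct vm_suspend_acct *vm, uint64_t now_ns)
{
	vm->suspend_start_ns = now_ns;
}

/* Called from the resume side: accumulate, never overwrite. */
static void on_resume(struct vm_suspend_acct *vm, uint64_t now_ns)
{
	vm->total_suspend_ns += now_ns - vm->suspend_start_ns;
}

/* Called when the vCPU next updates steal time. */
static void update_steal(struct vcpu_steal_acct *vcpu,
			 const struct vm_suspend_acct *vm)
{
	vcpu->steal_ns += vm->total_suspend_ns - vcpu->last_suspend_ns;
	vcpu->last_suspend_ns = vm->total_suspend_ns;
}
```

With this shape, two suspends of 50 ns and 30 ns that both complete before the vCPU runs would contribute 80 ns at the next update, and a second update with no intervening suspend would add nothing.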

> 
> Additionally, it seems that if a guest migrates to another system after a suspend
> and before updating steal time, the suspended time is lost during migration. I'm
> not sure if this is a practical issue.

The systems where the host suspends don't usually do VM migrations. Or at least
the ones where we're encountering the problem this patch is trying to address
don't (laptops).
But even if they did, it doesn't seem that likely that the migration would
happen over a host suspend.
If it's ok with you, I'll put this issue aside for the time being.

Thanks for the comments.
-- Suleiman