Message-ID: <CABCjUKDU4b5QodgT=tSgrV-fb_qnksmSxhMK3gNrUGsT9xeitg@mail.gmail.com>
Date: Fri, 17 Jan 2025 15:35:00 +0900
From: Suleiman Souhlal <suleiman@...gle.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"H. Peter Anvin" <hpa@...or.com>, Chao Gao <chao.gao@...el.com>,
David Woodhouse <dwmw2@...radead.org>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
ssouhlal@...ebsd.org
Subject: Re: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().
On Thu, Jan 16, 2025 at 6:49 AM Sean Christopherson <seanjc@...gle.com> wrote:
>
> On Tue, Jan 07, 2025, Suleiman Souhlal wrote:
> > It returns the cumulative nanoseconds that the host has been suspended.
> > It is intended to be used for reporting host suspend time to the guest.
>
> ...
>
> > #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> > +static int kvm_pm_notifier(struct kvm *kvm, unsigned long state)
> > +{
> > + switch (state) {
> > + case PM_HIBERNATION_PREPARE:
> > + case PM_SUSPEND_PREPARE:
> > + last_suspend = ktime_get_boottime_ns();
> > + break;
> > + case PM_POST_HIBERNATION:
> > + case PM_POST_SUSPEND:
> > + total_suspend_ns += ktime_get_boottime_ns() - last_suspend;
>
> After spending too much time poking around kvmclock and sched_clock code, I'm pretty
> sure that accounting *all* suspend time to steal_time is wildly inaccurate for
> most clocksources that will be used by KVM x86 guests.
>
> KVM already adjusts TSC, and by extension kvmclock, to account for the TSC going
> backwards due to suspend+resume. I haven't dug super deep, but I assume/hope the
> majority of suspend time is handled by massaging guest TSC.
>
> There's still a notable gap, as KVM's TSC adjustments likely won't account for
> the lag between CPUs coming online and vCPUs being restarted, but I don't know
> that having KVM account the suspend duration is the right way to solve that issue.
(It is my understanding that steal time has no impact on clock sources.)
On our machines, the problem isn't that the TSC is going backwards. As
you say, kvmclock takes care of that.
The problem these patches are trying to solve is that the time keeps
advancing for the VM while the host is suspended.
On the host, this problem does not happen because timekeeping is
stopped by timekeeping_suspend().
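This is also why the notifier above samples ktime_get_boottime_ns(): at
resume, the suspend duration gets injected into CLOCK_BOOTTIME but not
into CLOCK_MONOTONIC, so differencing boottime across the notifier
captures how long the host slept. As a rough userspace illustration
(not part of these patches), the cumulative suspend time since boot is
just the gap between the two clocks:

#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>

/*
 * Rough illustration only: CLOCK_BOOTTIME keeps counting across
 * suspend while CLOCK_MONOTONIC does not, so their difference is
 * approximately the total time the host has spent suspended.
 */
static long long ts_ns(clockid_t id)
{
	struct timespec ts;

	clock_gettime(id, &ts);
	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
	long long suspended_ns = ts_ns(CLOCK_BOOTTIME) - ts_ns(CLOCK_MONOTONIC);

	printf("approx. host suspend time since boot: %lld ns\n", suspended_ns);
	return 0;
}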
The problem with time advancing is that the guest scheduler thinks the
task that was running when the host suspended ran for the whole
duration. This was especially bad when we still had RT throttling
(prior to v6.12): if an RT task was current, the scheduler would then
throttle all RT tasks for a very long time. With dlserver, the problem
is much less severe, but it still exists.
There is a similar problem when the host CPU is overcommitted, where
the guest scheduler thinks the current task ran for the full duration,
even though the effective running time was much lower. This is exactly
what steal time solves, which is why I thought addressing the suspend
issue with steal time was an acceptable approach.
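To make that concrete, here is a minimal sketch of the accounting I
have in mind (this is not the actual patches; the struct and the names
below are made up for illustration): each vCPU remembers how much of
the global suspend total it has already folded into its steal time, so
every nanosecond of suspend is reported to the guest only once:

#include <stdint.h>
#include <stdio.h>

static uint64_t total_suspend_ns;	/* cumulative host suspend time */

struct vcpu {
	uint64_t steal_ns;		/* steal time exposed to the guest */
	uint64_t suspend_ns_reported;	/* share of the total already folded in */
};

/* Called whenever the steal-time record for @v is refreshed. */
static void fold_suspend_into_steal(struct vcpu *v)
{
	uint64_t delta = total_suspend_ns - v->suspend_ns_reported;

	v->steal_ns += delta;
	v->suspend_ns_reported = total_suspend_ns;
}

int main(void)
{
	struct vcpu v = { 0 };

	total_suspend_ns += 3000000000ULL;	/* host was suspended for 3s */
	fold_suspend_into_steal(&v);
	fold_suspend_into_steal(&v);		/* a second refresh adds nothing */

	printf("steal reported to the guest: %llu ns\n",
	       (unsigned long long)v.steal_ns);
	return 0;
}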
TSC adjustments would also be a way to address the issue, but we would
then need another mechanism to keep advancing the guest wall time
during the host suspension.
We tried that approach in the past, and it worked pretty well for us,
but it did not seem popular with the rest of the
community: https://lore.kernel.org/kvm/20210806100710.2425336-1-hikalium@chromium.org/T/
There is an additional gap with both approaches, which is that time
advances when the VM is blocked because the VMM hasn't called KVM_RUN.
Ideally, steal time would also include the time when the VM isn't being
scheduled because of the VMM (but maybe only when the blocking is due
to something outside of the guest's control), so that the guest
scheduler doesn't punish tasks for it. But that's a completely
different discussion, and I should probably not even be mentioning it
here.
I hope that helps give some background on these patches.
-- Suleiman