[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CABCjUKDDDhXx8mSRKHCa34JjSX1nfM5WMG-UrPu9fjei6gkUJA@mail.gmail.com>
Date: Tue, 21 Jan 2025 14:37:50 +0900
From: Suleiman Souhlal <suleiman@...gle.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"H. Peter Anvin" <hpa@...or.com>, Chao Gao <chao.gao@...el.com>,
David Woodhouse <dwmw2@...radead.org>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
ssouhlal@...ebsd.org
Subject: Re: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().
On Sat, Jan 18, 2025 at 1:52 AM Sean Christopherson <seanjc@...gle.com> wrote:
>
> On Fri, Jan 17, 2025, Suleiman Souhlal wrote:
> > On Thu, Jan 16, 2025 at 6:49 AM Sean Christopherson <seanjc@...gle.com> wrote:
> > >
> > > On Tue, Jan 07, 2025, Suleiman Souhlal wrote:
> > > > It returns the cumulative nanoseconds that the host has been suspended.
> > > > It is intended to be used for reporting host suspend time to the guest.
> > >
> > > ...
> > >
> > > > #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> > > > +static int kvm_pm_notifier(struct kvm *kvm, unsigned long state)
> > > > +{
> > > > + switch (state) {
> > > > + case PM_HIBERNATION_PREPARE:
> > > > + case PM_SUSPEND_PREPARE:
> > > > + last_suspend = ktime_get_boottime_ns();
> > > > + case PM_POST_HIBERNATION:
> > > > + case PM_POST_SUSPEND:
> > > > + total_suspend_ns += ktime_get_boottime_ns() - last_suspend;
> > >
> > > After spending too much time poking around kvmlock and sched_clock code, I'm pretty
> > > sure that accounting *all* suspend time to steal_time is wildly inaccurate for
> > > most clocksources that will be used by KVM x86 guests.
> > >
> > > KVM already adjusts TSC, and by extension kvmclock, to account for the TSC going
> > > backwards due to suspend+resume. I haven't dug super deep, buy I assume/hope the
> > > majority of suspend time is handled by massaging guest TSC.
> > >
> > > There's still a notable gap, as KVM's TSC adjustments likely won't account for
> > > the lag between CPUs coming online and vCPU's being restarted, but I don't know
> > > that having KVM account the suspend duration is the right way to solve that issue.
> >
> > (It is my understanding that steal time has no impact on clock sources.)
> > On our machines, the problem isn't that the TSC is going backwards. As
> > you say, kvmclock takes care of that.
> >
> > The problem these patches are trying to solve is that the time keeps
> > advancing for the VM while the host is suspended.
>
> Right, the issue is that because KVM adjusts guest TSC if the host TSC does go
> backwards, then the accounting will be all kinds of messed up.
>
> 1. Initiate suspend at host TSC X, guest TSC X+Y.
>
> 2. Save X into last_host_tsc via kvm_arch_vcpu_put():
>
> vcpu->arch.last_host_tsc = rdtsc();
>
> 3. Resume after N hours, host TSC reset to 0 and starts counting.
>
> 4. kvm_arch_enable_virtualization_cpu() runs at new host time Z.
>
> 5. KVM detects backwards TSC (Z < X) and adjusts guest TSC offset so that guest
> TSC stays at/near X+Y, i.e. guest TSC becomes "Z + Y + (X - Z)".
>
> u64 delta_cyc = max_tsc - local_tsc;
> list_for_each_entry(kvm, &vm_list, vm_list) {
> kvm->arch.backwards_tsc_observed = true;
> kvm_for_each_vcpu(i, vcpu, kvm) {
> vcpu->arch.tsc_offset_adjustment += delta_cyc;
> vcpu->arch.last_host_tsc = local_tsc;
> kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
> }
>
> Thus, if the guest is using the TSC for sched_clock, guest time does NOT keep
> advancing.
>
> kvmclock on the other hand counts from *host* boot, and so guest time keeps
> advancing if the guest is using kvmclock.
>
> #ifdef CONFIG_X86_64
> static s64 get_kvmclock_base_ns(void)
> {
> /* Count up from boot time, but with the frequency of the raw clock. */
> return ktime_to_ns(ktime_add(ktime_get_raw(), pvclock_gtod_data.offs_boot));
> }
> #else
> static s64 get_kvmclock_base_ns(void)
> {
> /* Master clock not used, so we can just use CLOCK_BOOTTIME. */
> return ktime_get_boottime_ns();
> }
> #endif
>
> In short, AFAICT the issues you are observing are mostly a problem with kvmclock.
> Or maybe it's the other way around and effectively freezing guest TSC is super
> problematic and fundamentally flawed.
>
> Regardless of which one is "broken", unconditionally accounting suspend time to
> steal_time will do the wrong thing when sched_clock=tsc. To further muddy the
> waters, current Linux-as-a-guest on modern hardware will likely use clocksource=tsc,
> but sched_clock=kvmclock. In that scenario, guest time doesn't advanced, but
> guest scheduler time does. Ugh.
>
> That particular wart can be avoided by having the guest use TSC for sched_clock[*],
> e.g. so that at least the behavior of time is consistent.
>
> Hmm, if freezing guest time across suspend is indeed problematic, one thought
> would be to put the onus on the VMM/user to not advertise a "nonstop TSC" if the
> host may be suspending. The Linux-as-a-guest would prefer kvmclock over TSC for
> both clocksource and sched_clock.
>
> [*] https://lore.kernel.org/all/Z4gqlbumOFPF_rxd@google.com
I see what you're saying. Thanks for explaining.
To complicate things further there are also different kinds of
suspends. From what I've seen "shallow" (and/or "suspend-to-idle")
suspends don't stop the CPU, at least on our machines, so the host TSC
keeps ticking. On "deep" suspends, on the other hand, the TSC might go
backwards.
But I suppose if the guest uses kvmclock the behavior should be the
same in either case.
At least for our use case we would definitely want guest *wall* time
to keep advancing, so we would still want to use kvmclock.
Would accounting the suspend duration in steal time be acceptable if
it was conditional on the guest using kvmclock?
We would need a way for the host to be notified that the guest is
indeed using it, possibly by adding a new MSR to be written to in
kvm_cs_enable().
-- Suleiman
Powered by blists - more mailing lists