linux-kernel - Re: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Z4qK4B6taSoZTJMp@google.com>
Date: Fri, 17 Jan 2025 08:52:48 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Suleiman Souhlal <suleiman@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, Thomas Gleixner <tglx@...utronix.de>, 
	Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org, 
	"H. Peter Anvin" <hpa@...or.com>, Chao Gao <chao.gao@...el.com>, 
	David Woodhouse <dwmw2@...radead.org>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org, 
	ssouhlal@...ebsd.org
Subject: Re: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().

On Fri, Jan 17, 2025, Suleiman Souhlal wrote:
> On Thu, Jan 16, 2025 at 6:49 AM Sean Christopherson <seanjc@...gle.com> wrote:
> >
> > On Tue, Jan 07, 2025, Suleiman Souhlal wrote:
> > > It returns the cumulative nanoseconds that the host has been suspended.
> > > It is intended to be used for reporting host suspend time to the guest.
> >
> > ...
> >
> > >  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> > > +static int kvm_pm_notifier(struct kvm *kvm, unsigned long state)
> > > +{
> > > +     switch (state) {
> > > +     case PM_HIBERNATION_PREPARE:
> > > +     case PM_SUSPEND_PREPARE:
> > > +             last_suspend = ktime_get_boottime_ns();
> > > +     case PM_POST_HIBERNATION:
> > > +     case PM_POST_SUSPEND:
> > > +             total_suspend_ns += ktime_get_boottime_ns() - last_suspend;
> >
> > After spending too much time poking around kvmlock and sched_clock code, I'm pretty
> > sure that accounting *all* suspend time to steal_time is wildly inaccurate for
> > most clocksources that will be used by KVM x86 guests.
> >
> > KVM already adjusts TSC, and by extension kvmclock, to account for the TSC going
> > backwards due to suspend+resume.  I haven't dug super deep, buy I assume/hope the
> > majority of suspend time is handled by massaging guest TSC.
> >
> > There's still a notable gap, as KVM's TSC adjustments likely won't account for
> > the lag between CPUs coming online and vCPU's being restarted, but I don't know
> > that having KVM account the suspend duration is the right way to solve that issue.
> 
> (It is my understanding that steal time has no impact on clock sources.)
> On our machines, the problem isn't that the TSC is going backwards. As
> you say, kvmclock takes care of that.
> 
> The problem these patches are trying to solve is that the time keeps
> advancing for the VM while the host is suspended.

Right, the issue is that because KVM adjusts guest TSC if the host TSC does go
backwards, then the accounting will be all kinds of messed up.

  1. Initiate suspend at host TSC X, guest TSC X+Y.

  2. Save X into last_host_tsc via kvm_arch_vcpu_put():

	vcpu->arch.last_host_tsc = rdtsc();

  3. Resume after N hours, host TSC reset to 0 and starts counting.

  4. kvm_arch_enable_virtualization_cpu() runs at new host time Z.

  5. KVM detects backwards TSC (Z < X) and adjusts guest TSC offset so that guest
     TSC stays at/near X+Y, i.e. guest TSC becomes "Z + Y + (X - Z)".

		u64 delta_cyc = max_tsc - local_tsc;
		list_for_each_entry(kvm, &vm_list, vm_list) {
			kvm->arch.backwards_tsc_observed = true;
			kvm_for_each_vcpu(i, vcpu, kvm) {
				vcpu->arch.tsc_offset_adjustment += delta_cyc;
				vcpu->arch.last_host_tsc = local_tsc;
				kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
			}

Thus, if the guest is using the TSC for sched_clock, guest time does NOT keep
advancing.

kvmclock on the other hand counts from *host* boot, and so guest time keeps
advancing if the guest is using kvmclock.

  #ifdef CONFIG_X86_64
  static s64 get_kvmclock_base_ns(void)
  {
	/* Count up from boot time, but with the frequency of the raw clock.  */
	return ktime_to_ns(ktime_add(ktime_get_raw(), pvclock_gtod_data.offs_boot));
  }
  #else
  static s64 get_kvmclock_base_ns(void)
  {
	/* Master clock not used, so we can just use CLOCK_BOOTTIME.  */
	return ktime_get_boottime_ns();
  }
  #endif

In short, AFAICT the issues you are observing are mostly a problem with kvmclock.
Or maybe it's the other way around and effectively freezing guest TSC is super
problematic and fundamentally flawed.

Regardless of which one is "broken", unconditionally accounting suspend time to
steal_time will do the wrong thing when sched_clock=tsc.  To further muddy the
waters, current Linux-as-a-guest on modern hardware will likely use clocksource=tsc,
but sched_clock=kvmclock.  In that scenario, guest time doesn't advanced, but
guest scheduler time does.  Ugh.

That particular wart can be avoided by having the guest use TSC for sched_clock[*],
e.g. so that at least the behavior of time is consistent.

Hmm, if freezing guest time across suspend is indeed problematic, one thought
would be to put the onus on the VMM/user to not advertise a "nonstop TSC" if the
host may be suspending.  The Linux-as-a-guest would prefer kvmclock over TSC for
both clocksource and sched_clock.

[*] https://lore.kernel.org/all/Z4gqlbumOFPF_rxd@google.com