linux-kernel - Re: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Z5AB-6bLRNLle27G@google.com>
Date: Tue, 21 Jan 2025 12:22:19 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Suleiman Souhlal <suleiman@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, Thomas Gleixner <tglx@...utronix.de>, 
	Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org, 
	"H. Peter Anvin" <hpa@...or.com>, Chao Gao <chao.gao@...el.com>, 
	David Woodhouse <dwmw2@...radead.org>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org, 
	ssouhlal@...ebsd.org
Subject: Re: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().

On Tue, Jan 21, 2025, Suleiman Souhlal wrote:
> On Sat, Jan 18, 2025 at 1:52 AM Sean Christopherson <seanjc@...gle.com> wrote:
> > In short, AFAICT the issues you are observing are mostly a problem with kvmclock.
> > Or maybe it's the other way around and effectively freezing guest TSC is super
> > problematic and fundamentally flawed.
> >
> > Regardless of which one is "broken", unconditionally accounting suspend time to
> > steal_time will do the wrong thing when sched_clock=tsc.  To further muddy the
> > waters, current Linux-as-a-guest on modern hardware will likely use clocksource=tsc,
> > but sched_clock=kvmclock.  In that scenario, guest time doesn't advanced, but
> > guest scheduler time does.  Ugh.
> >
> > That particular wart can be avoided by having the guest use TSC for sched_clock[*],
> > e.g. so that at least the behavior of time is consistent.
> >
> > Hmm, if freezing guest time across suspend is indeed problematic, one thought
> > would be to put the onus on the VMM/user to not advertise a "nonstop TSC" if the
> > host may be suspending.  The Linux-as-a-guest would prefer kvmclock over TSC for
> > both clocksource and sched_clock.
> >
> > [*] https://lore.kernel.org/all/Z4gqlbumOFPF_rxd@google.com
> 
> I see what you're saying. Thanks for explaining.
> 
> To complicate things further there are also different kinds of
> suspends. From what I've seen "shallow" (and/or "suspend-to-idle")
> suspends don't stop the CPU, at least on our machines, so the host TSC
> keeps ticking. On "deep" suspends, on the other hand, the TSC might go
> backwards.

Yeah, only S3 and lower will power down the CPU.  All bets are off if the CPU
doesn't have a nonstop TSC, but that's not at all unique to suspend, e.g. it's a
problem if the CPU goes idle, and so I think it's safe to only worry about CPUs
with nonstop TSC.

> But I suppose if the guest uses kvmclock the behavior should be the
> same in either case.
> 
> At least for our use case we would definitely want guest *wall* time
> to keep advancing, so we would still want to use kvmclock.
> 
> Would accounting the suspend duration in steal time be acceptable if
> it was conditional on the guest using kvmclock?
> We would need a way for the host to be notified that the guest is
> indeed using it,

And not just using kvmclock, but specifically using for sched_clock.  E.g. the
current behavior for most Linux guests on modern hardware is that they'll use TSC
for clocksource, but kvmclock for sched_clock and wall clock.

> possibly by adding a new MSR to be written to in
> kvm_cs_enable().

I don't think that's a good way forward.  I expect kvmclock to be largely
deprecated (guest side) in favor of raw TSC (with hardware-provided scale+offset),
at which point tying this to kvmclock puts us back at square one.

Given that s2idle and standby don't reset host TSC, I think the right way to
handle this conundrum is to address the flaw that's noted in the "backwards TSC"
logic, and adjust guest TSC to be fully up-to-date in the S3 (or lower) case.

	 * ......................................  Unfortunately, we can't
	 * bring the TSCs fully up to date with real time, as we aren't yet far
	 * enough into CPU bringup that we know how much real time has actually
	 * elapsed; our helper function, ktime_get_boottime_ns() will be using boot
	 * variables that haven't been updated yet.

I have no idea why commit 0dd6a6edb012 ("KVM: Dont mark TSC unstable due to S4
suspend") hooked kvm_arch_enable_virtualization_cpu() instead of implementing a
PM notifier, but I don't see anything that suggests it was deliberate, i.e. that
KVm *needs* to effectively snapshot guest TSC when onlining CPUs.

If that wart is fixed, then both kvmclock and TSC will account host suspend time,
and KVM can safely account the suspend time into steal time regardless of which
clock(s) the guest is using.