linux-kernel - Re: [PATCH v8 1/3] KVM: x86: Advance guest TSC after deep suspend.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aKZnT_57aPWfrfia@google.com>
Date: Wed, 20 Aug 2025 17:24:47 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Suleiman Souhlal <suleiman@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, Thomas Gleixner <tglx@...utronix.de>, 
	Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org, 
	"H. Peter Anvin" <hpa@...or.com>, Chao Gao <chao.gao@...el.com>, 
	David Woodhouse <dwmw2@...radead.org>, Sergey Senozhatsky <senozhatsky@...omium.org>, 
	Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>, Tzung-Bi Shih <tzungbi@...nel.org>, 
	John Stultz <jstultz@...gle.com>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org, 
	ssouhlal@...ebsd.org
Subject: Re: [PATCH v8 1/3] KVM: x86: Advance guest TSC after deep suspend.

On Tue, Jul 22, 2025, Suleiman Souhlal wrote:
> Try to advance guest TSC to current time after suspend when the host
> TSCs went backwards.
> 
> This makes the behavior consistent between suspends where host TSC
> resets and suspends where it doesn't, such as suspend-to-idle, where
> in the former case if the host TSC resets, the guests' would
> previously be "frozen" due to KVM's backwards TSC prevention, while
> in the latter case they would advance.

Before you waste too much time reading through the various pieces of feedback...

I'm leaning towards scrapping this patch.  I still like the idea, but making it
do the right thing given all the edge cases and caveats with TSC management in
KVM seems practically impossible.  E.g. as you called out in an earlier version,
fast-forwarding to "now" is probably undesirable if a vCPU's TSC was completel
disassociated from "now" and/or arch.kvmclock_offset.

We could probably figure out ways to cobble together a solution that works for
most situations, but given that no one is clamoring for KVM to operate this way,
I doubt it'd be worth the effort and complexity.  And it certainly isn't worth
taking on the risk of breaking an existing setup/user.

For the suspend steal time angle, a PV feature bit means both the host and the
guest need to opt-in.  A host opt-in means we can punt to documentation, e.g.
we can document that there are caveats with running VMs across deep suspend, and
the user should consider whether or not they care before enable suspend steal time.

And then if someone really wants KVM to fast-forward time (or comes up with a
simple solution), they can have honor of figuring out how to do so correctly for
all of the crazy TSC flows.

> Suggested-by: Sean Christopherson <seanjc@...gle.com>
> Signed-off-by: Suleiman Souhlal <suleiman@...gle.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  3 ++
>  arch/x86/kvm/x86.c              | 49 ++++++++++++++++++++++++++++++++-
>  2 files changed, 51 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index fb01e456b624..e57d51e9f2be 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1415,6 +1415,9 @@ struct kvm_arch {
>  	u64 cur_tsc_offset;
>  	u64 cur_tsc_generation;
>  	int nr_vcpus_matched_tsc;
> +#ifdef CONFIG_X86_64
> +	bool host_was_suspended;

Adding an #idfef to save a single bool isn't worth it, especially since it
necessitates a wrapper (or more #ifdefs).  For something like this, I'd just
set it unconditionally, and then esentially ignore it for 32-bit, along with a
comment explaining why we can't do anything useful for 32-bit.

> +#endif
>  
>  	u32 default_tsc_khz;
>  	bool user_set_tsc;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a9d992d5652f..422c7fcc5d83 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2779,7 +2779,7 @@ static inline void adjust_tsc_offset_guest(struct kvm_vcpu *vcpu,
>  	kvm_vcpu_write_tsc_offset(vcpu, tsc_offset + adjustment);
>  }
>  
> -static inline void adjust_tsc_offset_host(struct kvm_vcpu *vcpu, s64 adjustment)
> +static inline void __adjust_tsc_offset_host(struct kvm_vcpu *vcpu, s64 adjustment)
>  {
>  	if (vcpu->arch.l1_tsc_scaling_ratio != kvm_caps.default_tsc_scaling_ratio)
>  		WARN_ON(adjustment < 0);
> @@ -4995,6 +4995,52 @@ static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
>  
>  static DEFINE_PER_CPU(struct kvm_vcpu *, last_vcpu);
>  
> +#ifdef CONFIG_X86_64
> +static void kvm_set_host_was_suspended(struct kvm *kvm)
> +{
> +	kvm->arch.host_was_suspended = true;
> +}
> +
> +static void adjust_tsc_offset_host(struct kvm_vcpu *vcpu, u64 adj)
> +{
> +	unsigned long flags;
> +	struct kvm *kvm;
> +	bool advance;
> +	u64 kernel_ns, l1_tsc, offset, tsc_now;
> +
> +	kvm = vcpu->kvm;
> +	advance = kvm_get_time_and_clockread(&kernel_ns, &tsc_now);
> +	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
> +	/*
> +	 * Advance the guest's TSC to current time instead of only preventing
> +	 * it from going backwards, while making sure all the vCPUs use the
> +	 * same offset.
> +	 */
> +	if (kvm->arch.host_was_suspended && advance) {
> +		l1_tsc = nsec_to_cycles(vcpu,
> +					kvm->arch.kvmclock_offset + kernel_ns);
> +		offset = kvm_compute_l1_tsc_offset(vcpu, l1_tsc);

This is where my idea really falls apart.  For this to be correct, KVM would need
to know the relationship between a vCPU's TSC offset and kvmclock_offset.  The
simplest thing would likely be to do something like force an update if and only
if TSCs are matched across all vCPUs, but trying to reason about how that would
work when vCPUs account for the suspend time at different and arbitrary times
makes my head hurt.

One idea I might fiddle with is to snapshot get_kvmclock_base_ns() on suspend
and then grab it again on resume, and use _that_ delta for tsc_offset_adjustment.
But that's something that can be pursued separately (if at all; I don't see any
reason to gate suspend steal time on it.