Message-ID: <ZrwdSLvlhde6uaAB@google.com>
Date: Tue, 13 Aug 2024 19:58:16 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: David Woodhouse <dwmw2@...radead.org>
Cc: kvm@...r.kernel.org, Paolo Bonzini <pbonzini@...hat.com>, 
	Jonathan Corbet <corbet@....net>, Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, 
	Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org, 
	"H. Peter Anvin" <hpa@...or.com>, Paul Durrant <paul@....org>, Peter Zijlstra <peterz@...radead.org>, 
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, 
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Daniel Bristot de Oliveira <bristot@...hat.com>, Valentin Schneider <vschneid@...hat.com>, Shuah Khan <shuah@...nel.org>, 
	linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org, 
	jalliste@...zon.co.uk, sveith@...zon.de, zide.chen@...el.com, 
	Dongli Zhang <dongli.zhang@...cle.com>, Chenyi Qiang <chenyi.qiang@...el.com>
Subject: Re: [RFC PATCH v3 09/21] KVM: x86: Fix KVM clock precision in __get_kvmclock()

On Wed, May 22, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@...zon.co.uk>
> 
> When in 'master clock mode' (i.e. when host and guest TSCs are behaving
> sanely and in sync), the KVM clock is defined in terms of the guest TSC.
> 
> When TSC scaling is used, calculating the KVM clock directly from *host*
> TSC cycles leads to a systemic drift from the values calculated by the
> guest from its TSC.
> 
> Commit 451a707813ae ("KVM: x86/xen: improve accuracy of Xen timers")
> had a simple workaround for the specific case of Xen timers, as it had an
> actual vCPU to hand and could use its scaling information. That commit
> noted that it was broken for the general case of get_kvmclock_ns(), and
> said "I'll come back to that".
> 
> Since __get_kvmclock() is invoked without a specific CPU, it needs to
> be able to find or generate the scaling values required to perform the
> correct calculation.
> 
> Thankfully, TSC scaling can only happen with X86_FEATURE_CONSTANT_TSC,
> so it isn't as complex as it might have been.
> 
> In __kvm_synchronize_tsc(), note the current vCPU's scaling ratio in
> kvm->arch.last_tsc_scaling_ratio. That is only protected by the
> tsc_write_lock, so in pvclock_update_vm_gtod_copy(), copy it into a
> separate kvm->arch.master_tsc_scaling_ratio so that it can be accessed
> using the kvm->arch.pvclock_sc seqcount lock. Also generate the mul and
> shift factors to convert to nanoseconds for the corresponding KVM clock,
> just as kvm_guest_time_update() would.
> 
> In __get_kvmclock(), which runs within a seqcount retry loop, use those
> values to convert host to guest TSC and then to nanoseconds. Only fall
> back to using get_kvmclock_base_ns() when not in master clock mode.
> 
> There was previously a code path in __get_kvmclock() which looked like
> it could set KVM_CLOCK_TSC_STABLE without KVM_CLOCK_REALTIME, perhaps
> even on 32-bit hosts. In practice that could never happen as the
> ka->use_master_clock flag couldn't be set on 32-bit, and even on 64-bit
> hosts it would never be set when the system clock isn't TSC-based. So
> that code path is now removed.

This should be a separate patch.  Actually, patches, plural.  More below.

> The kvm_get_wall_clock_epoch() function had the same problem; make it
> just call get_kvmclock() and subtract kvmclock from wallclock, with
> the same fallback as before.
> 
> Signed-off-by: David Woodhouse <dwmw@...zon.co.uk>
> ---

...

> @@ -3100,36 +3131,49 @@ static unsigned long get_cpu_tsc_khz(void)
>  static void __get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
>  {
>  	struct kvm_arch *ka = &kvm->arch;
> -	struct pvclock_vcpu_time_info hv_clock;
> +
> +#ifdef CONFIG_X86_64
> +	uint64_t cur_tsc_khz = 0;
> +	struct timespec64 ts;
>  
>  	/* both __this_cpu_read() and rdtsc() should be on the same cpu */
>  	get_cpu();
>  
> -	data->flags = 0;
>  	if (ka->use_master_clock &&
> -	    (static_cpu_has(X86_FEATURE_CONSTANT_TSC) || __this_cpu_read(cpu_tsc_khz))) {
> -#ifdef CONFIG_X86_64
> -		struct timespec64 ts;
> +	    (cur_tsc_khz = get_cpu_tsc_khz()) &&

That is mean.  And if you push it inside the if-statement, the {get,put}_cpu()
can be avoided when the master clock isn't being used, e.g.

	if (ka->use_master_clock) {
		/*
		 * The RDTSC needs to happen on the same CPU whose frequency is
		 * used to compute kvmclock's time.
		 */
		get_cpu();

		cur_tsc_khz = get_cpu_tsc_khz();
		if (cur_tsc_khz &&
		    !kvm_get_walltime_and_clockread(&ts, &data->host_tsc))
			cur_tsc_khz = 0;

		put_cpu();
	}

However, the changelog essentially claims kvm_get_walltime_and_clockread() should
never fail when use_master_clock is enabled, which suggests a WARN is warranted.

    There was previously a code path in __get_kvmclock() which looked like
    it could set KVM_CLOCK_TSC_STABLE without KVM_CLOCK_REALTIME, perhaps
    even on 32-bit hosts. In practice that could never happen as the
    ka->use_master_clock flag couldn't be set on 32-bit, and even on 64-bit
    hosts it would never be set when the system clock isn't TSC-based. So
    that code path is now removed.

But, I think kvm_get_walltime_and_clockread() can fail when use_master_clock is
true, i.e. I don't think a WARN is viable as it could get false positives.

Ah, this is protected by pvclock_sc, so a stale use_master_clock should result
in a retry.  What if we WARN on that?

Hrm, that requires plumbing in the original sequence count.  Ah, but looking at
the patch as a whole, if we keep kvm_get_wall_clock_epoch()'s style, then it's
much easier.  And FWIW, I like the existing kvm_get_wall_clock_epoch() style a
lot more than the get_kvmclock() => __get_kvmclock() approach.

So, can we do this as prep patch #1?

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9c14d0f5a684..98806a59e110 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3360,9 +3360,16 @@ uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm)
 
                local_tsc_khz = get_cpu_tsc_khz();
 
+               /*
+                * The master clock depends on the pvclock being based on TSC,
+                * so the only way kvm_get_walltime_and_clockread() can fail is
+                * if the clocksource changed and use_master_clock is stale, in
+                * which case a seqcount retry should be pending.
+                */
                if (local_tsc_khz &&
-                   !kvm_get_walltime_and_clockread(&ts, &host_tsc))
-                       local_tsc_khz = 0; /* Fall back to old method */
+                   !kvm_get_walltime_and_clockread(&ts, &host_tsc) &&
+                   WARN_ON_ONCE(!read_seqcount_retry(&ka->pvclock_sc, seq)))
+                           local_tsc_khz = 0; /* Fall back to old method */
 
                put_cpu();
 

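To spell out the rule the WARN encodes, here is a minimal single-threaded
userspace sketch.  Everything prefixed with toy_ is a hypothetical stand-in, not
the kernel's seqcount API, and there are no memory barriers or spinning readers:
a failed clock read is only unexpected when no writer has bumped the sequence
count since the reader sampled it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for seqcount_t: even = stable, odd = write in progress. */
struct toy_seqcount {
	unsigned int sequence;
};

struct toy_clock {
	struct toy_seqcount sc;
	bool use_master_clock;
};

/* Sample the sequence count at the start of a read section. */
static unsigned int toy_read_begin(const struct toy_seqcount *s)
{
	return s->sequence;
}

/* True if a writer ran since 'start', i.e. the reader must retry. */
static bool toy_read_retry(const struct toy_seqcount *s, unsigned int start)
{
	return s->sequence != start;
}

/* Model a clocksource change: the writer bumps the count around the update. */
static void toy_clocksource_changed(struct toy_clock *c)
{
	assert((c->sc.sequence & 1) == 0);	/* no nested writers in this toy */
	c->sc.sequence++;			/* odd: write section open */
	c->use_master_clock = false;
	c->sc.sequence++;			/* even again: write section closed */
}

/*
 * The condition the WARN would fire on: the clock read failed, yet no retry
 * is pending, so a stale use_master_clock cannot explain the failure.
 */
static bool toy_unexpected_failure(const struct toy_clock *c, unsigned int seq,
				   bool clockread_ok)
{
	return !clockread_ok && !toy_read_retry(&c->sc, seq);
}
```

In the toy, a failure observed after toy_clocksource_changed() is explained by
the pending retry and does not count as unexpected; a failure with an unchanged
sequence count does.
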
And then as patch(es) 2..7 (give or take)

  (2) fold __get_kvmclock() into get_kvmclock()
  (3) add the same WARN on the seqcount in get_kvmclock() (but skimp on the comments)
  (4) use get_kvmclock_base_ns() as the fallback in get_kvmclock(), i.e. delete
      the raw rdtsc() and setting of KVM_CLOCK_TSC_STABLE w/o KVM_CLOCK_REALTIME
  (5) use get_cpu_tsc_khz() instead of open coding something similar
  (6) scale TSC when computing kvmclock (the core of this patch)
  (7) use get_kvmclock() in kvm_get_wall_clock_epoch() as they will be 100%
      equivalent at this point.
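
For reference, the arithmetic behind (6) can be sketched in userspace roughly as
below.  This is an illustration under assumptions, not the kernel code: the
toy_ helpers are hypothetical, the scaling ratio is assumed to be fixed-point
with 48 fractional bits (the real width depends on the vendor), and
unsigned __int128 is a GCC/Clang extension.  The host TSC delta is first scaled
to guest cycles, and only then converted to nanoseconds with the guest's
mul/shift pair.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed number of fractional bits in the scaling ratio (illustrative). */
#define TOY_FRAC_BITS 48

/* Host cycles -> guest cycles: (cycles * ratio) >> frac_bits, 128-bit product. */
static uint64_t toy_scale_tsc(uint64_t host_cycles, uint64_t ratio)
{
	return (uint64_t)(((unsigned __int128)host_cycles * ratio) >> TOY_FRAC_BITS);
}

/* Guest cycles -> nanoseconds: pre-shift, then multiply by a 32.32 fraction. */
static uint64_t toy_scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	assert(shift > -64 && shift < 64);
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;
	return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* Master-clock style computation: base time plus the scaled, converted delta. */
static uint64_t toy_kvmclock_ns(uint64_t base_ns, uint64_t host_tsc,
				uint64_t master_cycle_now, uint64_t ratio,
				uint32_t mul, int8_t shift)
{
	uint64_t guest_cycles = toy_scale_tsc(host_tsc - master_cycle_now, ratio);

	return base_ns + toy_scale_delta(guest_cycles, mul, shift);
}
```

With a 2:1 ratio (1ULL << 49) and a 2 GHz guest TSC (mul = 0x80000000,
shift = 0), 1000 host cycles become 2000 guest cycles and then 1000 ns.
Converting the unscaled host cycles with the guest's mul/shift would yield
500 ns instead, which is the kind of mismatch the changelog describes.
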

> +	    !kvm_get_walltime_and_clockread(&ts, &data->host_tsc))
> +		cur_tsc_khz = 0;
>  
> -		if (kvm_get_walltime_and_clockread(&ts, &data->host_tsc)) {
> -			data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
> -			data->flags |= KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC;
> -		} else
> -#endif
> -		data->host_tsc = rdtsc();
> -
> -		data->flags |= KVM_CLOCK_TSC_STABLE;
> -		hv_clock.tsc_timestamp = ka->master_cycle_now;
> -		hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
> -		kvm_get_time_scale(NSEC_PER_SEC, get_cpu_tsc_khz() * 1000LL,
> -				   &hv_clock.tsc_shift,
> -				   &hv_clock.tsc_to_system_mul);
> -		data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
> -	} else {
> -		data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
> +	put_cpu();
> +
> +	if (cur_tsc_khz) {
> +		uint64_t tsc_cycles;
> +		uint32_t mul;
> +		int8_t shift;
> +
> +		tsc_cycles = data->host_tsc - ka->master_cycle_now;
> +
> +		if (kvm_caps.has_tsc_control)
> +			tsc_cycles = kvm_scale_tsc(tsc_cycles,
> +						   ka->master_tsc_scaling_ratio);
> +
> +		if (static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
> +			mul = ka->master_tsc_mul;
> +			shift = ka->master_tsc_shift;
> +		} else {
> +			kvm_get_time_scale(NSEC_PER_SEC, cur_tsc_khz * 1000LL,
> +					   &shift, &mul);
> +		}
> +		data->clock = ka->master_kernel_ns + ka->kvmclock_offset +
> +			pvclock_scale_delta(tsc_cycles, mul, shift);
> +		data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
> +		data->flags = KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC | KVM_CLOCK_TSC_STABLE;
> +		return;
>  	}
> +#endif
>  
> -	put_cpu();
> +	data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
> +	data->flags = 0;
>  }
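
As a side note on the non-CONSTANT_TSC branch above: the mul/shift pair is
derived from the current frequency.  The sketch below is a simplified userspace
approximation of that kind of derivation, assuming the (scaled_hz, base_hz,
&shift, &mul) calling convention seen in the hunk.  toy_time_scale() is a
hypothetical name and should not be read as a copy of kvm_get_time_scale().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Find shift and a 32-bit fractional multiplier such that
 *   ((cycles << shift) * mul) >> 32   (or >> -shift when shift is negative)
 * approximates cycles * scaled_hz / base_hz.
 */
static void toy_time_scale(uint64_t scaled_hz, uint64_t base_hz,
			   int8_t *shift_out, uint32_t *mul_out)
{
	uint64_t scaled64 = scaled_hz;
	uint64_t tps64 = base_hz;
	uint32_t tps32;
	int shift = 0;

	/* Zero frequencies would make the loops below spin forever. */
	assert(scaled_hz != 0 && base_hz != 0);

	/* Halve the base until it fits in 32 bits and is at most 2 * scaled. */
	while (tps64 > scaled64 * 2 || (tps64 >> 32)) {
		tps64 >>= 1;
		shift--;
	}

	/* Grow the base (or shrink scaled) until base > scaled and scaled fits. */
	tps32 = (uint32_t)tps64;
	while (tps32 <= scaled64 || (scaled64 >> 32)) {
		if ((scaled64 >> 32) || (tps32 & 0x80000000u))
			scaled64 >>= 1;
		else
			tps32 <<= 1;
		shift++;
	}

	*shift_out = (int8_t)shift;
	/* scaled64 < tps32 here, so the quotient fits in 32 bits. */
	*mul_out = (uint32_t)((scaled64 << 32) / tps32);
}
```

For a 2 GHz base and a 1 GHz (nanosecond) target this gives shift = 0 and
mul = 0x80000000, i.e. half a nanosecond per cycle.
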