Message-ID: <ZrwdSLvlhde6uaAB@google.com>
Date: Tue, 13 Aug 2024 19:58:16 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: David Woodhouse <dwmw2@...radead.org>
Cc: kvm@...r.kernel.org, Paolo Bonzini <pbonzini@...hat.com>,
Jonathan Corbet <corbet@....net>, Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"H. Peter Anvin" <hpa@...or.com>, Paul Durrant <paul@....org>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>, Valentin Schneider <vschneid@...hat.com>, Shuah Khan <shuah@...nel.org>,
linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
jalliste@...zon.co.uk, sveith@...zon.de, zide.chen@...el.com,
Dongli Zhang <dongli.zhang@...cle.com>, Chenyi Qiang <chenyi.qiang@...el.com>
Subject: Re: [RFC PATCH v3 09/21] KVM: x86: Fix KVM clock precision in __get_kvmclock()
On Wed, May 22, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@...zon.co.uk>
>
> When in 'master clock mode' (i.e. when host and guest TSCs are behaving
> sanely and in sync), the KVM clock is defined in terms of the guest TSC.
>
> When TSC scaling is used, calculating the KVM clock directly from *host*
> TSC cycles leads to a systemic drift from the values calculated by the
> guest from its TSC.
>
> Commit 451a707813ae ("KVM: x86/xen: improve accuracy of Xen timers")
> had a simple workaround for the specific case of Xen timers, as it had an
> actual vCPU to hand and could use its scaling information. That commit
> noted that it was broken for the general case of get_kvmclock_ns(), and
> said "I'll come back to that".
>
> Since __get_kvmclock() is invoked without a specific CPU, it needs to
> be able to find or generate the scaling values required to perform the
> correct calculation.
>
> Thankfully, TSC scaling can only happen with X86_FEATURE_CONSTANT_TSC,
> so it isn't as complex as it might have been.
>
> In __kvm_synchronize_tsc(), note the current vCPU's scaling ratio in
> kvm->arch.last_tsc_scaling_ratio. That is only protected by the
> tsc_write_lock, so in pvclock_update_vm_gtod_copy(), copy it into a
> separate kvm->arch.master_tsc_scaling_ratio so that it can be accessed
> using the kvm->arch.pvclock_sc seqcount lock. Also generate the mul and
> shift factors to convert to nanoseconds for the corresponding KVM clock,
> just as kvm_guest_time_update() would.
>
> In __get_kvmclock(), which runs within a seqcount retry loop, use those
> values to convert host to guest TSC and then to nanoseconds. Only fall
> back to using get_kvmclock_base_ns() when not in master clock mode.
>
> There was previously a code path in __get_kvmclock() which looked like
> it could set KVM_CLOCK_TSC_STABLE without KVM_CLOCK_REALTIME, perhaps
> even on 32-bit hosts. In practice that could never happen as the
> ka->use_master_clock flag couldn't be set on 32-bit, and even on 64-bit
> hosts it would never be set when the system clock isn't TSC-based. So
> that code path is now removed.
This should be a separate patch. Actually, patches, plural. More below.
> The kvm_get_wall_clock_epoch() function had the same problem; make it
> just call get_kvmclock() and subtract kvmclock from wallclock, with
> the same fallback as before.
>
> Signed-off-by: David Woodhouse <dwmw@...zon.co.uk>
> ---
...
> @@ -3100,36 +3131,49 @@ static unsigned long get_cpu_tsc_khz(void)
> static void __get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
> {
> struct kvm_arch *ka = &kvm->arch;
> - struct pvclock_vcpu_time_info hv_clock;
> +
> +#ifdef CONFIG_X86_64
> + uint64_t cur_tsc_khz = 0;
> + struct timespec64 ts;
>
> /* both __this_cpu_read() and rdtsc() should be on the same cpu */
> get_cpu();
>
> - data->flags = 0;
> if (ka->use_master_clock &&
> - (static_cpu_has(X86_FEATURE_CONSTANT_TSC) || __this_cpu_read(cpu_tsc_khz))) {
> -#ifdef CONFIG_X86_64
> - struct timespec64 ts;
> + (cur_tsc_khz = get_cpu_tsc_khz()) &&
That is mean. And if you push it inside the if-statement, the {get,put}_cpu()
can be avoided when the master clock isn't being used, e.g.
	if (ka->use_master_clock) {
		/*
		 * The RDTSC needs to happen on the same CPU whose frequency is
		 * used to compute kvmclock's time.
		 */
		get_cpu();
		cur_tsc_khz = get_cpu_tsc_khz();
		if (cur_tsc_khz &&
		    !kvm_get_walltime_and_clockread(&ts, &data->host_tsc))
			cur_tsc_khz = 0;
		put_cpu();
	}
However, the changelog essentially claims kvm_get_walltime_and_clockread() should
never fail when use_master_clock is enabled, which suggests a WARN is warranted.
  There was previously a code path in __get_kvmclock() which looked like
  it could set KVM_CLOCK_TSC_STABLE without KVM_CLOCK_REALTIME, perhaps
  even on 32-bit hosts. In practice that could never happen as the
  ka->use_master_clock flag couldn't be set on 32-bit, and even on 64-bit
  hosts it would never be set when the system clock isn't TSC-based. So
  that code path is now removed.
But, I think kvm_get_walltime_and_clockread() can fail when use_master_clock is
true, i.e. I don't think a WARN is viable as it could get false positives.
Ah, this is protected by pvclock_sc, so a stale use_master_clock should result
in a retry. What if we WARN on that?
Hrm, that requires plumbing in the original sequence count. Ah, but looking at
the patch as a whole, if we keep kvm_get_wall_clock_epoch()'s style, then it's
much easier. And FWIW, I like the existing kvm_get_wall_clock_epoch() style a
lot more than the get_kvmclock() => __get_kvmclock() approach.
So, can we do this as prep patch #1?
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9c14d0f5a684..98806a59e110 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3360,9 +3360,16 @@ uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm)
 		local_tsc_khz = get_cpu_tsc_khz();
+		/*
+		 * The master clock depends on the pvclock being based on TSC,
+		 * so the only way kvm_get_walltime_and_clockread() can fail is
+		 * if the clocksource changed and use_master_clock is stale, in
+		 * which case a seqcount retry should be pending.
+		 */
 		if (local_tsc_khz &&
-		    !kvm_get_walltime_and_clockread(&ts, &host_tsc))
-			local_tsc_khz = 0; /* Fall back to old method */
+		    !kvm_get_walltime_and_clockread(&ts, &host_tsc) &&
+		    WARN_ON_ONCE(!read_seqcount_retry(&ka->pvclock_sc, seq)))
+			local_tsc_khz = 0; /* Fall back to old method */
 		put_cpu();
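To illustrate the reasoning behind that WARN, here is a rough userspace model, not kernel code. Everything prefixed with toy_ is made up for the example, the seqcount is a bare counter with no memory barriers, and the racing writer is simulated by a callback invoked in the middle of the read side. It only shows the property being relied on: if the clock read fails after use_master_clock was sampled as true, the sequence count is expected to have moved, so the reader retries instead of warning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for seqcount_t: a bare counter, no barriers, no lockdep. */
struct toy_seqcount { unsigned int sequence; };

struct toy_clock {
	struct toy_seqcount sc;
	bool use_master_clock;	/* updated only inside a write section */
	bool tsc_clocksource;	/* what the clock read really depends on */
};

struct toy_result {
	bool used_master_clock;
	bool would_warn;
	unsigned int passes;
};

static unsigned int toy_read_begin(const struct toy_seqcount *sc)
{
	/* Single-threaded model: a writer is never mid-update here. */
	assert((sc->sequence & 1) == 0);
	return sc->sequence;
}

static bool toy_read_retry(const struct toy_seqcount *sc, unsigned int seq)
{
	return sc->sequence != seq;
}

/* Hypothetical writer: the clocksource moved away from TSC. */
static void toy_disable_master_clock(struct toy_clock *c)
{
	c->sc.sequence++;		/* write section begins (odd) */
	c->tsc_clocksource = false;
	c->use_master_clock = false;
	c->sc.sequence++;		/* write section ends (even) */
}

static struct toy_result toy_get_clock(struct toy_clock *c,
				       void (*racing_writer)(struct toy_clock *))
{
	struct toy_result res = { false, false, 0 };
	unsigned int seq;

	do {
		bool master, read_ok = false;

		seq = toy_read_begin(&c->sc);
		res.passes++;
		master = c->use_master_clock;

		/* Simulate a writer landing after the flag was sampled. */
		if (racing_writer) {
			racing_writer(c);
			racing_writer = NULL;
		}

		if (master) {
			read_ok = c->tsc_clocksource;
			/* The proposed check: a failed read with no retry pending. */
			if (!read_ok && !toy_read_retry(&c->sc, seq))
				res.would_warn = true;
		}
		res.used_master_clock = master && read_ok;
	} while (toy_read_retry(&c->sc, seq));

	return res;
}
```

In this model, with no writer and a TSC clocksource the loop makes one pass. With the simulated writer it makes two passes and never flags a WARN. The WARN is flagged only when the clock read fails with no retry pending, and whether that case can really happen in KVM is exactly the open question here.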
And then as patch(es) 2..7 (give or take)
 (2) fold __get_kvmclock() into get_kvmclock()
 (3) add the same WARN on the seqcount in get_kvmclock() (but skimp on the comments)
 (4) use get_kvmclock_base_ns() as the fallback in get_kvmclock(), i.e. delete
     the raw rdtsc() and setting of KVM_CLOCK_TSC_STABLE w/o KVM_CLOCK_REALTIME
 (5) use get_cpu_tsc_khz() instead of open coding something similar
 (6) scale TSC when computing kvmclock (the core of this patch)
 (7) use get_kvmclock() in kvm_get_wall_clock_epoch() as the two will be 100%
     equivalent at this point.
> + !kvm_get_walltime_and_clockread(&ts, &data->host_tsc))
> + cur_tsc_khz = 0;
>
> - if (kvm_get_walltime_and_clockread(&ts, &data->host_tsc)) {
> - data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
> - data->flags |= KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC;
> - } else
> -#endif
> - data->host_tsc = rdtsc();
> -
> - data->flags |= KVM_CLOCK_TSC_STABLE;
> - hv_clock.tsc_timestamp = ka->master_cycle_now;
> - hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
> - kvm_get_time_scale(NSEC_PER_SEC, get_cpu_tsc_khz() * 1000LL,
> - &hv_clock.tsc_shift,
> - &hv_clock.tsc_to_system_mul);
> - data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
> - } else {
> - data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
> + put_cpu();
> +
> + if (cur_tsc_khz) {
> + uint64_t tsc_cycles;
> + uint32_t mul;
> + int8_t shift;
> +
> + tsc_cycles = data->host_tsc - ka->master_cycle_now;
> +
> + if (kvm_caps.has_tsc_control)
> + tsc_cycles = kvm_scale_tsc(tsc_cycles,
> + ka->master_tsc_scaling_ratio);
> +
> + if (static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
> + mul = ka->master_tsc_mul;
> + shift = ka->master_tsc_shift;
> + } else {
> + kvm_get_time_scale(NSEC_PER_SEC, cur_tsc_khz * 1000LL,
> + &shift, &mul);
> + }
> + data->clock = ka->master_kernel_ns + ka->kvmclock_offset +
> + pvclock_scale_delta(tsc_cycles, mul, shift);
> + data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
> + data->flags = KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC | KVM_CLOCK_TSC_STABLE;
> + return;
> }
> +#endif
>
> - put_cpu();
> + data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
> + data->flags = 0;
> }
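For reference, the arithmetic in the new master clock path boils down to two fixed-point scalings applied in sequence: host TSC delta to guest TSC delta, then guest TSC delta to nanoseconds. The following is a userspace sketch under assumptions, not the kernel implementation: the toy_ names are hypothetical, unsigned __int128 is a GCC/Clang extension standing in for the kernel's wide multiply helpers, and the number of fractional bits in the scaling ratio is passed in explicitly because it depends on the hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Host cycles -> guest cycles: (cycles * ratio) >> frac_bits. */
static uint64_t toy_scale_tsc(uint64_t host_cycles, uint64_t ratio,
			      unsigned int frac_bits)
{
	assert(frac_bits > 0 && frac_bits < 64);
	return (uint64_t)(((unsigned __int128)host_cycles * ratio) >> frac_bits);
}

/* Cycles -> ns, pvclock style: pre-shift, then multiply by a 0.32 fraction. */
static uint64_t toy_scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	assert(shift > -64 && shift < 64);
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;
	return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* Hypothetical snapshot of the values read under the seqcount. */
struct toy_master_clock {
	uint64_t master_cycle_now;	/* host TSC at the master clock snapshot */
	uint64_t master_kernel_ns;	/* kernel time at the same instant */
	int64_t kvmclock_offset;
	uint64_t scaling_ratio;		/* guest/host TSC frequency ratio */
	unsigned int frac_bits;		/* fractional bits in scaling_ratio */
	uint32_t mul;			/* guest TSC -> ns multiplier */
	int8_t shift;			/* guest TSC -> ns pre-shift */
};

static uint64_t toy_kvmclock_ns(const struct toy_master_clock *mc,
				uint64_t host_tsc)
{
	/* Scale the host delta to guest cycles first, then to nanoseconds. */
	uint64_t guest_cycles = toy_scale_tsc(host_tsc - mc->master_cycle_now,
					      mc->scaling_ratio, mc->frac_bits);

	return mc->master_kernel_ns + mc->kvmclock_offset +
	       toy_scale_delta(guest_cycles, mc->mul, mc->shift);
}
```

E.g. with a guest running at half the host TSC frequency (ratio 0.5 with 48 fractional bits, chosen only for the example) and a 1 GHz guest TSC (mul = 0x80000000, shift = 1), 2000 host cycles past master_cycle_now become 1000 guest cycles and then 1000ns.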