[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5ea168df6dfbe910524a381b88347636e1a6a3bc.camel@infradead.org>
Date: Fri, 13 Oct 2023 19:21:40 +0100
From: David Woodhouse <dwmw2@...radead.org>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Dongli Zhang <dongli.zhang@...cle.com>,
Joe Jin <joe.jin@...cle.com>, x86@...nel.org,
kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
pbonzini@...hat.com, tglx@...utronix.de, mingo@...hat.com,
bp@...en8.de, dave.hansen@...ux.intel.com
Subject: Re: [PATCH RFC 1/1] KVM: x86: add param to update master clock
periodically
On Fri, 2023-10-13 at 11:07 -0700, Sean Christopherson wrote:
> On Wed, Oct 11, 2023, David Woodhouse wrote:
> > On Tue, 2023-10-10 at 17:20 -0700, Sean Christopherson wrote:
> > > On Wed, Oct 04, 2023, Dongli Zhang wrote:
> > > > > -static void kvm_gen_kvmclock_update(struct kvm_vcpu *v)
> > > > > -{
> > > > > - struct kvm *kvm = v->kvm;
> > > > > -
> > > > > - kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
> > > > > - schedule_delayed_work(&kvm->arch.kvmclock_update_work,
> > > > > - KVMCLOCK_UPDATE_DELAY);
> > > > > -}
> > > > > -
> > > > > #define KVMCLOCK_SYNC_PERIOD (300 * HZ)
> > > >
> > > > While David mentioned "maximum delta", how about to turn above into a module
> > > > param with the default 300HZ.
> > > >
> > > > BTW, 300HZ should be enough for vCPU hotplug case, unless people prefer 1-hour
> > > > or 1-day.
> > >
> > > Hmm, I think I agree with David that it would be better if KVM can take care of
> > > the gory details and promise a certain level of accuracy. I'm usually a fan of
> > > punting complexity to userspace, but requiring every userspace to figure out the
> > > ideal sync frequency on every platform is more than a bit unfriendly. And it
> > > might not even be realistic unless userspace makes assumptions about how the kernel
> > > computes CLOCK_MONOTONIC_RAW from TSC cycles.
> > >
> >
> > I think perhaps I would rather save up my persuasiveness on the topic
> > of "let's not make things too awful for userspace to cope with" for the
> > live update/migration mess. I think I need to dust off that attempt at
> > fixing our 'how to migrate with clocks intact' documentation from
> > https://lore.kernel.org/kvm/13f256ad95de186e3b6bcfcc1f88da5d0ad0cb71.camel@infradead.org/
> > The changes we're discussing here obviously have an effect on migration
> > too.
> >
> > Where the host TSC is actually reliable, I would really prefer for the
> > kvmclock to just be a fixed function of the guest TSC and *not* to be
> > arbitrarily yanked back[1] to the host's CLOCK_MONOTONIC periodically.
>
> CLOCK_MONOTONIC_RAW! Just wanted to clarify because if kvmclock were tied to the
> non-raw clock, then we'd have to somehow reconcile host NTP updates.
Sorry, yes. CLOCK_MONOTONIC_RAW. That was just a typo in email.
Of course we'd never try to use CLOCK_MONOTONIC; that would be daft.
We'd never do that. Just like we'd never do something daft like using
CLOCK_REALTIME instead of CLOCK_TAI for the KVM_CLOCK_REALTIME pairing
in __get_kvmclock()... oh.
Shit.
> I generally support the idea, but I think it needs to an opt-in from userspace.
> Essentially a "I pinky swear to give all vCPUs the same TSC frequency, to not
> suspend the host, and to not run software/firmware that writes IA32_TSC_ADJUST".
> AFAICT, there are too many edge cases and assumptions about userspace for KVM to
> safely couple kvmclock to guest TSC by default.
I think IA32_TSC_ADJUST is OK, isn't it? There is a "real" TSC value
and if vCPUs adjust themselves forward and backwards from that, it's
just handled as a delta.
And we solved 'give all vCPUS the same TSC frequency' by making that
KVM-wide.
Maybe suspending and resuming the host can be treated like live
migration, where you know the host TSC is different so you have to make
do with a delta based on CLOCK_TAI.
But while I'm picking on the edge cases and suggesting that we *can*
cope with some of them, I do agree with your suggestion that "let
kvmclock run by itself without being clamped back to
CLOCK_MONOTONIC_RAW" should be an opt *in* feature.
> > [1] Yes, I believe "back" does happen. I have test failures in my queue
> > to look at, where guests see the "Xen" clock going backwards.
>
> Yeah, I assume "back" can happen based purely on the wierdness of the pvclock math.o
>
> What if we add a module param to disable KVM's TSC synchronization craziness
> entirely? If we first clean up the peroidic sync mess, then it seems like it'd
> be relatively straightforward to let kill off all of the synchronization, including
> the synchronization of kvmclock to the host's TSC-based CLOCK_MONOTONIC_RAW.
>
> Not intended to be a functional patch...
Will stare harder at the actual patch when it isn't Friday night.
In the meantime, I do think a KVM cap that the VMM opts into is better
than a module param?
> ---
> arch/x86/kvm/x86.c | 35 ++++++++++++++++++++++++++++++++---
> 1 file changed, 32 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5b2104bdd99f..75fc6cbaef0d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -157,6 +157,9 @@ module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
> static bool __read_mostly kvmclock_periodic_sync = true;
> module_param(kvmclock_periodic_sync, bool, S_IRUGO);
>
> +static bool __read_mostly enable_tsc_sync = true;
> +module_param_named(tsc_synchronization, enable_tsc_sync, bool, 0444);
> +
> /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
> static u32 __read_mostly tsc_tolerance_ppm = 250;
> module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
> @@ -2722,6 +2725,12 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
> bool matched = false;
> bool synchronizing = false;
>
> + if (!enable_tsc_sync) {
> + offset = kvm_compute_l1_tsc_offset(vcpu, data);
> + kvm_vcpu_write_tsc_offset(vcpu, offset);
> + return;
> + }
> +
> raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
> offset = kvm_compute_l1_tsc_offset(vcpu, data);
> ns = get_kvmclock_base_ns();
> @@ -2967,9 +2976,12 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
> &ka->master_kernel_ns,
> &ka->master_cycle_now);
>
> - ka->use_master_clock = host_tsc_clocksource && vcpus_matched
> - && !ka->backwards_tsc_observed
> - && !ka->boot_vcpu_runs_old_kvmclock;
> + WARN_ON_ONCE(!host_tsc_clocksource && !enable_tsc_sync);
> +
> + ka->use_master_clock = host_tsc_clocksource &&
> + (vcpus_matched || !enable_tsc_sync) &&
> + !ka->backwards_tsc_observed &&
> + !ka->boot_vcpu_runs_old_kvmclock;
>
> if (ka->use_master_clock)
> atomic_set(&kvm_guest_has_master_clock, 1);
> @@ -3278,6 +3290,9 @@ static void kvmclock_sync_fn(struct work_struct *work)
>
> void kvm_adjust_pv_clock_users(struct kvm *kvm, bool add_user)
> {
> + if (!enable_tsc_sync)
> + return;
> +
> /*
> * Doesn't need to be a spinlock, but can't be kvm->lock as this is
> * call while holding a vCPU's mutext.
> @@ -5528,6 +5543,11 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
> if (get_user(offset, uaddr))
> break;
>
> + if (!enable_tsc_sync) {
> + kvm_vcpu_write_tsc_offset(vcpu, offset);
> + break;
> + }
> +
> raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
>
> matched = (vcpu->arch.virtual_tsc_khz &&
> @@ -12188,6 +12208,9 @@ int kvm_arch_hardware_enable(void)
> if (ret != 0)
> return ret;
>
> + if (!enable_tsc_sync)
> + return 0;
> +
> local_tsc = rdtsc();
> stable = !kvm_check_tsc_unstable();
> list_for_each_entry(kvm, &vm_list, vm_list) {
> @@ -13670,6 +13693,12 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_exit);
>
> static int __init kvm_x86_init(void)
> {
> + if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
> + enable_tsc_sync = true;
> +
> + if (!enable_tsc_sync)
> + kvmclock_periodic_sync = false;
> +
> kvm_mmu_x86_module_init();
> mitigate_smt_rsb &= boot_cpu_has_bug(X86_BUG_SMT_RSB) && cpu_smt_possible();
> return 0;
>
> base-commit: 7d2edad0beb2a6f07f6e6c2d477d5874f5417d6c
Download attachment "smime.p7s" of type "application/pkcs7-signature" (5965 bytes)
Powered by blists - more mailing lists