linux-kernel - Re: [PATCH RFC 1/1] KVM: x86: add param to update master clock periodically

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <34057852-f6c0-d6d5-261f-bbb5fa056425@oracle.com>
Date:   Wed, 4 Oct 2023 12:13:48 -0700
From:   Dongli Zhang <dongli.zhang@...cle.com>
To:     Sean Christopherson <seanjc@...gle.com>,
        David Woodhouse <dwmw2@...radead.org>
Cc:     Joe Jin <joe.jin@...cle.com>, x86@...nel.org, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, pbonzini@...hat.com,
        tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
        dave.hansen@...ux.intel.com
Subject: Re: [PATCH RFC 1/1] KVM: x86: add param to update master clock
 periodically

Hi Sean,

On 10/4/23 11:06 AM, Sean Christopherson wrote:
> On Wed, Oct 04, 2023, David Woodhouse wrote:
>> On Tue, 2023-10-03 at 17:04 -0700, Sean Christopherson wrote:
>>>> Can't we ensure that the kvmclock uses the *same* algorithm,
>>>> precisely, as CLOCK_MONOTONIC_RAW?
>>>
>>> Yes?  At least for sane hardware, after much staring, I think it's possible.
>>>
>>> It's tricky because the two algorithms are wierdly different, the PV clock algorithm
>>> is ABI and thus immutable, and Thomas and the timekeeping folks would rightly laugh
>>> at us for suggesting that we try to shove the pvclock algorithm into the kernel.
>>>
>>> The hardcoded shift right 32 in PV clock is annoying, but not the end of the world.
>>>
>>> Compile tested only, but I believe this math is correct.  And I'm guessing we'd
>>> want some safeguards against overflow, e.g. due to a multiplier that is too big.
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 6573c89c35a9..ae9275c3d580 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -3212,9 +3212,19 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>>>                                             v->arch.l1_tsc_scaling_ratio);
>>>  
>>>         if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
>>> -               kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
>>> -                                  &vcpu->hv_clock.tsc_shift,
>>> -                                  &vcpu->hv_clock.tsc_to_system_mul);
>>> +               u32 shift, mult;
>>> +
>>> +               clocks_calc_mult_shift(&mult, &shift, tgt_tsc_khz, NSEC_PER_MSEC, 600);
>>> +
>>> +               if (shift <= 32) {
>>> +                       vcpu->hv_clock.tsc_shift = 0;
>>> +                       vcpu->hv_clock.tsc_to_system_mul = mult * BIT(32 - shift);
>>> +               } else {
>>> +                       kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
>>> +                                          &vcpu->hv_clock.tsc_shift,
>>> +                                          &vcpu->hv_clock.tsc_to_system_mul);
>>> +               }
>>> +
>>>                 vcpu->hw_tsc_khz = tgt_tsc_khz;
>>>                 kvm_xen_update_tsc_info(v);
>>>         }
>>>
>>
>> I gave that a go on my test box, and for a TSC frequency of 2593992 kHz
>> it got mult=1655736523, shift=32 and took the 'happy' path instead of
>> falling back.
>>
>> It still drifts about the same though, using the same test as before:
>> https://urldefense.com/v3/__https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/kvmclock__;!!ACWV5N9M2RV99hQ!KEdDuRZIThXoz2zaZd3O9rk77ywSaHCQ92fTnc7PFP81bdOhTMvudMBReIfZcrm9AITeKw4kyMTmbPbJuA$ 
>>
>>
>> I was going to facetiously suggest that perhaps the kvmclock should
>> have leap nanoseconds... but then realised that that's basically what
>> Dongli's patch is *doing*. Maybe we just need to *recognise* that,
> 
> Yeah, I suspect trying to get kvmclock to always precisely align with the kernel's
> monotonic raw clock is a fool's errand.
> 
>> so rather than having a user-configured period for the update, KVM could
>> calculate the frequency for the updates based on the rate at which the clocks
>> would otherwise drift, and a maximum delta? Not my favourite option, but
>> perhaps better than nothing? 
> 
> Holy moly, the existing code for the periodic syncs/updates is a mess.  If I'm
> reading the code correctly, commits
> 
>   0061d53daf26 ("KVM: x86: limit difference between kvmclock updates")
>   7e44e4495a39 ("x86: kvm: rate-limit global clock updates")
>   332967a3eac0 ("x86: kvm: introduce periodic global clock updates")
> 
> splattered together an immpressively inefficient update mechanism.
> 
> On the first vCPU creation, KVM schedules kvmclock_sync_fn() at a hardcoded rate
> of 300hz.
> 
> 	if (kvmclock_periodic_sync && vcpu->vcpu_idx == 0)
> 		schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
> 						KVMCLOCK_SYNC_PERIOD);
> 
> That handler does two things: schedule "delayed" work kvmclock_update_fn() to
> be executed immediately, and reschedule kvmclock_sync_fn() at 300hz.
> kvmclock_sync_fn() then kicks *every* vCPU in the VM, i.e. KVM kicks every vCPU
> to sync kvmlock at a 300hz frequency.  
> 
> If we're going to kick every vCPU, then we might as well do a masterclock update,
> because the extra cost of synchronizing the masterclock is likely in the noise
> compared to kicking every vCPU.  There's also zero reason to do the work in vCPU
> context.
> 
> And because that's not enough, on pCPU migration or if the TSC is unstable,
> kvm_arch_vcpu_load() requests KVM_REQ_GLOBAL_CLOCK_UPDATE, which schedules
> kvmclock_update_fn() with a delay of 100ms.  The large delay is to play nice with
> unstable TSCs.  But if KVM is periodically doing clock updates on all vCPU,
> scheduling another update with a *longer* delay is silly.

We may need to add above message to the places, where
KVM_REQ_GLOBAL_CLOCK_UPDATE is replaced with KVM_REQ_CLOCK_UPDATE in the patch?

This helps understand why KVM_REQ_CLOCK_UPDATE is sometime enough.

> 
> The really, really stupid part of all is that the periodic syncs happen even if
> kvmclock isn't exposed to the guest.  *sigh*
> 
> So rather than add yet another periodic work function, I think we should clean up
> the mess we have, fix the whole "leapseconds" mess with the masterclock, and then
> tune the frequency (if necessary).
> 
> Something like the below is what I'm thinking.  Once the dust settles, I'd like
> to do dynamically enable/disable kvmclock_sync_work based on whether or not the
> VM actually has vCPU's with a pvclock, but that's definitely an enhancement that
> can go on top.
> 
> Does this look sane, or am I missing something?
> 
> ---
>  arch/x86/include/asm/kvm_host.h |  3 +-
>  arch/x86/kvm/x86.c              | 53 +++++++++++----------------------
>  2 files changed, 19 insertions(+), 37 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 34a64527654c..d108452fc301 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -98,7 +98,7 @@
>  	KVM_ARCH_REQ_FLAGS(14, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_SCAN_IOAPIC \
>  	KVM_ARCH_REQ_FLAGS(15, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> -#define KVM_REQ_GLOBAL_CLOCK_UPDATE	KVM_ARCH_REQ(16)
> +/* AVAILABLE BIT!!!!			KVM_ARCH_REQ(16) */
>  #define KVM_REQ_APIC_PAGE_RELOAD \
>  	KVM_ARCH_REQ_FLAGS(17, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_HV_CRASH		KVM_ARCH_REQ(18)
> @@ -1336,7 +1336,6 @@ struct kvm_arch {
>  	bool use_master_clock;
>  	u64 master_kernel_ns;
>  	u64 master_cycle_now;
> -	struct delayed_work kvmclock_update_work;
>  	struct delayed_work kvmclock_sync_work;
>  
>  	struct kvm_xen_hvm_config xen_hvm_config;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6573c89c35a9..5d35724f1963 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2367,7 +2367,7 @@ static void kvm_write_system_time(struct kvm_vcpu *vcpu, gpa_t system_time,
>  	}
>  
>  	vcpu->arch.time = system_time;
> -	kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
> +	kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

As mentioned above, we may need a comment here to explain why
KVM_REQ_CLOCK_UPDATE on the only vcpu is enough.

>  
>  	/* we verify if the enable bit is set... */
>  	if (system_time & 1)
> @@ -3257,30 +3257,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>  
>  #define KVMCLOCK_UPDATE_DELAY msecs_to_jiffies(100)
>  
> -static void kvmclock_update_fn(struct work_struct *work)
> -{
> -	unsigned long i;
> -	struct delayed_work *dwork = to_delayed_work(work);
> -	struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
> -					   kvmclock_update_work);
> -	struct kvm *kvm = container_of(ka, struct kvm, arch);
> -	struct kvm_vcpu *vcpu;
> -
> -	kvm_for_each_vcpu(i, vcpu, kvm) {
> -		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> -		kvm_vcpu_kick(vcpu);
> -	}
> -}
> -
> -static void kvm_gen_kvmclock_update(struct kvm_vcpu *v)
> -{
> -	struct kvm *kvm = v->kvm;
> -
> -	kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
> -	schedule_delayed_work(&kvm->arch.kvmclock_update_work,
> -					KVMCLOCK_UPDATE_DELAY);
> -}
> -
>  #define KVMCLOCK_SYNC_PERIOD (300 * HZ)

While David mentioned "maximum delta", how about to turn above into a module
param with the default 300HZ.

BTW, 300HZ should be enough for vCPU hotplug case, unless people prefer 1-hour
or 1-day.

>  
>  static void kvmclock_sync_fn(struct work_struct *work)
> @@ -3290,12 +3266,14 @@ static void kvmclock_sync_fn(struct work_struct *work)
>  					   kvmclock_sync_work);
>  	struct kvm *kvm = container_of(ka, struct kvm, arch);
>  
> -	if (!kvmclock_periodic_sync)
> -		return;
> +	if (ka->use_master_clock)
> +		kvm_update_masterclock(kvm);

Based on the source code, I think it is safe to call kvm_update_masterclock() here.

We want the masterclock to update only once. To call KVM_REQ_MASTERCLOCK_UPDATE
for each vCPU here is meaningless.

I just want to remind that this is shared workqueue. The workqueue stall
detection may report false positive (e.g., due to tsc_write_lock contention.
That should not be lock contensive).


Thank you very much!

Dongli Zhang

> +	else
> +		kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);
>  
> -	schedule_delayed_work(&kvm->arch.kvmclock_update_work, 0);
> -	schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
> -					KVMCLOCK_SYNC_PERIOD);
> +	if (kvmclock_periodic_sync)
> +		schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
> +				      KVMCLOCK_SYNC_PERIOD);
>  }
>  
>  /* These helpers are safe iff @msr is known to be an MCx bank MSR. */
> @@ -4845,7 +4823,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  		 * kvmclock on vcpu->cpu migration
>  		 */
>  		if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
> -			kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
> +			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>  		if (vcpu->cpu != cpu)
>  			kvm_make_request(KVM_REQ_MIGRATE_TIMER, vcpu);
>  		vcpu->cpu = cpu;
> @@ -10520,12 +10498,19 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  			__kvm_migrate_timers(vcpu);
>  		if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
>  			kvm_update_masterclock(vcpu->kvm);
> -		if (kvm_check_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu))
> -			kvm_gen_kvmclock_update(vcpu);
>  		if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
>  			r = kvm_guest_time_update(vcpu);
>  			if (unlikely(r))
>  				goto out;
> +
> +			/*
> +			 * Ensure all other vCPUs synchronize "soon", e.g. so
> +			 * that all vCPUs recognize NTP corrections and drift
> +			 * corrections (relative to the kernel's raw clock).
> +			 */
> +			if (!kvmclock_periodic_sync)
> +				schedule_delayed_work(&vcpu->kvm->arch.kvmclock_sync_work,
> +						      KVMCLOCK_UPDATE_DELAY);
>  		}
>  		if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu))
>  			kvm_mmu_sync_roots(vcpu);
> @@ -12345,7 +12330,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  	kvm->arch.hv_root_tdp = INVALID_PAGE;
>  #endif
>  
> -	INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
>  	INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
>  
>  	kvm_apicv_init(kvm);
> @@ -12387,7 +12371,6 @@ static void kvm_unload_vcpu_mmus(struct kvm *kvm)
>  void kvm_arch_sync_events(struct kvm *kvm)
>  {
>  	cancel_delayed_work_sync(&kvm->arch.kvmclock_sync_work);
> -	cancel_delayed_work_sync(&kvm->arch.kvmclock_update_work);
>  	kvm_free_pit(kvm);
>  }
>  
> 
> base-commit: e2c8c2928d93f64b976b9242ddb08684b8cdea8d