Date:   Wed, 04 Oct 2023 11:01:12 +0100
From:   David Woodhouse <dwmw2@...radead.org>
To:     Sean Christopherson <seanjc@...gle.com>
Cc:     Dongli Zhang <dongli.zhang@...cle.com>,
        Joe Jin <joe.jin@...cle.com>, x86@...nel.org,
        kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
        pbonzini@...hat.com, tglx@...utronix.de, mingo@...hat.com,
        bp@...en8.de, dave.hansen@...ux.intel.com
Subject: Re: [PATCH RFC 1/1] KVM: x86: add param to update master clock
 periodically

On Tue, 2023-10-03 at 17:04 -0700, Sean Christopherson wrote:
> On Tue, Oct 03, 2023, David Woodhouse wrote:
> > On Mon, 2023-10-02 at 17:53 -0700, Sean Christopherson wrote:
> > > 
> > > The two domains use the same "clock" (constant TSC), but different math to compute
> > > nanoseconds from a given TSC value.  For decently large TSC values, this results
> > > in CLOCK_MONOTONIC_RAW and kvmclock computing two different times in nanoseconds.
> > 
> > This is the bit I'm still confused about, and it seems to be the root
> > of all the other problems.
> > 
> > Both CLOCK_MONOTONIC_RAW and kvmclock have *one* job: to convert a
> > number of ticks of the TSC running at a constant known frequency, to a
> > number of nanoseconds.
> > 
> > So how in the name of all that is holy do they manage to get
> > *different* answers?
> > 
> > I get that the mult/shift thing carries some imprecision, but is that
> > all it is? 
> 
> Yep, pretty sure that's it.  It's like the plot from Office Space / Superman III.
> Those little rounding errors add up over time.
> 
> PV clock:
> 
>   nanoseconds = ((TSC >> shift) * mult) >> 32
> 
> or 
> 
>   nanoseconds = ((TSC << shift) * mult) >> 32
> 
> versus timekeeping (mostly)
> 
>   nanoseconds = (TSC * mult) >> shift
> 
> The more I look at the PV clock stuff, the more I agree with Peter: it's garbage.
> Shifting before multiplying is guaranteed to introduce error.  Shifting right drops
> data, and shifting left introduces zeros.
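
To put a number on "those little rounding errors add up": here's a throwaway
userspace sketch, not kernel code. The first (mult, shift) pair is the one
clocks_calc_mult_shift() gave me for 2593992 kHz (see below); the second is an
*illustrative* pvclock-style pair one ULP away from 2x the first, not anything
kvm_get_time_scale() actually produces.

#include <stdio.h>
#include <stdint.h>

/* timekeeping style: ns = (tsc * mult) >> shift */
static uint64_t tk_ns(uint64_t tsc, uint32_t mult, uint32_t shift)
{
        return ((__uint128_t)tsc * mult) >> shift;
}

/* pvclock style: shift first (negative tsc_shift means shift right),
 * then multiply and shift right by the hardcoded 32 */
static uint64_t pv_ns(uint64_t tsc, uint32_t mult, int tsc_shift)
{
        uint64_t t = tsc_shift < 0 ? tsc >> -tsc_shift : tsc << tsc_shift;

        return ((__uint128_t)t * mult) >> 32;
}

int main(void)
{
        /* ~0.385506 ns/tick for a 2593992 kHz TSC, approximated two ways */
        uint32_t tk_mult = 1655736523, tk_shift = 32;
        uint32_t pv_mult = 3311473045;          /* illustrative: 2 * tk_mult - 1 */
        int pv_shift = -1;
        uint64_t tsc = 2593992000ULL * 86400;   /* one day of ticks */
        uint64_t tk = tk_ns(tsc, tk_mult, tk_shift);
        uint64_t pv = pv_ns(tsc, pv_mult, pv_shift);

        /* prints a divergence of roughly 26 microseconds after one day */
        printf("tk=%llu ns  pv=%llu ns  delta=%lld ns\n",
               (unsigned long long)tk, (unsigned long long)pv,
               (long long)(tk - pv));
        return 0;
}

A single ULP in the scaled multiplier is worth about 26us/day at that
frequency, so the exact rounding really does matter.
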
> 
> > Can't we ensure that the kvmclock uses the *same* algorithm,
> > precisely, as CLOCK_MONOTONIC_RAW?
> 
> Yes?  At least for sane hardware, after much staring, I think it's possible.
> 
> It's tricky because the two algorithms are weirdly different, the PV clock algorithm
> is ABI and thus immutable, and Thomas and the timekeeping folks would rightly laugh
> at us for suggesting that we try to shove the pvclock algorithm into the kernel.
> 
> The hardcoded shift right 32 in PV clock is annoying, but not the end of the world.
> 
> Compile tested only, but I believe this math is correct.  And I'm guessing we'd
> want some safeguards against overflow, e.g. due to a multiplier that is too big.
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6573c89c35a9..ae9275c3d580 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3212,9 +3212,19 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>                                             v->arch.l1_tsc_scaling_ratio);
>  
>         if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
> -               kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
> -                                  &vcpu->hv_clock.tsc_shift,
> -                                  &vcpu->hv_clock.tsc_to_system_mul);
> +               u32 shift, mult;
> +
> +               clocks_calc_mult_shift(&mult, &shift, tgt_tsc_khz, NSEC_PER_MSEC, 600);
> +
> +               if (shift <= 32) {
> +                       vcpu->hv_clock.tsc_shift = 0;
> +                       vcpu->hv_clock.tsc_to_system_mul = mult * BIT(32 - shift);
> +               } else {
> +                       kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
> +                                          &vcpu->hv_clock.tsc_shift,
> +                                          &vcpu->hv_clock.tsc_to_system_mul);
> +               }
> +
>                 vcpu->hw_tsc_khz = tgt_tsc_khz;
>                 kvm_xen_update_tsc_info(v);
>         }
> 
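
As a sanity check on the shift <= 32 branch: folding the shift into the
multiplier is lossless for the pair it just computed, i.e.
((tsc * (mult << (32 - shift))) >> 32) == ((tsc * mult) >> shift) for any tsc,
as long as mult << (32 - shift) still fits in the guest-visible u32 field
(presumably the overflow safeguard you mention). A quick userspace sketch of
that identity, not kernel code:

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

int main(void)
{
        uint32_t mult = 1655736523, shift = 32; /* the pair from my test box below */
        uint64_t wide_mult = (uint64_t)mult << (32 - shift);
        int i;

        assert(wide_mult <= UINT32_MAX);        /* would need the fallback path otherwise */

        for (i = 0; i < 1000000; i++) {
                uint64_t tsc = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
                uint64_t tk = ((__uint128_t)tsc * mult) >> shift;
                uint64_t pv = ((__uint128_t)tsc * wide_mult) >> 32;

                assert(tk == pv);
        }
        return 0;
}

So if it still drifts on the happy path, the difference presumably comes from
the (mult, shift) pair itself differing from what the timekeeper uses for
CLOCK_MONOTONIC_RAW, not from the rewrite into pvclock form.
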

I gave that a go on my test box, and for a TSC frequency of 2593992 kHz
it got mult=1655736523, shift=32 and took the 'happy' path instead of
falling back.
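
(Those numbers look self-consistent, for what it's worth: 1655736523 / 2^32 is
about 0.385506 ns per tick, which matches 10^6 / 2593992; and since
shift == 32, BIT(32 - shift) == 1, so the guest sees tsc_shift = 0 with
tsc_to_system_mul = 1655736523 unchanged.)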

It still drifts about the same though, using the same test as before:
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/kvmclock


I was going to facetiously suggest that perhaps the kvmclock should
have leap nanoseconds... but then realised that that's basically what
Dongli's patch is *doing*. Maybe we just need to *recognise* that:
rather than having a user-configured period for the updates, KVM could
calculate the update frequency itself, from the rate at which the two
clocks would otherwise drift apart and a maximum tolerated delta. Not
my favourite option, but perhaps better than nothing?
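
Very roughly, something like the sketch below is what I have in mind. None of
it is existing KVM code; the names, the use of floating point, and the
assumption that kvmclock's tsc_shift is 0 are purely for illustration.

#include <math.h>
#include <stdint.h>

/*
 * Hypothetical helper: given the effective ns-per-tick rates of kvmclock
 * (tsc_shift assumed 0, so rate = pv_mult / 2^32) and of the timekeeper
 * (rate = tk_mult / 2^tk_shift), the two clocks diverge linearly, at
 * roughly tsc_hz * |rate_pv - rate_tk| nanoseconds per second of wall
 * time.  To keep the divergence under max_delta_ns, resync at least
 * every max_delta_ns / (that drift rate) seconds.
 */
static uint64_t sync_period_ms(uint64_t tsc_hz, uint32_t pv_mult,
                               uint32_t tk_mult, uint32_t tk_shift,
                               uint64_t max_delta_ns)
{
        double rate_pv = (double)pv_mult / 4294967296.0;
        double rate_tk = (double)tk_mult / (double)(1ULL << tk_shift);
        double drift_ns_per_sec = (double)tsc_hz * fabs(rate_pv - rate_tk);

        if (drift_ns_per_sec == 0.0)
                return UINT64_MAX;      /* rates agree; never needs a resync */

        return (uint64_t)(max_delta_ns / drift_ns_per_sec * 1000.0);
}

In the kernel it would obviously have to be fixed point rather than doubles,
but the point is that the period falls out of the two multipliers and the
tolerated delta; nobody should have to guess it on the command line.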
