linux-kernel - Re: [PATCH 02/10] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <0bb540291542fd37b2d0f6d39ebf1c9f35fa1ac3.camel@infradead.org>
Date: Mon, 22 Apr 2024 17:44:31 +0100
From: David Woodhouse <dwmw2@...radead.org>
To: Paolo Bonzini <pbonzini@...hat.com>
Cc: kvm@...r.kernel.org, Jonathan Corbet <corbet@....net>, Sean
 Christopherson <seanjc@...gle.com>, Thomas Gleixner <tglx@...utronix.de>,
 Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, Dave Hansen
 <dave.hansen@...ux.intel.com>, x86@...nel.org, "H. Peter Anvin"
 <hpa@...or.com>,  Paul Durrant <paul@....org>, Shuah Khan
 <shuah@...nel.org>, linux-doc@...r.kernel.org, 
 linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org, Oliver Upton
 <oliver.upton@...ux.dev>, Marcelo Tosatti <mtosatti@...hat.com>, 
 jalliste@...zon.co.uk, sveith@...zon.de
Subject: Re: [PATCH 02/10] KVM: x86: Improve accuracy of KVM clock when TSC
 scaling is in force

On Mon, 2024-04-22 at 17:54 +0200, Paolo Bonzini wrote:
> On Mon, Apr 22, 2024 at 5:39 PM David Woodhouse <dwmw2@...radead.org> wrote:
> 
> > > ... especially considering that you did use a 64-bit integer here
> > > (though---please use u64 not uint64_t; and BTW if you want to add a
> > > patch to change kvm_get_time_scale() to u64, please do.
> > 
> > Meh, I'm used to programming in C. Yes, I *am* old enough to have been
> > doing this since the last decade of the 1900s, but it *has* been a long
> > time since 1999, and my fingers have learned :)
> 
> Oh, I am on the same page (working on both QEMU and Linux, adapting my
> muscle memory to the context sucks) but u64/s64 is the preferred
> spelling and I have been asked to use them before.

Ever heard of Jury Nullification... ? :)

> > Heh, looks like it was you who made it uint64_t, in 2016. In a commit
> > (3ae13faac) which said "Prepare for improving the precision in the next
> > patch"... which never came, AFAICT?
> 
> Yes, it was posted as
> https://lore.kernel.org/lkml/1454944711-33022-5-git-send-email-pbonzini@redhat.com/
> but not committed.

Ah, thanks. So that isn't about arithmetic precision, but about dealing
with the mess we had where the KVM clock was afflicted by NTP skew.

We live in a much saner world now where it's simply based on the guest
TSC (or, in pathological cases, the host's CLOCK_MONOTONIC_RAW. 

> As an aside, we discovered later that the patch you list as "Fixes"
> fixed another tricky bug: before, kvmclock could jump if the TSC is
> set within the 250 ppm tolerance that does not activate TSC scaling.
> This is possible after a first live migration, and then the second
> live migration used the guest TSC frequency *that userspace desired*
> instead of the *actual* TSC frequency.

Given our saner world where the KVM clock now *isn't* adjusted for NTP
skew, that "bug" was probably better left unfixed. In fact, I may give
some thought to reverting commit 78db6a503796 completely.

Perhaps we should call that "definition E". I think we're up to five
now? But no, let's not add historical ones to the taxonomy :)

I believe Jack's KVM_SET_CLOCK_GUEST fixes the fundamental issue there
of the clock jumping on migration. It's just a special case of the
breakage in KVM_REQ_MASTERCLOCK_UPDATE, where the KVM clock has happily
been running as a direct function of the guest TSC, and when we yank it
back to some other definition when we create a new masterclock
reference point.

With KVM_SET_CLOCK_GUEST, the KVM clock is restored as a function of
the guest TSC, rather than based on realtime/CLOCK_MONOTONIC_RAW/etc.

So even though a new masterclock reference is taken in the new kernel
(and the scaling factors in the pvclock may be slightly different, as
we discussed in the comment you responded to), the ka->kvmclock_offset
is adjusted so that the value of the KVM clock at *that* moment when
the new reference is taken, is precisely what it would have been under
the old kvmclock regime for the contemporary guest TSC value.




> Before:
> 
>         this_tsc_khz = __this_cpu_read(cpu_tsc_khz);
>         if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
>                 tgt_tsc_khz = vcpu->virtual_tsc_khz;
>                 kvm_get_time_scale(NSEC_PER_SEC / 1000, tgt_tsc_khz,
>                         &vcpu->hv_clock.tsc_shift,
>                         &vcpu->hv_clock.tsc_to_system_mul);
>                 vcpu->hw_tsc_khz = this_tsc_khz;
> 
> After:
> 
>         tgt_tsc_khz = __this_cpu_read(cpu_tsc_khz);
> 
>         // tgt_tsc_khz unchanged because TSC scaling was not enabled
>         tgt_tsc_khz = kvm_scale_tsc(v, tgt_tsc_khz);
> 
>         if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
>                 kvm_get_time_scale(NSEC_PER_SEC / 1000, tgt_tsc_khz,
>                         &vcpu->hv_clock.tsc_shift,
>                         &vcpu->hv_clock.tsc_to_system_mul);
>                 vcpu->hw_tsc_khz = tgt_tsc_khz;
> 
> So in the first case kvm_get_time_scale uses vcpu->virtual_tsc_khz, in
> the second case it uses __this_cpu_read(cpu_tsc_khz).
> 
> This then caused a mismatch between the actual guest frequency and
> what is used by kvm_guest_time_update, which only becomes visible when
> migration resets the clock with KVM_GET/SET_CLOCK. KVM_GET_CLOCK
> returns what _should have been_ the same value read by the guest, but
> it's wrong.

Hm? Until I fixed it in this series, KVM_GET_CLOCK didn't even *try*
scaling via the guest's TSC frequency, did it? It just converted from
the *host* TSC to nanoseconds (since master_kernel_now) directly. Which
was even *more* broken :)


Download attachment "smime.p7s" of type "application/pkcs7-signature" (5965 bytes)