linux-kernel - Re: [PATCH 2/2] KVM: selftests: Add KVM/PV clock selftest to prove timer drift correction

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c12959cf6ca372569b4df10b2f9e272db1114ad1.camel@infradead.org>
Date: Tue, 09 Apr 2024 05:23:19 +0100
From: David Woodhouse <dwmw2@...radead.org>
To: Dongli Zhang <dongli.zhang@...cle.com>, Jack Allister
 <jalliste@...zon.com>,  Paolo Bonzini <pbonzini@...hat.com>, Jonathan
 Corbet <corbet@....net>, Sean Christopherson <seanjc@...gle.com>,  Thomas
 Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, Borislav
 Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
 x86@...nel.org, "H. Peter Anvin" <hpa@...or.com>,  Shuah Khan
 <shuah@...nel.org>
Cc: Paul Durrant <paul@....org>, kvm@...r.kernel.org,
 linux-doc@...r.kernel.org,  linux-kernel@...r.kernel.org,
 linux-kselftest@...r.kernel.org
Subject: Re: [PATCH 2/2] KVM: selftests: Add KVM/PV clock selftest to prove
 timer drift correction

On Mon, 2024-04-08 at 17:43 -0700, Dongli Zhang wrote:
> Hi Jack,
> 
> On 4/8/24 15:07, Jack Allister wrote:
> > This test proves that there is an inherent KVM/PV clock drift away from the
> > guest TSC when KVM decides to update the PV time information structure due
> > to a KVM_REQ_MASTERCLOCK_UPDATE. This drift is exascerbated when a guest is
> 
> Typo: exacerbated
> 
> > using TSC scaling and running at a different frequency to the host TSC [1].
> > It also proves that KVM_[GS]ET_CLOCK_GUEST API is working to mitigate the
> > drift from TSC to within ±1ns.

No, the KVM_[GS}ET_CLOCK_GUEST API is not about mitigating kernel bugs.

We *fix* kernel bugs, we don't make userspace work around them.

The KVM_[GS}ET_CLOCK_GUEST API allows userspace to perform accurate
live update and live migration without disrupting the relationship
between guest TSC and KVM clock.

Since a bombing run on KVM_REQ_MASTERCLOCK_UPDATE users is on my TODO
list, it's worth noting the *reason* that switching to the obsolete
MSR_KVM_SYSTEM_TIME forces ka->use_master_clock mode off. 

It's not documented at all as far as I can tell, but in commit
54750f2cf042 (“KVM: x86: workaround SuSE's 2.6.16 pvclock vs
masterclock issue“) in 2015, it was done to work around a guest bug
where the guest *expected* the reference point to keep being updated
and never be too far in the past.


> Is configure_scaled_tsc() anecessary? Or how about to make it an
> option/arg?
> Then I will be able to test it on a VM/server without TSC scaling.

As discussed, TSC scaling shouldn't be needed. It *should* be part of
the unit test if possible, because there is a class of bugs it'll
trigger, but it should be optional.

In particular, the existing KVM_GET_CLOCK will return extra-wrong
results if TSC scaling is in force. But that isn't being tested here
yet because we haven't *fixed* it yet :)

For reference, here's my TODO list which Jack is working from...

 • Add KVM unit test to validate that KVM clock does not change when
   provoked (including by simulated live update).
   It’s OK for the reference point at { tsc_timestamp, system_time } in
   the pvclock structure to change, but only such that it gives the
   same results for a given guest TSC — that is, if system_time
   changes, then tsc_timestamp must change by a delta which precisely
   corresponds in terms of the advertised guest TSC frequency. Perhaps
    allow a slop of 1ns for rounding, but no more.

 • Audit and fix (i.e. remove) KVM_REQ_MASTERCLOCK_UPDATE usage,
   starting with kvm_xen_shared_info_init(). And work out whether it
   should be sent to all vCPUs, as some call sites do, or just one?

 • Add KVM_VCPU_TSC_SCALE attribute to allow userspace to know the
   precise host→guest TSC scaling.
   (cf. https://lore.kernel.org/all/13f256ad95de186e3b6bcfcc1f88da5d0ad0cb71.camel@infradead.org/)

 • Expose guest’s view of KVM clock to userspace via KVM_GET_CLOCK_GUEST
   ioctl. Perhaps also a memory-mapped version, as the gfn_to_pfn_cache
   allows writing to userspace HVAs. With this, userspace has fast and
   accurate way to calculate the KVM clock at any given moment in time.
   (Currently, userspace calls the KVM_GET_CLOCK ioctl which is slow
   and returns inaccurate results). Then userspace can base other
   things like PIT and HPET emulation on the KVM clock and simplify
   timekeeping over migration for those too.

 • Add a KVM_SET_CLOCK_GUEST ioctl which consumes the pvclock
   information back again. This should not only set the kvmclock_offset
   field, but also set the reference point { master_cycle_now,
   master_kernel_ns } as follows:
     • Sample the kernel’s CLOCK_MONOTONIC_RAW to create a new
       master_kernel_ns and master_cycle_now.
     • Convert the new master_cycle_now to a guest TSC.
     • Calculate the intended KVM clock with that guest TSC from the
       provided pvclock information.
     • Calculate the current KVM clock with that guest TSC using the
       new master_cycle_now and master_kernel_ns and kvmclock_offset as
       usual.
     • Adjust kvmclock_offset to correct for the delta between current
       and intended values.
     • Raise KVM_REQ_CLOCK_UPDATE on all vCPUs.

 • Fix the broken __get_kvmclock() function to scale via the guest’s
   TSC frequency as it should. There isn’t necessarily a vCPU to use
   for this, so it’s OK for this to work only when the frequency has
   been set of the whole VM rather than only for individual vCPUs.
   Likewise kvm_get_wall_clock_epoch() which has the same bug.

 • Fix all other cases where KVM reads the time in two places
   separately and then treats them as simultaneous.

 • Fix the discontinuities in KVM_REQ_MASTERCLOCK_UPDATE by allowing
   kvmclock_offset to vary while the VM is running in master clock
   mode. Perhaps every call to pvclock_update_vm_gtod_copy() which
   starts in master clock mode should follow the same process as the
   proposed KVM_SET_CLOCK_GUEST to adjust the kvmclock_offset value
   which corresponds with the new reference point. As long as we don’t
   break in the case where something weird (host hibernation, etc.)
   happened to the TSC, and we actually want to trust kvmclock_offset.
   Maybe we should have a periodic work queue which keeps
   kvmclock_offset in sync with the KVM clock while the VM is in master
   clock mode? 

 • Correct the KVM documentation for TSC migration to take TSC
scaling
   into account. Something like...

   (SOURCE)
    • Sample both TAI and the (source) host TSC at an arbitrary time we
      shall call Tsrc:
      • Use adjtimex() to obtain tai_offset.
      • Use KVM_GET_CLOCK to read UTC time and host TSC (ignoring the
        actual kvm clock). These represent time Tsrc.
      • Use adjtimex() to obtain tai_offset again, looping back to the
        beginning if it changes.
      • Convert the UTC time to TAI by adding the tai_offset.

    • ∀ vCPU:
      • Read the scaling information with the KVM_CPU_TSC_SCALE
        attribute.
      • Read the offset with the KVM_CPU_TSC_OFFSET attribute.
      • Calculate this vCPU’s TSC at time Tsrc, from the host TSC
        value.

    • Use KVM_GET_CLOCK_GUEST to read the KVM clock (on vCPU0).

   (DESTINATION)
    • Sample both TAI and the (destination) host TSC at a time we shall
      call Tdst:
      • Use adjtimex() to obtain tai_offset.
      • Use KVM_GET_CLOCK to read UTC time and host TSC.
      • Use adjtimex() to obtain tai_offset again, looping back to the
        beginning if it changes.
      • Convert the UTC time to TAI by adding the tai_offset.

    • Calculate the time (in the TAI clock) elapsed between Tsrc and
      Tdst. Call this ΔT.

    • ∀ vCPU:
      • Calculate this vCPU’s intended TSC value at time Tdst:
        • Given this vCPU’s TSC frequency, calculate the number of TSC
          ticks correponding to ΔT.
        • Add this to the vCPU TSC value calculated on the source
        • Read the scaling information on the current host with the
          KVM_CPU_TSC_SCALE attribute
        • Calculate this vCPU’s scaled TSC value corresponding to the
          host TSC at time Tdst without taking offsetting into account.
        • Set KVM_CPU_TSC_OFFSET to the delta between that and the
          intended TSC value.

    • Use KVM_SET_CLOCK_GUEST to set the KVM clock (on vCPU0).





Download attachment "smime.p7s" of type "application/pkcs7-signature" (5965 bytes)