Message-ID: <4ed7a646-707a-40c9-93f9-2289fedf5709@oracle.com>
Date: Fri, 16 Jan 2026 01:31:06 -0800
From: Dongli Zhang <dongli.zhang@...cle.com>
To: David Woodhouse <dwmw2@...radead.org>, kvm@...r.kernel.org
Cc: seanjc@...gle.com, pbonzini@...hat.com, paul@....org, tglx@...nel.org,
        mingo@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com,
        x86@...nel.org, hpa@...or.com, linux-kernel@...r.kernel.org,
        joe.jin@...cle.com
Subject: Re: [PATCH 0/3] KVM: x86: Mitigate kvm-clock drift caused by
 masterclock update

Hi David,

On 1/15/26 1:13 PM, David Woodhouse wrote:
> On Thu, 2026-01-15 at 12:37 -0800, Dongli Zhang wrote:
>>
>> Please let me know if this is inappropriate and whether I should have
>> confirmed with you before reusing your code from the patch below, with your
>> authorship preserved.
>>
>> [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
>> https://lore.kernel.org/all/20240522001817.619072-11-dwmw2@infradead.org/
>>
>> The objective is to trigger a discussion on whether there is any quick,
>> short-term solution to mitigate the kvm-clock drift issue. We can also
>> resurrect your patchset.
>>
>> I have some other work in QEMU userspace.
>>
>> [PATCH 1/1] target/i386/kvm: account blackout downtime for kvm-clock and guest TSC
>> https://lore.kernel.org/qemu-devel/20251009095831.46297-1-dongli.zhang@oracle.com/
>>
>> The combination of changes in QEMU and this KVM patchset can make kvm-clock
>> drift during live migration very very trivial.
>>
>> Thank you very much!
> 
> Not at all inappropriate; thank you so much for updating it. I've been
> meaning to do so but it's never made it back to the top of my list.
> 
> I don't believe that the existing KVM_SET_CLOCK is viable though. The
> aim is that you should be able to create a new KVM on the same host and
> set the kvmclock, and the contents of the pvclock that the new guest
> sees should be *identical*. Not just 'close'.
> 
> I believe we need Jack's KVM_[GS]ET_CLOCK_GUEST for that to be
> feasible, so I'd very much prefer that any resurrection of this series
> should include that, even if some of the other patches are dropped for
> now.
> 
> Thanks again.

Thank you very much for the feedback.

The issue addressed by this patchset cannot be resolved by
KVM_[GS]ET_CLOCK_GUEST alone.

The problem I am trying to solve is avoiding unnecessary
KVM_REQ_MASTERCLOCK_UPDATE requests. Even when using KVM_[GS]ET_CLOCK_GUEST, if
vCPUs already have pending KVM_REQ_MASTERCLOCK_UPDATE requests, unpausing the
vCPUs from the host userspace VMM (i.e., QEMU) can still trigger multiple master
clock updates - typically proportional to the number of vCPUs.

As we know, each KVM_REQ_MASTERCLOCK_UPDATE can cause unexpected kvm-clock
forward/backward drift.

Therefore, rather than to KVM_[GS]ET_CLOCK_GUEST, this patchset is more closely
related to the following two of your patches, which define a new policy to
minimize KVM_REQ_MASTERCLOCK_UPDATE requests.

[RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
[RFC PATCH v3 15/21] KVM: x86: Allow KVM master clock mode when TSCs are offset
from each other


Consider the combination of QEMU and KVM. The following details explain the
problems I am trying to address.

(Assuming TSC scaling is *inactive*)


## Problem 1. Account the live migration downtime into kvm-clock and guest_tsc.

So far, QEMU/KVM live migration does not account for all of the elapsed blackout
downtime. For example, if a guest is live-migrated to a file, left idle for one
hour, and then restored from that file to the target host, the one-hour blackout
period will not be reflected in kvm-clock or the guest TSC.

This can be resolved by leveraging KVM_VCPU_TSC_CTRL and KVM_CLOCK_REALTIME in
QEMU. I have sent a QEMU patch (and just received your feedback on that thread).

[PATCH 1/1] target/i386/kvm: account blackout downtime for kvm-clock and guest TSC
https://lore.kernel.org/qemu-devel/20251009095831.46297-1-dongli.zhang@oracle.com/
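
At the KVM API level, the idea is roughly the following. This is only a
userspace sketch of the mechanism (not the actual QEMU change), assuming a
vm_fd with KVM_GET_CLOCK/KVM_SET_CLOCK and a kernel that reports
KVM_CLOCK_REALTIME in kvm_clock_data.flags; error handling is omitted:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Source host: capture kvm-clock together with the host realtime clock. */
static int save_clock(int vm_fd, struct kvm_clock_data *saved)
{
        return ioctl(vm_fd, KVM_GET_CLOCK, saved);
}

/* Target host: hand the (clock, realtime) pair back.  With
 * KVM_CLOCK_REALTIME set, KVM adds the wall-clock time that elapsed since
 * the save (i.e., the blackout downtime) to the restored kvm-clock,
 * assuming the two hosts' realtime clocks are synchronized. */
static int restore_clock(int vm_fd, const struct kvm_clock_data *saved)
{
        struct kvm_clock_data set = { 0 };

        set.clock    = saved->clock;
        set.flags    = saved->flags & KVM_CLOCK_REALTIME;
        set.realtime = saved->realtime;

        return ioctl(vm_fd, KVM_SET_CLOCK, &set);
}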


## Problem 2. The kvm-clock drifts due to changes in the PVTI data.

Unlike the previous vCPU hotplug-related kvm-clock drift issue, during live
migration the amount of drift is not determined by the time elapsed between two
masterclock updates. Instead, it occurs because guest_clock and guest_tsc are
not stopped or resumed at the same point in time.

For example, MSR_IA32_TSC and KVM_GET_CLOCK are used to save guest_tsc and
guest_clock on the source host. This is effectively equivalent to stopping their
counters. However, they are not stopped simultaneously: guest_tsc stops at time
point P1, while guest_clock stops at time point P2.

- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=0 ===> P1
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=1
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=2
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=3
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=4
... ...
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=N
- KVM_GET_CLOCK                               ===> P2

On the target host, QEMU restores the saved values using MSR_IA32_TSC and
KVM_SET_CLOCK. As a result, guest_tsc resumes counting at time point P3, while
guest_clock resumes counting at time point P4.

- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=1 ===> P3
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=2
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=3
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=4
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=5
... ...
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=N
- KVM_SET_CLOCK                               ===> P4
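
Both sequences are driven by userspace ioctls roughly like the sketch below
(illustration only, simplified from what a VMM does; error handling omitted).
The per-vCPU MSR accesses and the single VM-wide clock ioctl necessarily happen
at different instants, which is exactly the P1/P2 gap on the save path and,
symmetrically with KVM_SET_MSRS/KVM_SET_CLOCK, the P3/P4 gap on the restore
path:

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <string.h>

#define MSR_IA32_TSC 0x10        /* architectural MSR index */

/* One KVM_GET_MSRS call per vCPU: these reads land at P1, P1 + delta, ... */
static uint64_t read_guest_tsc(int vcpu_fd)
{
        union {
                struct kvm_msrs hdr;
                char space[sizeof(struct kvm_msrs) +
                           sizeof(struct kvm_msr_entry)];
        } buf;

        memset(&buf, 0, sizeof(buf));
        buf.hdr.nmsrs = 1;
        buf.hdr.entries[0].index = MSR_IA32_TSC;
        ioctl(vcpu_fd, KVM_GET_MSRS, &buf.hdr);
        return buf.hdr.entries[0].data;
}

/* A single VM-wide KVM_GET_CLOCK call: this read lands at P2. */
static uint64_t read_kvmclock(int vm_fd)
{
        struct kvm_clock_data data;

        memset(&data, 0, sizeof(data));
        ioctl(vm_fd, KVM_GET_CLOCK, &data);
        return data.clock;
}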


Therefore, below are the equations I use to calculate the expected kvm-clock drift.

T1_ns  = P2 - P1 (nanoseconds)
T2_tsc = P4 - P3 (cycles)
T2_ns  = pvclock_scale_delta(T2_tsc,
                             hv_clock_src.tsc_to_system_mul,
                             hv_clock_src.tsc_shift)

if (T2_ns > T1_ns)
    backward drift: T2_ns - T1_ns
else if (T1_ns > T2_ns)
    forward drift: T1_ns - T2_ns
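
For a self-contained illustration of this calculation, the sketch below
reimplements in userspace the same shift-then-multiply math that
pvclock_scale_delta() performs. The tsc_to_system_mul/tsc_shift values match
the PVTI dump further below; T1_ns and T2_tsc are made-up placeholders, not
measurements:

#include <stdint.h>
#include <stdio.h>

/* Same math as pvclock_scale_delta(): shift the TSC delta, then apply the
 * 32.32 fixed-point multiplier (128-bit arithmetic avoids overflow). */
static uint64_t scale_delta(uint64_t delta, uint32_t mul_frac, int shift)
{
        if (shift < 0)
                delta >>= -shift;
        else
                delta <<= shift;
        return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
}

int main(void)
{
        uint64_t t1_ns  = 50000;        /* P2 - P1, ns (placeholder)     */
        uint64_t t2_tsc = 150000;       /* P4 - P3, cycles (placeholder) */
        uint32_t mul    = 3186238974u;  /* tsc_to_system_mul             */
        int      shift  = -1;           /* tsc_shift                     */
        uint64_t t2_ns  = scale_delta(t2_tsc, mul, shift);

        if (t2_ns > t1_ns)
                printf("backward drift: %llu ns\n",
                       (unsigned long long)(t2_ns - t1_ns));
        else
                printf("forward drift: %llu ns\n",
                       (unsigned long long)(t1_ns - t2_ns));
        return 0;
}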


To fix this issue, ideally both guest_tsc and guest_clock should be stopped and
resumed at exactly the same time.

As you mentioned in the QEMU patch, "the kvmclock should be a fixed relationship
from the guest's TSC which doesn't change for the whole lifetime of the guest."

Fortunately, taking advantage of KVM_VCPU_TSC_CTRL and KVM_CLOCK_REALTIME in
QEMU can achieve the same goal.

[PATCH 1/1] target/i386/kvm: account blackout downtime for kvm-clock and guest TSC
https://lore.kernel.org/qemu-devel/20251009095831.46297-1-dongli.zhang@oracle.com/
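
On the TSC side, the QEMU patch relies on the KVM_VCPU_TSC_CTRL attribute
group. Below is a minimal sketch (not the actual QEMU code; it assumes vcpu_fd
came from KVM_CREATE_VCPU and that the kernel exposes KVM_VCPU_TSC_OFFSET) of
how the per-vCPU TSC offset can be programmed via that group, rather than
writing MSR_IA32_TSC on each vCPU at slightly different times:

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdint.h>

static int set_tsc_offset(int vcpu_fd, int64_t offset)
{
        struct kvm_device_attr attr = {
                .group = KVM_VCPU_TSC_CTRL,
                .attr  = KVM_VCPU_TSC_OFFSET,
                .addr  = (uint64_t)(uintptr_t)&offset,
        };

        /* Fails with ENXIO on kernels without KVM_VCPU_TSC_CTRL; callers
         * should probe with KVM_HAS_DEVICE_ATTR first. */
        return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
}

The matching read side uses KVM_GET_DEVICE_ATTR with the same group and
attribute.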


## Problem 3. Unfortunately, unnecessary KVM_REQ_MASTERCLOCK_UPDATE requests are
being triggered for the vCPUs.

During kvm_synchronize_tsc() or kvm_arch_tsc_set_attr(KVM_VCPU_TSC_OFFSET),
KVM_REQ_MASTERCLOCK_UPDATE requests may be set either before or after KVM_SET_CLOCK.

As a result, once all vCPUs are unpaused, these unnecessary
KVM_REQ_MASTERCLOCK_UPDATE requests can lead to kvm-clock drift.

Indeed, PATCH 1 and PATCH 3 of this patch set are by themselves sufficient to
mitigate the issue.


With the above changes in both QEMU and KVM, a same-host live migration of a
4-vCPU VM with approximately 10 seconds of downtime (introduced on purpose)
results in only about 4 nanoseconds of backward drift in my test environment. We
may even be able to make further improvements in QEMU to eliminate the remaining
4 nanoseconds.

old_clock->tsc_timestamp = 32041800585
old_clock->system_time = 3639151
old_clock->tsc_to_system_mul = 3186238974
old_clock->tsc_shift = -1

new_clock->tsc_timestamp = 213016088950
new_clock->system_time = 67131895453
new_clock->tsc_to_system_mul = 3186238974
new_clock->tsc_shift = -1

If I do not introduce the ~10 seconds of downtime on purpose during live
migration, the drift is always 0 nanoseconds.

I introduce downtime on purpose by stopping the target QEMU before live
migration. The target QEMU will not resume until the 'cont' command is issued in
the QEMU monitor.


Regarding the goal, I would appreciate any quick solution (even a short-term
one) or half-measure that can:

- Account for live migration downtime.
- Minimize kvm-clock drift (especially backward).

Thank you very much!

Dongli Zhang

