Message-ID: <4ed7a646-707a-40c9-93f9-2289fedf5709@oracle.com>
Date: Fri, 16 Jan 2026 01:31:06 -0800
From: Dongli Zhang <dongli.zhang@...cle.com>
To: David Woodhouse <dwmw2@...radead.org>, kvm@...r.kernel.org
Cc: seanjc@...gle.com, pbonzini@...hat.com, paul@....org, tglx@...nel.org,
mingo@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com,
x86@...nel.org, hpa@...or.com, linux-kernel@...r.kernel.org,
joe.jin@...cle.com
Subject: Re: [PATCH 0/3] KVM: x86: Mitigate kvm-clock drift caused by
masterclock update
Hi David,
On 1/15/26 1:13 PM, David Woodhouse wrote:
> On Thu, 2026-01-15 at 12:37 -0800, Dongli Zhang wrote:
>>
>> Please let me know if this is inappropriate and whether I should have
>> confirmed with you before reusing your code from the patch below, with your
>> authorship preserved.
>>
>> [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
>> https://lore.kernel.org/all/20240522001817.619072-11-dwmw2@infradead.org/
>>
>> The objective is to trigger a discussion on whether there is any quick,
>> short-term solution to mitigate the kvm-clock drift issue. We can also
>> resurrect your patchset.
>>
>> I have some other work in QEMU userspace.
>>
>> [PATCH 1/1] target/i386/kvm: account blackout downtime for kvm-clock and guest TSC
>> https://lore.kernel.org/qemu-devel/20251009095831.46297-1-dongli.zhang@oracle.com/
>>
>> The combination of changes in QEMU and this KVM patchset can make kvm-clock
>> drift during live migration very very trivial.
>>
>> Thank you very much!
>
> Not at all inappropriate; thank you so much for updating it. I've been
> meaning to do so but it's never made it back to the top of my list.
>
> I don't believe that the existing KVM_SET_CLOCK is viable though. The
> aim is that you should be able to create a new KVM on the same host and
> set the kvmclock, and the contents of the pvclock that the new guest
> sees should be *identical*. Not just 'close'.
>
> I believe we need Jack's KVM_[GS]ET_CLOCK_GUEST for that to be
> feasible, so I'd very much prefer that any resurrection of this series
> should include that, even if some of the other patches are dropped for
> now.
>
> Thanks again.
Thank you very much for the feedback.
The issue addressed by this patchset cannot be resolved by
KVM_[GS]ET_CLOCK_GUEST alone.
The problem I am trying to solve is avoiding unnecessary
KVM_REQ_MASTERCLOCK_UPDATE requests. Even when using KVM_[GS]ET_CLOCK_GUEST, if
vCPUs already have pending KVM_REQ_MASTERCLOCK_UPDATE requests, unpausing the
vCPUs from the host userspace VMM (i.e., QEMU) can still trigger multiple
masterclock updates, the number of which is typically proportional to the
number of vCPUs.
As we know, each KVM_REQ_MASTERCLOCK_UPDATE can cause unexpected forward/backward
kvm-clock drift.
Therefore, this patchset is less about KVM_[GS]ET_CLOCK_GUEST and more closely
related to the following two of your patches; it defines a new policy to
minimize KVM_REQ_MASTERCLOCK_UPDATE requests.
[RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
[RFC PATCH v3 15/21] KVM: x86: Allow KVM master clock mode when TSCs are offset
from each other
Consider the combination of QEMU and KVM. The following details explain the
problems I am trying to address.
(Assuming TSC scaling is *inactive*)
## Problem 1. Account the live migration downtime in kvm-clock and guest_tsc.
So far, QEMU/KVM live migration does not account for the entire blackout downtime.
For example, if a guest is live-migrated to a file, left idle for one hour, and
then restored from that file to the target host, the one-hour blackout period
will not be reflected in the kvm-clock or guest TSC.
This can be resolved by leveraging KVM_VCPU_TSC_CTRL and KVM_CLOCK_REALTIME in
QEMU. I have sent a QEMU patch (and just received your feedback on that thread).
[PATCH 1/1] target/i386/kvm: account blackout downtime for kvm-clock and guest TSC
https://lore.kernel.org/qemu-devel/20251009095831.46297-1-dongli.zhang@oracle.com/
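For illustration only, here is a rough, hypothetical sketch of the idea (it is
not the actual QEMU patch): capture the (kvmclock, host realtime) pair via
KVM_GET_CLOCK on the source, then on the target add the elapsed realtime,
i.e. the blackout window, back into the kvmclock and into each vCPU's TSC
offset via the KVM_VCPU_TSC_CTRL attribute. It assumes the normal migration
path has already restored the other clock/TSC state, that vm_fd/vcpu_fd/tsc_khz
come from the usual KVM setup, and that the kernel headers are new enough to
have the realtime field in struct kvm_clock_data; error handling is minimal.

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Save the (kvmclock, host realtime) pair on the source host. */
static int save_clock(int vm_fd, struct kvm_clock_data *out)
{
	memset(out, 0, sizeof(*out));
	if (ioctl(vm_fd, KVM_GET_CLOCK, out) < 0)
		return -1;
	/* Only usable if the host provides a realtime cross-timestamp. */
	return (out->flags & KVM_CLOCK_REALTIME) ? 0 : -1;
}

/* On the target host, fold the blackout window back into the kvmclock
 * and into the vCPU's TSC offset. */
static int restore_clock(int vm_fd, int vcpu_fd, uint64_t tsc_khz,
			 const struct kvm_clock_data *saved)
{
	struct kvm_clock_data now = {0}, set = {0};
	uint64_t tsc_offset, elapsed_ns, elapsed_cycles;
	struct kvm_device_attr attr = {
		.group = KVM_VCPU_TSC_CTRL,
		.attr  = KVM_VCPU_TSC_OFFSET,
		.addr  = (uint64_t)(uintptr_t)&tsc_offset,
	};

	if (ioctl(vm_fd, KVM_GET_CLOCK, &now) < 0 ||
	    !(now.flags & KVM_CLOCK_REALTIME))
		return -1;

	/* Blackout window measured in host realtime. */
	elapsed_ns = now.realtime - saved->realtime;
	/* ns -> guest TSC cycles; may overflow for extremely long windows. */
	elapsed_cycles = elapsed_ns * tsc_khz / 1000000ULL;

	/* Advance the kvmclock by the blackout window. */
	set.clock = saved->clock + elapsed_ns;
	if (ioctl(vm_fd, KVM_SET_CLOCK, &set) < 0)
		return -1;

	/* Advance the guest TSC by the same window via the offset attribute. */
	if (ioctl(vcpu_fd, KVM_GET_DEVICE_ATTR, &attr) < 0)
		return -1;
	tsc_offset += elapsed_cycles;
	return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
}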
## Problem 2. The kvm-clock drifts due to changes in the PVTI data.
Unlike the previous vCPU hotplug-related kvm-clock drift issue, during live
migration the amount of drift is not determined by the time elapsed between two
masterclock updates. Instead, it occurs because guest_clock and guest_tsc are
not stopped or resumed at the same point in time.
For example, MSR_IA32_TSC and KVM_GET_CLOCK are used to save guest_tsc and
guest_clock on the source host. This is effectively equivalent to stopping their
counters. However, they are not stopped simultaneously: guest_tsc stops at time
point P1, while guest_clock stops at time point P2.
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=0 ===> P1
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=1
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=2
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=3
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=4
... ...
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=N
- KVM_GET_CLOCK ===> P2
On the target host, QEMU restores the saved values using MSR_IA32_TSC and
KVM_SET_CLOCK. As a result, guest_tsc resumes counting at time point P3, while
guest_clock resumes counting at time point P4.
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=1 ===> P3
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=2
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=3
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=4
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=5
... ...
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=N
- KVM_SET_CLOCK ====> P4
Therefore, below are the equations I use to calculate the expected kvm-clock drift.
T1_ns = P2 - P1 (nanoseconds)
T2_tsc = P4 - P3 (cycles)
T2_ns = pvclock_scale_delta(T2_tsc,
hv_clock_src.tsc_to_system_mul,
hv_clock_src.tsc_shift)
if (T2_ns > T1_ns)
backward drift: T2_ns - T1_ns
else if (T1_ns > T2_ns)
forward drift: T1_ns - T2_ns
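For illustration, here is a minimal self-contained C sketch of that calculation.
scale_delta() simply mirrors the fixed-point math of the kernel's
pvclock_scale_delta() (pre-shift the TSC delta, then multiply by the 32.32
fixed-point factor); the P1..P4 sample values are invented, and the mul/shift
values just reuse the numbers from the PVTI dump later in this mail.

#include <stdint.h>
#include <stdio.h>

/* Same math as the kernel's pvclock_scale_delta(): pre-shift the TSC
 * delta, then multiply by the 32.32 fixed-point factor and keep the
 * integer nanoseconds. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul_frac, int shift)
{
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;
	return (uint64_t)(((__uint128_t)delta * mul_frac) >> 32);
}

int main(void)
{
	/* Invented sample points: P1/P2 in ns on the source,
	 * P3/P4 in TSC cycles on the target. */
	uint64_t P1 = 0, P2 = 1000;
	uint64_t P3 = 0, P4 = 2707;
	uint32_t tsc_to_system_mul = 3186238974u; /* example values from */
	int tsc_shift = -1;                       /* the PVTI dump below */

	uint64_t T1_ns = P2 - P1;
	uint64_t T2_ns = scale_delta(P4 - P3, tsc_to_system_mul, tsc_shift);

	if (T2_ns > T1_ns)
		printf("backward drift: %llu ns\n",
		       (unsigned long long)(T2_ns - T1_ns));
	else
		printf("forward drift: %llu ns\n",
		       (unsigned long long)(T1_ns - T2_ns));
	return 0;
}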
To fix this issue, ideally both guest_tsc and guest_clock should be stopped and
resumed at exactly the same time.
As you mentioned in the QEMU patch, "the kvmclock should be a fixed relationship
from the guest's TSC which doesn't change for the whole lifetime of the guest."
Fortunately, taking advantage of KVM_VCPU_TSC_CTRL and KVM_CLOCK_REALTIME in
QEMU can achieve the same goal.
[PATCH 1/1] target/i386/kvm: account blackout downtime for kvm-clock and guest TSC
https://lore.kernel.org/qemu-devel/20251009095831.46297-1-dongli.zhang@oracle.com/
## Problem 3. Unfortunately, unnecessary KVM_REQ_MASTERCLOCK_UPDATE requests are
being triggered for the vCPUs.
During kvm_synchronize_tsc() or kvm_arch_tsc_set_attr(KVM_VCPU_TSC_OFFSET),
KVM_REQ_MASTERCLOCK_UPDATE requests may be set either before or after KVM_SET_CLOCK.
As a result, once all vCPUs are unpaused, these unnecessary
KVM_REQ_MASTERCLOCK_UPDATE requests can lead to kvm-clock drift.
Indeed, PATCH 1 and PATCH 3 from this patch set are by themselves sufficient to
mitigate the issue.
With the above changes in both QEMU and KVM, a same-host live migration of a
4-vCPU VM with approximately 10 seconds of downtime (introduced on purpose)
results in only about 4 nanoseconds of backward drift in my test environment. We
may even be able to make further improvements in QEMU to eliminate the remaining
4 nanoseconds.
old_clock->tsc_timestamp = 32041800585
old_clock->system_time = 3639151
old_clock->tsc_to_system_mul = 3186238974
old_clock->tsc_shift = -1
new_clock->tsc_timestamp = 213016088950
new_clock->system_time = 67131895453
new_clock->tsc_to_system_mul = 3186238974
new_clock->tsc_shift = -1
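(As a rough sanity check of the ~4 ns figure, using the same scaling math as in
the sketch above: the tsc_timestamp delta is 213016088950 - 32041800585 =
180974288365 cycles, which with tsc_to_system_mul = 3186238974 and tsc_shift = -1
scales to roughly 67128256298 ns, while the system_time delta is 67131895453 -
3639151 = 67128256302 ns. The two differ by about 4 ns, consistent with the
measured drift; treat this back-of-the-envelope arithmetic as illustrative only.)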
If I do not introduce the ~10 seconds of downtime on purpose during live
migration, the drift is always 0 nanoseconds.
I introduce downtime on purpose by stopping the target QEMU before live
migration. The target QEMU will not resume until the 'cont' command is issued in
the QEMU monitor.
Regarding the goal, I would appreciate any quick solution (even a short-term
one) or half-measure that supports the following:
- Account for live migration downtime.
- Minimize kvm-clock drift (especially backward).
Thank you very much!
Dongli Zhang