linux-kernel - Re: [PATCH 1/2] KVM: x86: Add KVM_[GS]ET_CLOCK

Message-ID: <1195c194-a2cc-6793-009c-376f091be7f0@oracle.com>
Date: Mon, 8 Apr 2024 17:34:12 -0700
From: Dongli Zhang <dongli.zhang@...cle.com>
To: Jack Allister <jalliste@...zon.com>, Paolo Bonzini <pbonzini@...hat.com>,
        Jonathan Corbet <corbet@....net>,
        Sean Christopherson <seanjc@...gle.com>,
        Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
        Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
        "H. Peter Anvin" <hpa@...or.com>
Cc: David Woodhouse <dwmw2@...radead.org>, Paul Durrant <paul@....org>,
        kvm@...r.kernel.org, linux-doc@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for KVM clock
 drift fixup

Hi Jack,

On 4/8/24 15:07, Jack Allister wrote:
> There is a potential for drift between the TSC and a KVM/PV clock when the
> guest TSC is scaled (as seen previously in [1]). Which fixed drift between
> timers over the lifetime of a VM.

Those patches mention "TSC scaling" multiple times. Is TSC scaling necessary to
reproduce this issue? I do not think so. TSC scaling may speed up the drift, but
it is not the root cause.

How about citing the patch below at the beginning? It only *avoids*
KVM_REQ_MASTERCLOCK_UPDATE in some situations, but never solves the problem
when KVM_REQ_MASTERCLOCK_UPDATE is triggered ... therefore we need this
patchset ...

KVM: x86: Don't unnecessarily force masterclock update on vCPU hotplug
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c52ffadc65e28ab461fd055e9991e8d8106a0056

I think this patch is only closely related to KVM_REQ_MASTERCLOCK_UPDATE, not
TSC scaling.

> 
> However, there is another factor which will cause a drift. In a situation
> such as a kexec/live-update of the kernel or a live-migration of a VM the
> PV clock information is recalculated by KVM (KVM_REQ_MASTERCLOCK_UPDATE).
> This update samples a new system_time & tsc_timestamp to be used in the
> structure.
> 
> For example, when a guest is running with a TSC frequency of 1.5GHz but the
> host frequency is 3.0GHz upon an update of the PV time information a delta
> of ~3500ns is observed between the TSC and the KVM/PV clock. There is no
> reason why a fixup creating an accuracy of ±1ns cannot be achieved.

Same as above. I think the key is to explain the issue when
KVM_REQ_MASTERCLOCK_UPDATE is triggered, not to emphasize the TSC scaling.
Please correct me if I am wrong.
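
To make the precision point concrete, here is a minimal userspace sketch (not
kernel code) of the multiply-and-shift conversion from TSC ticks to
nanoseconds. The helper names mul_for_hz() and ticks_to_ns() are made up for
illustration, and the sketch assumes a shift of 0 and a TSC frequency above
1 GHz; KVM's real computation also selects a shift. It uses the GCC/Clang
__int128 extension for the wide multiply.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Hypothetical helper: 32.32 fixed-point "ns per tick" multiplier,
 * truncated by the integer divide. Only valid for tsc_hz > 1 GHz,
 * otherwise the result does not fit in 32 bits.
 */
static uint32_t mul_for_hz(uint64_t tsc_hz)
{
	return (uint32_t)((NSEC_PER_SEC << 32) / tsc_hz);
}

/* Hypothetical helper: convert ticks to ns with multiply and shift. */
static uint64_t ticks_to_ns(uint64_t ticks, uint32_t mul)
{
	/* 128-bit intermediate so the multiply cannot overflow. */
	return (uint64_t)(((unsigned __int128)ticks * mul) >> 32);
}
```

With a 1.5GHz TSC the multiplier is truncated, so one second worth of ticks
converts to slightly less than 1000000000ns, with no TSC scaling involved. A
2.0GHz TSC happens to convert exactly. If the host clock does not share this
rounding, a resample of system_time and tsc_timestamp would show the
accumulated difference as a step.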

> 
> Additional interfaces are added to retrieve & fixup the PV time information
> when a VMM may believe is appropriate (deserialization after live-update/
> migration). KVM_GET_CLOCK_GUEST can be used for the VMM to retrieve the
> currently used PV time information and then when the VMM believes a drift
> may occur can then instruct KVM to perform a correction via the setter
> KVM_SET_CLOCK_GUEST.
> 
> The KVM_SET_CLOCK_GUEST ioctl works under the following premise. The host
> TSC & kernel timstamp are sampled at a singular point in time. Using the

Typo: "timstamp"

> already known scaling/offset for L1 the guest TSC is then derived from this

I assume you meant deriving the guest TSC from the TSC offset/scaling, not
deriving kvmclock. What does "TSC & kernel timstamp" mean?
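
For reference, my understanding of "derive the guest TSC from the L1
offset/scaling" can be sketched in userspace as below. This is only an
illustration: the function names are made up, and the ratio is assumed to be
a fixed-point multiplier with frac_bits fractional bits (48 on VMX and 32 on
SVM, as far as I know). It relies on the GCC/Clang __int128 extension.

```c
#include <stdint.h>

/*
 * Illustrative only: scale a host TSC sample by a fixed-point ratio
 * that has "frac_bits" fractional bits.
 */
static uint64_t scale_tsc(uint64_t host_tsc, uint64_t ratio,
			  unsigned int frac_bits)
{
	/* 128-bit intermediate so the multiply cannot overflow. */
	return (uint64_t)(((unsigned __int128)host_tsc * ratio) >> frac_bits);
}

/*
 * Illustrative only: guest TSC = scaled host TSC + offset. The offset
 * is treated as a two's complement value, so a "negative" offset simply
 * wraps around in 64-bit arithmetic.
 */
static uint64_t guest_tsc_from_host(uint64_t host_tsc, uint64_t ratio,
				    unsigned int frac_bits,
				    uint64_t tsc_offset)
{
	return scale_tsc(host_tsc, ratio, frac_bits) + tsc_offset;
}
```

For the 3.0GHz host / 1.5GHz guest example in the commit message, the ratio
would be 0.5, i.e. 1ULL << 47 with 48 fractional bits.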

> information.
> 
> From here two PV time information structures are created, one which is the
> original time information structure prior to whatever may have caused a PV
> clock re-calculation (live-update/migration). The second is then using the
> singular point in time sampled just prior. An individual KVM/PV clock for
> each of the PV time information structures using the singular guest TSC.
> 
> A delta is then determined between the two calculated PV times, which is
> then used as a correction offset added onto the kvmclock_offset for the VM.
> 
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=451a707813ae
> 
> Suggested-by: David Woodhouse <dwmw2@...radead.org>
> Signed-off-by: Jack Allister <jalliste@...zon.com>
> CC: Paul Durrant <paul@....org>
> ---
>  Documentation/virt/kvm/api.rst | 43 +++++++++++++++++
>  arch/x86/kvm/x86.c             | 87 ++++++++++++++++++++++++++++++++++
>  include/uapi/linux/kvm.h       |  3 ++
>  3 files changed, 133 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 0b5a33ee71ee..5f74d8ac1002 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6352,6 +6352,49 @@ a single guest_memfd file, but the bound ranges must not overlap).
>  
>  See KVM_SET_USER_MEMORY_REGION2 for additional details.
>  
> +4.143 KVM_GET_CLOCK_GUEST
> +----------------------------
> +
> +:Capability: none
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct pvclock_vcpu_time_info (out)
> +:Returns: 0 on success, <0 on error
> +
> +Retrieves the current time information structure used for KVM/PV clocks.
> +On x86 a PV clock is derived from the current TSC and is then scaled based
> +upon the a specified multiplier and shift. The result of this is then added
> +to a system time.

Typo: "the a".

> +
> +The guest needs a way to determine the system time, multiplier and shift. This
> +can be done by multiple ways, for KVM guests this can be via an MSR write to
> +MSR_KVM_SYSTEM_TIME / MSR_KVM_SYSTEM_TIME_NEW which defines the guest physical
> +address KVM shall put the structure. On Xen guests this can be found in the Xen
> +vcpu_info.
> +
> +This is structure is useful information for a VMM to also know when taking into
> +account potential timer drift on live-update/migration.
> +
> +4.144 KVM_SET_CLOCK_GUEST
> +----------------------------
> +
> +:Capability: none
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct pvclock_vcpu_time_info (in)
> +:Returns: 0 on success, <0 on error
> +
> +Triggers KVM to perform a correction of the KVM/PV clock structure based upon a
> +known prior PV clock structure (see KVM_GET_CLOCK_GUEST).
> +
> +If a VM is utilizing TSC scaling there is a potential for a drift between the
> +KVM/PV clock and the TSC itself. This is due to the loss of precision when
> +performing a multiply and shift rather than divide for the TSC.
> +
> +To perform the correction a delta is calculated between the original time info
> +(which is assumed correct) at a singular point in time X. The KVM clock offset
> +is then offset by this delta.
> +
>  5. The kvm_run structure
>  ========================
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 47d9f03b7778..5d2e10cd1c30 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6988,6 +6988,87 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
>  	return 0;
>  }
>  
> +static struct kvm_vcpu *kvm_get_bsp_vcpu(struct kvm *kvm)
> +{
> +	struct kvm_vcpu *vcpu = NULL;
> +	int i;
> +
> +	for (i = 0; i < KVM_MAX_VCPUS; i++) {
> +		vcpu = kvm_get_vcpu_by_id(kvm, i);
> +		if (!vcpu)
> +			continue;
> +
> +		if (kvm_vcpu_is_reset_bsp(vcpu))
> +			break;
> +	}
> +
> +	return vcpu;
> +}

Would the above rely not only on the TSC clocksource, but also on
ka->use_master_clock==true?

	ka->use_master_clock = host_tsc_clocksource && vcpus_matched
				&& !ka->backwards_tsc_observed
				&& !ka->boot_vcpu_runs_old_kvmclock;

Should the condition (ka->use_master_clock==true) be checked in the ioctl?

> +
> +static int kvm_vm_ioctl_get_clock_guest(struct kvm *kvm, void __user *argp)
> +{
> +	struct kvm_vcpu *vcpu;
> +
> +	vcpu = kvm_get_bsp_vcpu(kvm);
> +	if (!vcpu)
> +		return -EINVAL;
> +
> +	if (!vcpu->arch.hv_clock.tsc_timestamp || !vcpu->arch.hv_clock.system_time)
> +		return -EIO;
> +
> +	if (copy_to_user(argp, &vcpu->arch.hv_clock, sizeof(vcpu->arch.hv_clock)))
> +		return -EFAULT;

What will happen if the vCPU 0 (e.g., BSP) thread is racing with this code to
update vcpu->arch.hv_clock?

Is it a good idea to make assumptions about the VMM (e.g., QEMU) side?

> +
> +	return 0;
> +}
> +
> +static int kvm_vm_ioctl_set_clock_guest(struct kvm *kvm, void __user *argp)
> +{
> +	struct kvm_vcpu *vcpu;
> +	struct pvclock_vcpu_time_info orig_pvti;
> +	struct pvclock_vcpu_time_info dummy_pvti;
> +	int64_t kernel_ns;
> +	uint64_t host_tsc, guest_tsc;
> +	uint64_t clock_orig, clock_dummy;
> +	int64_t correction;
> +	unsigned long i;

Please ignore me if there is no chance of making the above (and other places
in the patchset) honor the reverse xmas tree style.

> +
> +	vcpu = kvm_get_bsp_vcpu(kvm);
> +	if (!vcpu)
> +		return -EINVAL;
> +
> +	if (copy_from_user(&orig_pvti, argp, sizeof(orig_pvti)))
> +		return -EFAULT;
> +
> +	/*
> +	 * Sample the kernel time and host TSC at a singular point.
> +	 * We then calculate the guest TSC using this exact point in time,
> +	 * From here we can then determine the delta using the
> +	 * PV time info requested from the user and what we currently have
> +	 * using the fixed point in time. This delta is then used as a
> +	 * correction factor to fixup the potential drift.
> +	 */
> +	if (!kvm_get_time_and_clockread(&kernel_ns, &host_tsc))
> +		return -EFAULT;
> +
> +	guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
> +
> +	dummy_pvti = orig_pvti;
> +	dummy_pvti.tsc_timestamp = guest_tsc;
> +	dummy_pvti.system_time = kernel_ns + kvm->arch.kvmclock_offset;
> +
> +	clock_orig = __pvclock_read_cycles(&orig_pvti, guest_tsc);
> +	clock_dummy = __pvclock_read_cycles(&dummy_pvti, guest_tsc);
> +
> +	correction = clock_orig - clock_dummy;
> +	kvm->arch.kvmclock_offset += correction;

I am not sure it is a good idea to rely on the userspace VMM to pick a good
point in time to issue the ioctl and to assume there is no racing.

In addition to live migration, can the user call this API at any time while the
VM is running (to fix the clock drift)? If so, is there any requirement to
protect the update of kvmclock_offset from racing?
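
To check my reading of the correction itself, here is a simplified userspace
model of the calculation, as I understand it. The struct pvti below is a
made-up local type holding only the fields used here (the real
pvclock_vcpu_time_info also has version, flags and padding), and the function
names are hypothetical. It uses the GCC/Clang __int128 extension.

```c
#include <stdint.h>

/* Simplified local stand-in for the PV time info; not the real ABI struct. */
struct pvti {
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t   tsc_shift;
};

/* PV clock in ns for a given guest TSC: system_time + scaled TSC delta. */
static uint64_t pvclock_ns(const struct pvti *p, uint64_t tsc)
{
	uint64_t delta = tsc - p->tsc_timestamp;

	if (p->tsc_shift < 0)
		delta >>= -p->tsc_shift;
	else
		delta <<= p->tsc_shift;

	/* 32.32 fixed-point multiply with a 128-bit intermediate. */
	return p->system_time +
	       (uint64_t)(((unsigned __int128)delta * p->tsc_to_system_mul) >> 32);
}

/*
 * Evaluate both structures at the same guest TSC; the signed difference
 * is what the original structure says minus what the new one says.
 */
static int64_t pvclock_correction(const struct pvti *orig,
				  const struct pvti *cur,
				  uint64_t guest_tsc)
{
	return (int64_t)(pvclock_ns(orig, guest_tsc) - pvclock_ns(cur, guest_tsc));
}
```

In the patch this difference is added to kvmclock_offset. The model says
nothing about locking, which is what my question above is about.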


Thank you very much!

Dongli Zhang


> +
> +	kvm_for_each_vcpu(i, vcpu, kvm)
> +		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> +
> +	return 0;
> +}
> +
>  int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>  {
>  	struct kvm *kvm = filp->private_data;
> @@ -7246,6 +7327,12 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>  	case KVM_GET_CLOCK:
>  		r = kvm_vm_ioctl_get_clock(kvm, argp);
>  		break;
> +	case KVM_SET_CLOCK_GUEST:
> +		r = kvm_vm_ioctl_set_clock_guest(kvm, argp);
> +		break;
> +	case KVM_GET_CLOCK_GUEST:
> +		r = kvm_vm_ioctl_get_clock_guest(kvm, argp);
> +		break;
>  	case KVM_SET_TSC_KHZ: {
>  		u32 user_tsc_khz;
>  
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 2190adbe3002..0d306311e4d6 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1548,4 +1548,7 @@ struct kvm_create_guest_memfd {
>  	__u64 reserved[6];
>  };
>  
> +#define KVM_SET_CLOCK_GUEST       _IOW(KVMIO,  0xd5, struct pvclock_vcpu_time_info)
> +#define KVM_GET_CLOCK_GUEST       _IOR(KVMIO,  0xd6, struct pvclock_vcpu_time_info)
> +
>  #endif /* __LINUX_KVM_H */