linux-kernel - Re: [PATCH v2 1/2] KVM: x86: Add KVM_[GS]ET_CLOCK

Message-ID: <005911c5-7f9d-4397-8145-a1ad4494484d@xen.org>
Date: Wed, 10 Apr 2024 11:29:13 +0100
From: Paul Durrant <xadimgnik@...il.com>
To: Jack Allister <jalliste@...zon.com>
Cc: bp@...en8.de, corbet@....net, dave.hansen@...ux.intel.com,
 dwmw2@...radead.org, hpa@...or.com, kvm@...r.kernel.org,
 linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org, mingo@...hat.com,
 pbonzini@...hat.com, seanjc@...gle.com, tglx@...utronix.de, x86@...nel.org,
 Dongli Zhang <dongli.zhang@...cle.com>
Subject: Re: [PATCH v2 1/2] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate
 KVM clock migration

On 10/04/2024 10:52, Jack Allister wrote:
> In the common case (where kvm->arch.use_master_clock is true), the KVM
> clock is defined as a simple arithmetic function of the guest TSC, based on
> a reference point stored in kvm->arch.master_kernel_ns and
> kvm->arch.master_cycle_now.
> 
> The existing KVM_[GS]ET_CLOCK functionality does not allow for this
> relationship to be precisely saved and restored by userspace. All it can
> currently do is set the KVM clock at a given UTC reference time, which is
> necessarily imprecise.
> 
> So on live update, the guest TSC can remain cycle accurate at precisely the
> same offset from the host TSC, but there is no way for userspace to restore
> the KVM clock accurately.
> 
> Even on live migration to a new host, where the accuracy of the guest
> timekeeping is fundamentally limited by the accuracy of wallclock
> synchronization between the source and destination hosts, the clock jump
> experienced by the guest's TSC and its KVM clock should at least be
> *consistent*. Even when the guest TSC suffers a discontinuity, its KVM
> clock should still remain the *same* arithmetic function of the guest TSC,
> and not suffer an *additional* discontinuity.
> 
> To allow for accurate migration of the KVM clock, add per-vCPU ioctls which
> save and restore the actual PV clock info in pvclock_vcpu_time_info.
> 
> The restoration in KVM_SET_CLOCK_GUEST works by creating a new reference
> point in time just as kvm_update_masterclock() does, and calculating the
> corresponding guest TSC value. This guest TSC value is then passed through
> the user-provided pvclock structure to generate the *intended* KVM clock
> value at that point in time, and through the *actual* KVM clock calculation.
> Then kvm->arch.kvmclock_offset is adjusted to eliminate the difference.
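
To make the arithmetic in this paragraph concrete, here is a minimal userspace sketch of the usual pvclock scaling (TSC delta, shifted, then multiplied by a 32.32 fixed-point factor) and of the difference that the offset adjustment would have to absorb. The names struct pvti_model, scale_delta(), pvti_to_ns() and kvmclock_correction() are illustrative and do not come from the patch, and the sign convention of the correction is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the fields of struct pvclock_vcpu_time_info used here. */
struct pvti_model {
	uint64_t tsc_timestamp;     /* guest TSC at the reference point */
	uint64_t system_time;       /* KVM clock (ns) at the reference point */
	uint32_t tsc_to_system_mul; /* 32.32 fixed-point multiplier */
	int8_t   tsc_shift;         /* shift applied to the delta before scaling */
};

/* Computes (shifted delta * mul) >> 32 without needing a 128-bit type. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	assert(shift > -64 && shift < 64);

	if (shift >= 0)
		delta <<= shift;
	else
		delta >>= -shift;

	/* Split the 64x32 multiply into high and low halves; the sum is exact. */
	return (delta >> 32) * mul + (((delta & 0xffffffffu) * mul) >> 32);
}

/* KVM clock value (ns) that a pvclock structure yields for a guest TSC. */
static uint64_t pvti_to_ns(const struct pvti_model *p, uint64_t guest_tsc)
{
	return p->system_time +
	       scale_delta(guest_tsc - p->tsc_timestamp,
			   p->tsc_to_system_mul, p->tsc_shift);
}

/*
 * Hypothetical helper: nanoseconds by which the *intended* clock (from the
 * user-provided structure) leads the *actual* clock at the same guest TSC.
 * Assumption: this is the amount that would be added to kvmclock_offset.
 */
static int64_t kvmclock_correction(const struct pvti_model *intended,
				   const struct pvti_model *actual,
				   uint64_t guest_tsc)
{
	return (int64_t)(pvti_to_ns(intended, guest_tsc) -
			 pvti_to_ns(actual, guest_tsc));
}
```

Because both values are evaluated at the same guest TSC, the correction does not depend on how long the save/restore took, only on how the two reference points differ.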
> 
> Where kvm->arch.use_master_clock is false (because the host TSC is
> unreliable, or the guest TSCs are configured strangely), the KVM clock
> is *not* defined as a function of the guest TSC so KVM_GET_CLOCK_GUEST
> returns an error. In this case, as documented, userspace shall use the
> legacy KVM_GET_CLOCK ioctl. The loss of precision is acceptable in this
> case since the clocks are imprecise in this mode anyway.
> 
> On *restoration*, if kvm->arch.use_master_clock is false, an error is
> returned for similar reasons and userspace shall fall back to using
> KVM_SET_CLOCK. This does mean that, as documented, userspace needs to use
> *both* KVM_GET_CLOCK_GUEST and KVM_GET_CLOCK and send both results with the
> migration data (unless the intent is to refuse to resume on a host with bad
> TSC).
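
The restore-side policy described here could be sketched roughly as follows. This is a userspace model only: the two callbacks stand in for ioctl() wrappers around KVM_SET_CLOCK_GUEST and KVM_SET_CLOCK, and every name in it (clock_ops, restore_kvm_clock(), the fake_* stubs) is hypothetical rather than part of the patch or of any VMM.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical wrappers; each returns 0 on success or a negative errno. */
struct clock_ops {
	int (*set_clock_guest)(void *ctx); /* would issue KVM_SET_CLOCK_GUEST */
	int (*set_clock)(void *ctx);       /* would issue KVM_SET_CLOCK */
};

enum restore_result { RESTORED_PRECISE, RESTORED_LEGACY, RESTORE_REFUSED };

/*
 * Try the precise per-vCPU restore first. If it fails (for instance because
 * the destination is not in master clock mode), either fall back to the
 * legacy VM-wide clock or refuse the migration, depending on policy.
 */
static enum restore_result restore_kvm_clock(const struct clock_ops *ops,
					     void *ctx, bool allow_legacy)
{
	assert(ops && ops->set_clock_guest && ops->set_clock);

	if (ops->set_clock_guest(ctx) == 0)
		return RESTORED_PRECISE;
	if (!allow_legacy)
		return RESTORE_REFUSED;
	return ops->set_clock(ctx) == 0 ? RESTORED_LEGACY : RESTORE_REFUSED;
}

/* Stubs for demonstration only: a call that works and one that fails. */
static int fake_ok(void *ctx) { (void)ctx; return 0; }
static int fake_einval(void *ctx) { (void)ctx; return -EINVAL; }
```

The fallback path only works if the source sent both the pvclock structure and the legacy clock data, which is why the save side has to call both GET ioctls.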
> 
> (It may have been possible to make KVM_SET_CLOCK_GUEST "good enough" in the
> non-masterclock mode, as that mode is necessarily imprecise anyway. The
> explicit fallback allows userspace to deliberately fail migration to a host
> with misbehaving TSC where master clock mode wouldn't be active.)
> 
> Suggested-by: David Woodhouse <dwmw2@...radead.org>
> Signed-off-by: Jack Allister <jalliste@...zon.com>
> CC: Paul Durrant <paul@....org>
> CC: Dongli Zhang <dongli.zhang@...cle.com>
> ---
>   Documentation/virt/kvm/api.rst |  37 ++++++++++
>   arch/x86/kvm/x86.c             | 124 +++++++++++++++++++++++++++++++++
>   include/uapi/linux/kvm.h       |   3 +
>   3 files changed, 164 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 0b5a33ee71ee..80fcd93bba1b 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6352,6 +6352,43 @@ a single guest_memfd file, but the bound ranges must not overlap).
>   
>   See KVM_SET_USER_MEMORY_REGION2 for additional details.
>   
> +4.143 KVM_GET_CLOCK_GUEST
> +----------------------------
> +
> +:Capability: none
> +:Architectures: x86
> +:Type: vcpu ioctl
> +:Parameters: struct pvclock_vcpu_time_info (out)
> +:Returns: 0 on success, <0 on error
> +
> +Retrieves the current time information structure used for KVM/PV clocks,
> +in precisely the form advertised to the guest vCPU, which gives parameters
> +for a direct conversion from a guest TSC value to nanoseconds.
> +
> +When the KVM clock is not in "master clock" mode, for example because the
> +host TSC is unreliable or the guest TSCs are oddly configured, the KVM clock
> +is actually defined by the host CLOCK_MONOTONIC_RAW instead of the guest TSC.
> +In this case, the KVM_GET_CLOCK_GUEST ioctl returns -EINVAL.
> +
> +4.144 KVM_SET_CLOCK_GUEST
> +----------------------------
> +
> +:Capability: none
> +:Architectures: x86
> +:Type: vcpu ioctl
> +:Parameters: struct pvclock_vcpu_time_info (in)
> +:Returns: 0 on success, <0 on error
> +
> +Sets the KVM clock (for the whole VM) in terms of the vCPU TSC, using the
> +pvclock structure as returned by KVM_GET_CLOCK_GUEST. This allows the precise
> +arithmetic relationship between guest TSC and KVM clock to be preserved by
> +userspace across migration.
> +
> +When the KVM clock is not in "master clock" mode, and the KVM clock is actually
> +defined by the host CLOCK_MONOTONIC_RAW, this ioctl returns -EINVAL.

EINVAL doesn't seem appropriate here. ENOTSUP, perhaps? The same goes for
KVM_GET_CLOCK_GUEST, I suppose.

> Userspace
> +may choose to set the clock using the less precise KVM_SET_CLOCK ioctl, or may
> +choose to fail, denying migration to a host whose TSC is misbehaving.
> +
>   5. The kvm_run structure
>   ========================
>   
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 47d9f03b7778..d5cae3ead04d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5859,6 +5859,124 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
>   	}
>   }
>   
> +static int kvm_vcpu_ioctl_get_clock_guest(struct kvm_vcpu *v, void __user *argp)
> +{
> +	struct pvclock_vcpu_time_info *vcpu_pvti = &v->arch.hv_clock;
> +	struct pvclock_vcpu_time_info local_pvti = { 0 };
> +	struct kvm_arch *ka = &v->kvm->arch;
> +	uint64_t host_tsc, guest_tsc;
> +	bool use_master_clock;
> +	uint64_t kernel_ns;
> +	unsigned int seq;
> +
> +	/*
> +	 * CLOCK_MONOTONIC_RAW is not suitable for GET/SET API,
> +	 * see kvm_vcpu_ioctl_set_clock_guest equivalent comment.
> +	 */
> +	if (!static_cpu_has(X86_FEATURE_CONSTANT_TSC))
> +		return -EINVAL;
> +
> +	/*
> +	 * Querying needs to be performed in a seqcount loop as it's possible
> +	 * another vCPU has triggered an update of the master clock. If so we
> +	 * should store the host TSC & time at this point.
> +	 */
> +	do {
> +		seq = read_seqcount_begin(&ka->pvclock_sc);
> +		use_master_clock = ka->use_master_clock;
> +		if (use_master_clock) {
> +			host_tsc = ka->master_cycle_now;
> +			kernel_ns = ka->master_kernel_ns;
> +		}
> +	} while (read_seqcount_retry(&ka->pvclock_sc, seq));

You could bail from the loop if `use_master_clock` is false, couldn't you?
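
A rough shape for that early exit, written as a standalone single-threaded model rather than kernel code. The seq_model type and its two helpers are simplified stand-ins for seqcount_t, read_seqcount_begin() and read_seqcount_retry() (no memory barriers, no spinning on an odd count), and snapshot_master_clock() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's seqcount_t. */
struct seq_model {
	unsigned int sequence; /* odd while a writer is mid-update */
};

static unsigned int seq_read_begin(const struct seq_model *s)
{
	return s->sequence;
}

static bool seq_read_retry(const struct seq_model *s, unsigned int start)
{
	return (start & 1) || s->sequence != start;
}

/* Stand-in for the kvm_arch fields read by the loop. */
struct ka_model {
	struct seq_model pvclock_sc;
	bool use_master_clock;
	uint64_t master_cycle_now;
	uint64_t master_kernel_ns;
};

/*
 * Returns false as soon as master clock mode is seen to be off, without
 * waiting for the retry check. Otherwise fills in a consistent snapshot.
 */
static bool snapshot_master_clock(const struct ka_model *ka,
				  uint64_t *host_tsc, uint64_t *kernel_ns)
{
	unsigned int seq;

	assert(host_tsc && kernel_ns);

	do {
		seq = seq_read_begin(&ka->pvclock_sc);
		if (!ka->use_master_clock)
			return false; /* caller would return the error here */
		*host_tsc = ka->master_cycle_now;
		*kernel_ns = ka->master_kernel_ns;
	} while (seq_read_retry(&ka->pvclock_sc, seq));

	return true;
}
```

One trade-off to note: returning before the retry check means acting on a flag that may have been read while an update was in flight, which is presumably acceptable only because the result is an error either way.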

> +
> +	if (!use_master_clock)
> +		return -EINVAL;
> +
> +	/*
> +	 * It's possible that this vCPU doesn't have a HVCLOCK configured
> +	 * but the other vCPUs may. If this is the case calculate based
> +	 * upon the time gathered in the seqcount but do not update the
> +	 * vCPU specific PVTI. If we have one, then use that.

Given this is a per-vCPU ioctl, why not fail in the case the vCPU 
doesn't have HVCLOCK configured? Or is your intention that a GET/SET 
should always work if TSC is stable?

> +	 */
> +	if (!vcpu_pvti->tsc_timestamp && !vcpu_pvti->system_time) {
> +		guest_tsc = kvm_read_l1_tsc(v, host_tsc);
> +
> +		local_pvti.tsc_timestamp = guest_tsc;
> +		local_pvti.system_time = kernel_ns + ka->kvmclock_offset;
> +	} else {
> +		local_pvti = *vcpu_pvti;
> +	}
> +
> +	if (copy_to_user(argp, &local_pvti, sizeof(local_pvti)))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +