Date: Thu, 27 Jun 2024 15:50:27 +0200
From: Peter Hilber <peter.hilber@...nsynergy.com>
To: David Woodhouse <dwmw2@...radead.org>, linux-kernel@...r.kernel.org,
 virtualization@...ts.linux.dev, linux-arm-kernel@...ts.infradead.org,
 linux-rtc@...r.kernel.org, "Ridoux, Julien" <ridouxj@...zon.com>,
 virtio-dev@...ts.linux.dev, "Luu, Ryan" <rluu@...zon.com>
Cc: "Christopher S. Hall" <christopher.s.hall@...el.com>,
 Jason Wang <jasowang@...hat.com>, John Stultz <jstultz@...gle.com>,
 "Michael S. Tsirkin" <mst@...hat.com>, netdev@...r.kernel.org,
 Richard Cochran <richardcochran@...il.com>, Stephen Boyd <sboyd@...nel.org>,
 Thomas Gleixner <tglx@...utronix.de>, Xuan Zhuo
 <xuanzhuo@...ux.alibaba.com>, Marc Zyngier <maz@...nel.org>,
 Mark Rutland <mark.rutland@....com>,
 Daniel Lezcano <daniel.lezcano@...aro.org>,
 Alessandro Zummo <a.zummo@...ertech.it>,
 Alexandre Belloni <alexandre.belloni@...tlin.com>
Subject: Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

On 25.06.24 21:01, David Woodhouse wrote:
> From: David Woodhouse <dwmw@...zon.co.uk>
> 
> The vmclock "device" provides a shared memory region with precision clock
> information. By using shared memory, it is safe across Live Migration.
> 
> Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> KVM clock values. Unlike the KVM PTP clock, it does this only when it is
> actually helpful.
> 
> The memory region of the device is also exposed to userspace so it can be
> read or memory-mapped by applications which need reliable notification of
> clock disruptions.
> 
> Signed-off-by: David Woodhouse <dwmw@...zon.co.uk>
> ---
> 
> v2: 
>  • Add gettimex64() support
>  • Convert TSC values to KVM clock when appropriate
>  • Require int128 support
>  • Add counter_period_shift 
>  • Add timeout when seq_count is invalid
>  • Add flags field
>  • Better comments in vmclock ABI structure
>  • Explicitly forbid smearing (as clock rates would need to change)

Leap second smearing information could still be conveyed through the
vmclock_abi. AFAIU, to cover the popular smearing variants, it should be
enough to indicate whether the driver should apply linear or cosine
smearing, together with the start and end times.
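
For illustration (field and constant names below are mine, not from the
patch), the structure could carry something like:

	uint8_t  leap_smear_hint;      /* VMCLOCK_SMEAR_{NONE,LINEAR,COSINE} */
	uint64_t leap_smear_start_sec; /* UTC second at which smearing starts */
	uint64_t leap_smear_end_sec;   /* UTC second at which smearing ends */

and a guest applying the linear variant for one inserted leap second
would derive the offset itself, roughly:

	/* 0 at smear start, -1s at smear end; host time stays true UTC */
	offset_ns = -(int64_t)NSEC_PER_SEC * (now_sec - start_sec)
		    / (end_sec - start_sec);

That way the advertised counter period is left untouched, which seems to
be the main objection to smearing in the structure itself.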

> 
>  drivers/ptp/Kconfig          |  13 +
>  drivers/ptp/Makefile         |   1 +
>  drivers/ptp/ptp_vmclock.c    | 516 +++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vmclock.h | 138 ++++++++++
>  4 files changed, 668 insertions(+)
>  create mode 100644 drivers/ptp/ptp_vmclock.c
>  create mode 100644 include/uapi/linux/vmclock.h
> 

[...]

> +
> +/*
> + * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64
> + * and add the fractional second part of the reference time.
> + *
> + * The result is a 128-bit value; the top 64 bits are whole seconds, and the
> + * low 64 bits are the fractional second in units of (seconds >> 64).
> + *
> + * This requires __int128; without it, the calculation would have to be
> + * performed 32 bits at a time to avoid overflow.
> + */
> +static inline uint64_t mul_u64_u64_shr_add_u64(uint64_t *res_hi, uint64_t delta,
> +					       uint64_t period, uint8_t shift,
> +					       uint64_t frac_sec)
> +{
> +	unsigned __int128 res = (unsigned __int128)delta * period;
> +
> +	res >>= shift;
> +	res += frac_sec;
> +	*res_hi = res >> 64;
> +	return (uint64_t)res;
> +}
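
For reference, on a build without __int128 the usual 32-bits-at-a-time
decomposition would look roughly like this (untested sketch with the
same semantics as above, assuming 0 < shift < 64 as should hold for
counter_period_shift; cf. the non-__int128 fallback of
mul_u64_u64_shr() in include/linux/math64.h, which takes a similar
approach):

	uint64_t dl = delta & 0xffffffff, dh = delta >> 32;
	uint64_t pl = period & 0xffffffff, ph = period >> 32;
	uint64_t lo, mid, hi, carry;

	/* 128-bit product in hi:lo from four 32x32->64 partial products */
	lo = dl * pl;
	hi = dh * ph;

	mid = dl * ph;
	lo += mid << 32;
	carry = lo < (mid << 32);
	hi += (mid >> 32) + carry;

	mid = dh * pl;
	lo += mid << 32;
	carry = lo < (mid << 32);
	hi += (mid >> 32) + carry;

	/* >>= shift (would need a guard if shift could be 0) */
	lo = (lo >> shift) | (hi << (64 - shift));
	hi >>= shift;

	/* += frac_sec, carrying into the seconds */
	lo += frac_sec;
	hi += (lo < frac_sec);

	*res_hi = hi;
	return lo;
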
> +
> +static int vmclock_get_crosststamp(struct vmclock_state *st,
> +				   struct ptp_system_timestamp *sts,
> +				   struct system_counterval_t *system_counter,
> +				   struct timespec64 *tspec)
> +{
> +	ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
> +	struct system_time_snapshot systime_snapshot;
> +	uint64_t cycle, delta, seq, frac_sec;
> +
> +#ifdef CONFIG_X86
> +	/*
> +	 * We'd expect the hypervisor to know this and to report the clock
> +	 * status as VMCLOCK_STATUS_UNRELIABLE. But be paranoid.
> +	 */
> +	if (check_tsc_unstable())
> +		return -EINVAL;
> +#endif
> +
> +	while (1) {
> +		seq = st->clk->seq_count & ~1ULL;
> +		virt_rmb();
> +
> +		if (st->clk->clock_status == VMCLOCK_STATUS_UNRELIABLE)
> +			return -EINVAL;
> +
> +		/*
> +		 * When invoked for gettimex64(), fill in the pre/post system
> +		 * times. The simple case is when system time is based on the
> +		 * same counter as st->cs_id, in which case all three times
> +		 * will be derived from the *same* counter value.
> +		 *
> +		 * If the system isn't using the same counter, then the value
> +		 * from ktime_get_snapshot() will still be used as pre_ts, and
> +		 * ptp_read_system_postts() is called to populate postts after
> +		 * calling get_cycles().
> +		 *
> +		 * The conversion to timespec64 happens further down, outside
> +		 * the seq_count loop.
> +		 */
> +		if (sts) {
> +			ktime_get_snapshot(&systime_snapshot);
> +			if (systime_snapshot.cs_id == st->cs_id) {
> +				cycle = systime_snapshot.cycles;
> +			} else {
> +				cycle = get_cycles();
> +				ptp_read_system_postts(sts);
> +			}
> +		} else {
> +			cycle = get_cycles();
> +		}
> +
> +		delta = cycle - st->clk->counter_value;

AFAIU, in the general case this needs to be masked for non-64-bit counters.
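E.g. something like this, with counter_mask being a hypothetical
per-device field derived from the counter width:

	delta = (cycle - st->clk->counter_value) & st->counter_mask;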

> +
> +		frac_sec = mul_u64_u64_shr_add_u64(&tspec->tv_sec, delta,
> +						   st->clk->counter_period_frac_sec,
> +						   st->clk->counter_period_shift,
> +						   st->clk->utc_time_frac_sec);
> +		tspec->tv_nsec = mul_u64_u64_shr(frac_sec, NSEC_PER_SEC, 64);
> +		tspec->tv_sec += st->clk->utc_time_sec;
> +
> +		virt_rmb();
> +		if (seq == st->clk->seq_count)
> +			break;
> +
> +		if (ktime_after(ktime_get(), deadline))
> +			return -ETIMEDOUT;
> +	}
> +
> +	if (system_counter) {
> +		system_counter->cycles = cycle;
> +		system_counter->cs_id = st->cs_id;
> +	}
> +
> +	if (sts) {
> +		sts->pre_ts = ktime_to_timespec64(systime_snapshot.real);
> +		if (systime_snapshot.cs_id == st->cs_id)
> +			sts->post_ts = sts->pre_ts;
> +	}
> +
> +	return 0;
> +}
> +

[...]

> +
> +static const struct ptp_clock_info ptp_vmclock_info = {
> +	.owner		= THIS_MODULE,
> +	.max_adj	= 0,
> +	.n_ext_ts	= 0,
> +	.n_pins		= 0,
> +	.pps		= 0,
> +	.adjfine	= ptp_vmclock_adjfine,
> +	.adjtime	= ptp_vmclock_adjtime,
> +	.gettime64	= ptp_vmclock_gettime,

The .gettime64 op is now unneeded.
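(AFAIR the PTP core only falls back to .gettime64 when .gettimex64 is
absent; from memory, ptp_clock_gettime() in drivers/ptp/ptp_clock.c
does roughly:

	if (ptp->info->gettimex64)
		err = ptp->info->gettimex64(ptp->info, tp, NULL);
	else
		err = ptp->info->gettime64(ptp->info, tp);

so the op can simply be dropped here.)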

> +	.gettimex64	= ptp_vmclock_gettimex,
> +	.settime64	= ptp_vmclock_settime,
> +	.enable		= ptp_vmclock_enable,
> +	.getcrosststamp = ptp_vmclock_getcrosststamp,
> +};
> +

[...]

> diff --git a/include/uapi/linux/vmclock.h b/include/uapi/linux/vmclock.h
> new file mode 100644
> index 000000000000..cf0f22205e79
> --- /dev/null
> +++ b/include/uapi/linux/vmclock.h
> @@ -0,0 +1,138 @@
> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
> +
> +/*
> + * This structure provides a vDSO-style clock to VM guests, exposing the
> + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
> + * counter, etc.) and real time. It is designed to address the problem of
> + * live migration, which other clock enlightenments do not.
> + *
> + * When a guest is live migrated, this affects the clock in two ways.
> + *
> + * First, even between identical hosts the actual frequency of the underlying
> + * counter will change within the tolerances of its specification (typically
> + * ±50PPM, or 4 seconds a day). The frequency also varies over time on the
> + * same host, but can be tracked by NTP as it generally varies slowly. With
> + * live migration there is a step change in the frequency, with no warning.
> + *
> + * Second, there may be a step change in the value of the counter itself, as
> + * its accuracy is limited by the precision of the NTP synchronization on the
> + * source and destination hosts.
> + *
> + * So any calibration (NTP, PTP, etc.) which the guest has done on the source
> + * host before migration is invalid, and needs to be redone on the new host.
> + *
> + * In its most basic mode, this structure provides only an indication to the
> + * guest that live migration has occurred. This allows the guest to know that
> + * its clock is invalid and take remedial action. For applications that need
> + * reliable accurate timestamps (e.g. distributed databases), the structure
> + * can be mapped all the way to userspace. This allows the application to see
> + * directly for itself that the clock is disrupted and take appropriate
> + * action, even when using a vDSO-style method to get the time instead of a
> + * system call.
> + *
> + * In its more advanced mode, this structure can also be used to expose the
> + * precise relationship of the CPU counter to real time, as calibrated by the
> + * host. This means that userspace applications can have accurate time
> + * immediately after live migration, rather than having to pause operations
> + * and wait for NTP to recover. This mode does, of course, rely on the
> + * counter being reliable and consistent across CPUs.
> + *
> + * Note that this must be true UTC, never with smeared leap seconds. If a
> + * guest wishes to construct a smeared clock, it can do so. Presenting a
> + * smeared clock through this interface would be problematic because it
> + * actually messes with the apparent counter *period*. A linear smearing
> + * of 1 ms per second would effectively tweak the counter period by 1000PPM
> + * at the start/end of the smearing period, while a sinusoidal smear would
> + * basically be impossible to represent.

Clock types other than UTC could also be supported: TAI, monotonic.
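E.g. via a discriminator field, sketched here with made-up names:

	uint8_t clock_type;
#define VMCLOCK_CLOCK_UTC	0	/* true UTC, never smeared */
#define VMCLOCK_CLOCK_TAI	1	/* no leap second ambiguity */
#define VMCLOCK_CLOCK_MONOTONIC	2

A TAI clock would also sidestep the leap second question entirely.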

> + */
> +
> +#ifndef __VMCLOCK_H__
> +#define __VMCLOCK_H__
> +
> +#ifdef __KERNEL__
> +#include <linux/types.h>
> +#else
> +#include <stdint.h>
> +#endif
> +
> +struct vmclock_abi {
> +	uint32_t magic;
> +#define VMCLOCK_MAGIC	0x4b4c4356 /* "VCLK" */
> +	uint16_t size;		/* Size of page containing this structure */
> +	uint16_t version;	/* 1 */
> +
> +	/* Sequence lock. Low bit means an update is in progress. */
> +	uint64_t seq_count;
> +
> +	/*
> +	 * This field changes to another non-repeating value when the CPU
> +	 * counter is disrupted, for example on live migration.
> +	 */
> +	uint64_t disruption_marker;

The field could also change when the clock is stepped (leap seconds
excepted), or when the clock frequency is slewed.
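
For the userspace mapping case mentioned in the commit message, a
reader watching for disruption might do roughly the following (sketch
only; assumes the page is mmap()ed from the device, last_seen_marker is
the application's saved copy, and C11 <stdatomic.h> fences stand in for
the kernel's virt_rmb()):

	const volatile struct vmclock_abi *clk = map; /* mmap()ed page */
	uint64_t seq, marker;

	do {
		seq = clk->seq_count & ~1ULL;
		atomic_thread_fence(memory_order_acquire);
		marker = clk->disruption_marker;
		atomic_thread_fence(memory_order_acquire);
	} while (clk->seq_count != seq);

	if (marker != last_seen_marker) {
		/* Counter was disrupted (e.g. live migration): any
		 * previously calibrated time is suspect until NTP/PTP
		 * reconverges or the structure's rate info is used.
		 */
	}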
