Message-ID: <8d9d7ce2-4dd1-4f54-a468-79ef5970a708@opensynergy.com>
Date: Thu, 27 Jun 2024 15:50:27 +0200
From: Peter Hilber <peter.hilber@...nsynergy.com>
To: David Woodhouse <dwmw2@...radead.org>, linux-kernel@...r.kernel.org,
virtualization@...ts.linux.dev, linux-arm-kernel@...ts.infradead.org,
linux-rtc@...r.kernel.org, "Ridoux, Julien" <ridouxj@...zon.com>,
virtio-dev@...ts.linux.dev, "Luu, Ryan" <rluu@...zon.com>
Cc: "Christopher S. Hall" <christopher.s.hall@...el.com>,
Jason Wang <jasowang@...hat.com>, John Stultz <jstultz@...gle.com>,
"Michael S. Tsirkin" <mst@...hat.com>, netdev@...r.kernel.org,
Richard Cochran <richardcochran@...il.com>, Stephen Boyd <sboyd@...nel.org>,
Thomas Gleixner <tglx@...utronix.de>, Xuan Zhuo
<xuanzhuo@...ux.alibaba.com>, Marc Zyngier <maz@...nel.org>,
Mark Rutland <mark.rutland@....com>,
Daniel Lezcano <daniel.lezcano@...aro.org>,
Alessandro Zummo <a.zummo@...ertech.it>,
Alexandre Belloni <alexandre.belloni@...tlin.com>
Subject: Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
On 25.06.24 21:01, David Woodhouse wrote:
> From: David Woodhouse <dwmw@...zon.co.uk>
>
> The vmclock "device" provides a shared memory region with precision clock
> information. By using shared memory, it is safe across Live Migration.
>
> Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> KVM clock values. Unlike the KVM PTP clock, it does so only when doing so
> is actually helpful.
>
> The memory region of the device is also exposed to userspace so it can be
> read or memory-mapped by applications which need reliable notification of
> clock disruptions.
>
> Signed-off-by: David Woodhouse <dwmw@...zon.co.uk>
> ---
>
> v2:
> • Add gettimex64() support
> • Convert TSC values to KVM clock when appropriate
> • Require int128 support
> • Add counter_period_shift
> • Add timeout when seq_count is invalid
> • Add flags field
> • Better comments in vmclock ABI structure
> • Explicitly forbid smearing (as clock rates would need to change)
Leap second smearing information could still be conveyed through the
vmclock_abi. AFAIU, to cover the popular smearing variants, it would be
enough to indicate whether the driver should apply linear or cosine
smearing, plus the smear start and end times (see the sketch below).
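
For illustration, a rough sketch of such fields (all names hypothetical,
not part of this patch):

        /* Hypothetical vmclock_abi additions for leap second smearing. */
        uint8_t leap_smear_hint;
#define VMCLOCK_SMEAR_NONE     0
#define VMCLOCK_SMEAR_LINEAR   1 /* linear interpolation over the window */
#define VMCLOCK_SMEAR_COSINE   2 /* cosine smearing, as in UTC-SLS */
        uint64_t leap_smear_start_sec; /* UTC second where smearing starts */
        uint64_t leap_smear_end_sec;   /* UTC second where smearing ends */

The guest would then apply the chosen smearing function itself, so the
advertised counter period stays untouched.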
>
>  drivers/ptp/Kconfig          |  13 +
>  drivers/ptp/Makefile         |   1 +
>  drivers/ptp/ptp_vmclock.c    | 516 +++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vmclock.h | 138 ++++++++++
>  4 files changed, 668 insertions(+)
> create mode 100644 drivers/ptp/ptp_vmclock.c
> create mode 100644 include/uapi/linux/vmclock.h
>
[...]
> +
> +/*
> + * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64
> + * and add the fractional second part of the reference time.
> + *
> + * The result is a 128-bit value; the top 64 bits are whole seconds, and
> + * the low 64 bits are the fractional part, in units of (seconds >> 64).
> + *
> + * The 64x64-bit multiply relies on unsigned __int128, which is now a hard
> + * requirement (see "Require int128 support" in the changelog above).
> + */
> +static inline uint64_t mul_u64_u64_shr_add_u64(uint64_t *res_hi, uint64_t delta,
> +                                               uint64_t period, uint8_t shift,
> +                                               uint64_t frac_sec)
> +{
> +        unsigned __int128 res = (unsigned __int128)delta * period;
> +
> +        res >>= shift;
> +        res += frac_sec;
> +        *res_hi = res >> 64;
> +        return (uint64_t)res;
> +}
> +
> +static int vmclock_get_crosststamp(struct vmclock_state *st,
> +                                   struct ptp_system_timestamp *sts,
> +                                   struct system_counterval_t *system_counter,
> +                                   struct timespec64 *tspec)
> +{
> +        ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
> +        struct system_time_snapshot systime_snapshot;
> +        uint64_t cycle, delta, seq, frac_sec;
> +
> +#ifdef CONFIG_X86
> +        /*
> +         * We'd expect the hypervisor to know this and to report the clock
> +         * status as VMCLOCK_STATUS_UNRELIABLE. But be paranoid.
> +         */
> +        if (check_tsc_unstable())
> +                return -EINVAL;
> +#endif
> +
> +        while (1) {
> +                seq = st->clk->seq_count & ~1ULL;
> +                virt_rmb();
> +
> +                if (st->clk->clock_status == VMCLOCK_STATUS_UNRELIABLE)
> +                        return -EINVAL;
> +
> +                /*
> +                 * When invoked for gettimex64(), fill in the pre/post system
> +                 * times. The simple case is when system time is based on the
> +                 * same counter as st->cs_id, in which case all three times
> +                 * will be derived from the *same* counter value.
> +                 *
> +                 * If the system isn't using the same counter, then the value
> +                 * from ktime_get_snapshot() will still be used as pre_ts, and
> +                 * ptp_read_system_postts() is called to populate postts after
> +                 * calling get_cycles().
> +                 *
> +                 * The conversion to timespec64 happens further down, outside
> +                 * the seq_count loop.
> +                 */
> +                if (sts) {
> +                        ktime_get_snapshot(&systime_snapshot);
> +                        if (systime_snapshot.cs_id == st->cs_id) {
> +                                cycle = systime_snapshot.cycles;
> +                        } else {
> +                                cycle = get_cycles();
> +                                ptp_read_system_postts(sts);
> +                        }
> +                } else {
> +                        cycle = get_cycles();
> +                }
> +
> +                delta = cycle - st->clk->counter_value;
AFAIU in the general case this needs to be masked for non-64-bit counters
(see the sketch below).
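
Something like the following, assuming a hypothetical counter_mask field
(or a counter width from which to derive it) in the ABI:

        /* Sketch only: counter_mask would be (1ULL << counter_width) - 1
         * for counters narrower than 64 bits, ~0ULL otherwise. */
        delta = (cycle - st->clk->counter_value) & st->clk->counter_mask;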
> +
> +                frac_sec = mul_u64_u64_shr_add_u64(&tspec->tv_sec, delta,
> +                                                   st->clk->counter_period_frac_sec,
> +                                                   st->clk->counter_period_shift,
> +                                                   st->clk->utc_time_frac_sec);
> +                tspec->tv_nsec = mul_u64_u64_shr(frac_sec, NSEC_PER_SEC, 64);
> +                tspec->tv_sec += st->clk->utc_time_sec;
> +
> +                virt_rmb();
> +                if (seq == st->clk->seq_count)
> +                        break;
> +
> +                if (ktime_after(ktime_get(), deadline))
> +                        return -ETIMEDOUT;
> +        }
> +
> +        if (system_counter) {
> +                system_counter->cycles = cycle;
> +                system_counter->cs_id = st->cs_id;
> +        }
> +
> +        if (sts) {
> +                sts->pre_ts = ktime_to_timespec64(systime_snapshot.real);
> +                if (systime_snapshot.cs_id == st->cs_id)
> +                        sts->post_ts = sts->pre_ts;
> +        }
> +
> +        return 0;
> +}
> +
[...]
> +
> +static const struct ptp_clock_info ptp_vmclock_info = {
> +        .owner = THIS_MODULE,
> +        .max_adj = 0,
> +        .n_ext_ts = 0,
> +        .n_pins = 0,
> +        .pps = 0,
> +        .adjfine = ptp_vmclock_adjfine,
> +        .adjtime = ptp_vmclock_adjtime,
> +        .gettime64 = ptp_vmclock_gettime,
The .gettime64 op is now unneeded; the PTP core prefers .gettimex64
whenever it is implemented.
> +        .gettimex64 = ptp_vmclock_gettimex,
> +        .settime64 = ptp_vmclock_settime,
> +        .enable = ptp_vmclock_enable,
> +        .getcrosststamp = ptp_vmclock_getcrosststamp,
> +};
> +
[...]
> diff --git a/include/uapi/linux/vmclock.h b/include/uapi/linux/vmclock.h
> new file mode 100644
> index 000000000000..cf0f22205e79
> --- /dev/null
> +++ b/include/uapi/linux/vmclock.h
> @@ -0,0 +1,138 @@
> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
> +
> +/*
> + * This structure provides a vDSO-style clock to VM guests, exposing the
> + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
> + * counter, etc.) and real time. It is designed to address the problem of
> + * live migration, which other clock enlightenments do not.
> + *
> + * When a guest is live migrated, this affects the clock in two ways.
> + *
> + * First, even between identical hosts the actual frequency of the underlying
> + * counter will change within the tolerances of its specification (typically
> + * ±50PPM, or 4 seconds a day). The frequency also varies over time on the
> + * same host, but can be tracked by NTP as it generally varies slowly. With
> + * live migration there is a step change in the frequency, with no warning.
> + *
> + * Second, there may be a step change in the value of the counter itself, as
> + * its accuracy is limited by the precision of the NTP synchronization on the
> + * source and destination hosts.
> + *
> + * So any calibration (NTP, PTP, etc.) which the guest has done on the source
> + * host before migration is invalid, and needs to be redone on the new host.
> + *
> + * In its most basic mode, this structure provides only an indication to the
> + * guest that live migration has occurred. This allows the guest to know that
> + * its clock is invalid and take remedial action. For applications that need
> + * reliable accurate timestamps (e.g. distributed databases), the structure
> + * can be mapped all the way to userspace. This allows the application to see
> + * directly for itself that the clock is disrupted and take appropriate
> + * action, even when using a vDSO-style method to get the time instead of a
> + * system call.
> + *
> + * In its more advanced mode, this structure can also be used to expose the
> + * precise relationship of the CPU counter to real time, as calibrated by the
> + * host. This means that userspace applications can have accurate time
> + * immediately after live migration, rather than having to pause operations
> + * and wait for NTP to recover. This mode does, of course, rely on the
> + * counter being reliable and consistent across CPUs.
> + *
> + * Note that this must be true UTC, never with smeared leap seconds. If a
> + * guest wishes to construct a smeared clock, it can do so. Presenting a
> + * smeared clock through this interface would be problematic because it
> + * actually messes with the apparent counter *period*. A linear smearing
> + * of 1 ms per second would effectively tweak the counter period by 1000PPM
> + * at the start/end of the smearing period, while a sinusoidal smear would
> + * basically be impossible to represent.
Clock types other than UTC could also be supported, e.g. TAI or monotonic
(see the sketch below).
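
E.g. via a hypothetical type field along these lines:

        /* Hypothetical: which timescale the utc_time_* fields carry. */
        uint8_t clock_type;
#define VMCLOCK_CLOCK_UTC       0 /* true UTC, no smearing */
#define VMCLOCK_CLOCK_TAI       1
#define VMCLOCK_CLOCK_MONOTONIC 2
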
> + */
> +
> +#ifndef __VMCLOCK_H__
> +#define __VMCLOCK_H__
> +
> +#ifdef __KERNEL__
> +#include <linux/types.h>
> +#else
> +#include <stdint.h>
> +#endif
> +
> +struct vmclock_abi {
> +        uint32_t magic;
> +#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */
> +        uint16_t size;    /* Size of page containing this structure */
> +        uint16_t version; /* 1 */
> +
> +        /* Sequence lock. Low bit means an update is in progress. */
> +        uint64_t seq_count;
> +
> +        /*
> +         * This field changes to another non-repeating value when the CPU
> +         * counter is disrupted, for example on live migration.
> +         */
> +        uint64_t disruption_marker;
The field could also change when the clock is stepped (leap seconds
excepted), or when the clock frequency is slewed.
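
For a userspace mapping, a reader could detect such changes roughly as
follows (sketch only, using C11 fences; 'clk' is the mmap'ed structure and
'last_marker' holds the value seen at the last calibration):

        #include <stdatomic.h>
        #include <stdint.h>

        /* Returns nonzero if the clock was disrupted since *last_marker. */
        static int vmclock_check_disruption(const volatile struct vmclock_abi *clk,
                                            uint64_t *last_marker)
        {
                uint64_t seq, marker;

                do {
                        seq = clk->seq_count & ~1ULL;
                        atomic_thread_fence(memory_order_acquire);
                        marker = clk->disruption_marker;
                        atomic_thread_fence(memory_order_acquire);
                } while (clk->seq_count != seq);

                if (marker == *last_marker)
                        return 0;

                *last_marker = marker;
                return 1; /* recalibrate before trusting timestamps */
        }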