linux-kernel - Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <bdcafc76ea44db244b52f8a092287cb33950d5d6.camel@infradead.org>
Date: Thu, 27 Jun 2024 15:52:29 +0100
From: David Woodhouse <dwmw2@...radead.org>
To: Peter Hilber <peter.hilber@...nsynergy.com>,
 linux-kernel@...r.kernel.org,  virtualization@...ts.linux.dev,
 linux-arm-kernel@...ts.infradead.org,  linux-rtc@...r.kernel.org, "Ridoux,
 Julien" <ridouxj@...zon.com>,  virtio-dev@...ts.linux.dev, "Luu, Ryan"
 <rluu@...zon.com>
Cc: "Christopher S. Hall" <christopher.s.hall@...el.com>, Jason Wang
 <jasowang@...hat.com>, John Stultz <jstultz@...gle.com>, "Michael S.
 Tsirkin" <mst@...hat.com>, netdev@...r.kernel.org, Richard Cochran
 <richardcochran@...il.com>, Stephen Boyd <sboyd@...nel.org>, Thomas
 Gleixner <tglx@...utronix.de>, Xuan Zhuo <xuanzhuo@...ux.alibaba.com>, Marc
 Zyngier <maz@...nel.org>, Mark Rutland <mark.rutland@....com>, Daniel
 Lezcano <daniel.lezcano@...aro.org>, Alessandro Zummo
 <a.zummo@...ertech.it>,  Alexandre Belloni <alexandre.belloni@...tlin.com>
Subject: Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

On Thu, 2024-06-27 at 15:50 +0200, Peter Hilber wrote:
> On 25.06.24 21:01, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@...zon.co.uk>
> > 
> > The vmclock "device" provides a shared memory region with precision clock
> > information. By using shared memory, it is safe across Live Migration.
> > 
> > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > actually helpful.
> > 
> > The memory region of the device is also exposed to userspace so it can be
> > read or memory mapped by application which need reliable notification of
> > clock disruptions.
> > 
> > Signed-off-by: David Woodhouse <dwmw@...zon.co.uk>
> > ---
> > 
> > v2: 
> >  • Add gettimex64() support
> >  • Convert TSC values to KVM clock when appropriate
> >  • Require int128 support
> >  • Add counter_period_shift 
> >  • Add timeout when seq_count is invalid
> >  • Add flags field
> >  • Better comments in vmclock ABI structure
> >  • Explicitly forbid smearing (as clock rates would need to change)
> 
> Leap second smearing information could still be conveyed through the
> vmclock_abi. AFAIU, to cover the popular smearing variants, it should be
> enough to indicate whether the driver should apply linear or cosine
> smearing, and the start time and end time.

Yes. The clock information actually conveyed through the {counter,
time, rate} tuple should never be smeared, and should only ever be UTC.

But we could provide a hint to the guest operating system about what
type of smearing to perform, *if* it chooses to offer a clock other
than the standard CLOCK_REALTIME to its users.

I already added a flags field, so this might look something like:

        /*
         * Smearing flags. The UTC clock exposed through this structure
         * is only ever true UTC, but a guest operating system may
         * choose to offer a monotonic smeared clock to its users. This
         * merely offers a hint about what kind of smearing to perform,
         * for consistency with systems in the nearby environment.
         */
#define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */


(UTC-SLS is probably a bad example but are there formal definitions for
anything else?)


> > But we 
> >  drivers/ptp/Kconfig          |  13 +
> >  drivers/ptp/Makefile         |   1 +
> >  drivers/ptp/ptp_vmclock.c    | 516 +++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/vmclock.h | 138 ++++++++++
> >  4 files changed, 668 insertions(+)
> >  create mode 100644 drivers/ptp/ptp_vmclock.c
> >  create mode 100644 include/uapi/linux/vmclock.h
> > 
> 
> [...]
> 
> > +
> > +/*
> > + * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64
> > + * and add the fractional second part of the reference time.
> > + *
> > + * The result is a 128-bit value, the top 64 bits of which are seconds, and
> > + * the low 64 bits are (seconds >> 64).
> > + *
> > + * If __int128 isn't available, perform the calculation 32 bits at a time to
> > + * avoid overflow.
> > + */
> > +static inline uint64_t mul_u64_u64_shr_add_u64(uint64_t *res_hi, uint64_t delta,
> > +                                              uint64_t period, uint8_t shift,
> > +                                              uint64_t frac_sec)
> > +{
> > +       unsigned __int128 res = (unsigned __int128)delta * period;
> > +
> > +       res >>= shift;
> > +       res += frac_sec;
> > +       *res_hi = res >> 64;
> > +       return (uint64_t)res;
> > +}
> > +
> > +static int vmclock_get_crosststamp(struct vmclock_state *st,
> > +                                  struct ptp_system_timestamp *sts,
> > +                                  struct system_counterval_t *system_counter,
> > +                                  struct timespec64 *tspec)
> > +{
> > +       ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
> > +       struct system_time_snapshot systime_snapshot;
> > +       uint64_t cycle, delta, seq, frac_sec;
> > +
> > +#ifdef CONFIG_X86
> > +       /*
> > +        * We'd expect the hypervisor to know this and to report the clock
> > +        * status as VMCLOCK_STATUS_UNRELIABLE. But be paranoid.
> > +        */
> > +       if (check_tsc_unstable())
> > +               return -EINVAL;
> > +#endif
> > +
> > +       while (1) {
> > +               seq = st->clk->seq_count & ~1ULL;
> > +               virt_rmb();
> > +
> > +               if (st->clk->clock_status == VMCLOCK_STATUS_UNRELIABLE)
> > +                       return -EINVAL;
> > +
> > +               /*
> > +                * When invoked for gettimex64(), fill in the pre/post system
> > +                * times. The simple case is when system time is based on the
> > +                * same counter as st->cs_id, in which case all three times
> > +                * will be derived from the *same* counter value.
> > +                *
> > +                * If the system isn't using the same counter, then the value
> > +                * from ktime_get_snapshot() will still be used as pre_ts, and
> > +                * ptp_read_system_postts() is called to populate postts after
> > +                * calling get_cycles().
> > +                *
> > +                * The conversion to timespec64 happens further down, outside
> > +                * the seq_count loop.
> > +                */
> > +               if (sts) {
> > +                       ktime_get_snapshot(&systime_snapshot);
> > +                       if (systime_snapshot.cs_id == st->cs_id) {
> > +                               cycle = systime_snapshot.cycles;
> > +                       } else {
> > +                               cycle = get_cycles();
> > +                               ptp_read_system_postts(sts);
> > +                       }
> > +               } else
> > +                       cycle = get_cycles();
> > +
> > +               delta = cycle - st->clk->counter_value;
> 
> AFAIU in the general case this needs to be masked for non 64-bit counters.

Are there any non-64-bit counters that will be exposed this way? And
can we handle that with explicit knowledge of the counter type, rather
than a separate explicit field?


If we really dig deep, it gets horrible for scaled TSCs. The guest TSC
is actually calculated from the host TSC with a scale and offset:

  guest_tsc = ((host_tsc * SCALE) >> 48_or_32) + OFFSET

(The shift is 48 bits for Intel, 32 for AMD).

So when the host TSC wraps to zero, the guest TSC jumps from some non-
power-of-two upper limit, to OFFSET which is generally a *negative*
value. And will continue counting from there until it actually reaches
the 64-bit limit and wraps back to zero.


> 
> > +
> > +               frac_sec = mul_u64_u64_shr_add_u64(&tspec->tv_sec, delta,
> > +                                                  st->clk->counter_period_frac_sec,
> > +                                                  st->clk->counter_period_shift,
> > +                                                  st->clk->utc_time_frac_sec);
> > +               tspec->tv_nsec = mul_u64_u64_shr(frac_sec, NSEC_PER_SEC, 64);
> > +               tspec->tv_sec += st->clk->utc_time_sec;
> > +
> > +               virt_rmb();
> > +               if (seq == st->clk->seq_count)
> > +                       break;
> > +
> > +               if (ktime_after(ktime_get(), deadline))
> > +                       return -ETIMEDOUT;
> > +       }
> > +
> > +       if (system_counter) {
> > +               system_counter->cycles = cycle;
> > +               system_counter->cs_id = st->cs_id;
> > +       }
> > +
> > +       if (sts) {
> > +               sts->pre_ts = ktime_to_timespec64(systime_snapshot.real);
> > +               if (systime_snapshot.cs_id == st->cs_id)
> > +                       sts->post_ts = sts->pre_ts;
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> 
> [...]
> 
> > +
> > +static const struct ptp_clock_info ptp_vmclock_info = {
> > +       .owner          = THIS_MODULE,
> > +       .max_adj        = 0,
> > +       .n_ext_ts       = 0,
> > +       .n_pins         = 0,
> > +       .pps            = 0,
> > +       .adjfine        = ptp_vmclock_adjfine,
> > +       .adjtime        = ptp_vmclock_adjtime,
> > +       .gettime64      = ptp_vmclock_gettime,
> 
> The .gettime64 op is now unneeded.

Ack, thanks.

> > +       .gettimex64     = ptp_vmclock_gettimex,
> > +       .settime64      = ptp_vmclock_settime,
> > +       .enable         = ptp_vmclock_enable,
> > +       .getcrosststamp = ptp_vmclock_getcrosststamp,
> > +};
> > +
> 
> [...]
> 
> > diff --git a/include/uapi/linux/vmclock.h b/include/uapi/linux/vmclock.h
> > new file mode 100644
> > index 000000000000..cf0f22205e79
> > --- /dev/null
> > +++ b/include/uapi/linux/vmclock.h
> > @@ -0,0 +1,138 @@
> > +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
> > +
> > +/*
> > + * This structure provides a vDSO-style clock to VM guests, exposing the
> > + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
> > + * counter, etc.) and real time. It is designed to address the problem of
> > + * live migration, which other clock enlightenments do not.
> > + *
> > + * When a guest is live migrated, this affects the clock in two ways.
> > + *
> > + * First, even between identical hosts the actual frequency of the underlying
> > + * counter will change within the tolerances of its specification (typically
> > + * ±50PPM, or 4 seconds a day). The frequency also varies over time on the
> > + * same host, but can be tracked by NTP as it generally varies slowly. With
> > + * live migration there is a step change in the frequency, with no warning.
> > + *
> > + * Second, there may be a step change in the value of the counter itself, as
> > + * its accuracy is limited by the precision of the NTP synchronization on the
> > + * source and destination hosts.
> > + *
> > + * So any calibration (NTP, PTP, etc.) which the guest has done on the source
> > + * host before migration is invalid, and needs to be redone on the new host.
> > + *
> > + * In its most basic mode, this structure provides only an indication to the
> > + * guest that live migration has occurred. This allows the guest to know that
> > + * its clock is invalid and take remedial action. For applications that need
> > + * reliable accurate timestamps (e.g. distributed databases), the structure
> > + * can be mapped all the way to userspace. This allows the application to see
> > + * directly for itself that the clock is disrupted and take appropriate
> > + * action, even when using a vDSO-style method to get the time instead of a
> > + * system call.
> > + *
> > + * In its more advanced mode. this structure can also be used to expose the
> > + * precise relationship of the CPU counter to real time, as calibrated by the
> > + * host. This means that userspace applications can have accurate time
> > + * immediately after live migration, rather than having to pause operations
> > + * and wait for NTP to recover. This mode does, of course, rely on the
> > + * counter being reliable and consistent across CPUs.
> > + *
> > + * Note that this must be true UTC, never with smeared leap seconds. If a
> > + * guest wishes to construct a smeared clock, it can do so. Presenting a
> > + * smeared clock through this interface would be problematic because it
> > + * actually messes with the apparent counter *period*. A linear smearing
> > + * of 1 ms per second would effectively tweak the counter period by 1000PPM
> > + * at the start/end of the smearing period, while a sinusoidal smear would
> > + * basically be impossible to represent.
> 
> Clock types other than UTC could also be supported: TAI, monotonic.

This exposes both TAI *and* UTC, by exposing the TAI offset. Or it can
expose UTC only, without the TAI offset if that's unknown.

Should we also have a mode for exposing TAI only, for when TAI is known
but not the offset from UTC? Is that really a likely scenario? Isn't a
host much more likely to know UTC and *not* the TAI offset?

I suppose if we have *hardware* implementations of this, they could be
based on an atomic clock and all they'll have is TAI? So OK, maybe that
makes sense?

(We'd have to add something like the ART as the counter to pair with
over an actual PCI bus, of course.)

We can add a type field like the one you have for virtio-rtc, yes?


> 
> > + */
> > +
> > +#ifndef __VMCLOCK_H__
> > +#define __VMCLOCK_H__
> > +
> > +#ifdef __KERNEL
> > +#include <linux/types.h>
> > +#else
> > +#include <stdint.h>
> > +#endif
> > +
> > +struct vmclock_abi {
> > +       uint32_t magic;
> > +#define VMCLOCK_MAGIC  0x4b4c4356 /* "VCLK" */
> > +       uint16_t size;          /* Size of page containing this structure */
> > +       uint16_t version;       /* 1 */
> > +
> > +       /* Sequence lock. Low bit means an update is in progress. */
> > +       uint64_t seq_count;
> > +
> > +       /*
> > +        * This field changes to another non-repeating value when the CPU
> > +        * counter is disrupted, for example on live migration.
> > +        */
> > +       uint64_t disruption_marker;
> 
> The field could also change when the clock is stepped (leap seconds
> excepted), or when the clock frequency is slewed.

I'm not sure. The concept of the disruption marker is that it tells the
guest to throw away any calibration of the counter that the guest has
done for *itself* (with NTP, other PTP devices, etc.).

One mode for this device would be not to populate the clock fields at
all, but *only* to signal disruption when it occurs. So the guest can
abort transactions until it's resynced its clocks (to avoid incurring
fines if breaking databases, etc.).

Exposing the host timekeeping through the structure means that the
migrated guest can keep working because it can trust the timekeeping
performed by the (new) host and exposed to it.

If the counter is actually varying in frequency over time, and the host
is slewing the clock frequency that it reports, that *isn't* a step
change and doesn't mean that the guest should throw away any
calibration that it's been doing for itself. One hopes that the guest
would have detected the *same* frequency change, and be adapting for
itself. So I don't think that should indicate a disruption.

I think the same is even true if the clock is stepped by the host. The
actual *counter* hasn't changed, so the guest is better off ignoring
the vacillating host and continuing to derive its idea of time from the
hardware counter itself, as calibrated against some external NTP/PTP
sources. Surely we actively *don't* to tell the guest to throw its own
calibrations away, in this case?




Download attachment "smime.p7s" of type "application/pkcs7-signature" (5965 bytes)