Message-ID: <5bdb92ab83269b49ad8fbbe8f54df01f6b98ea8f.camel@infradead.org>
Date: Fri, 28 Feb 2025 11:23:41 +0000
From: David Woodhouse <dwmw2@...radead.org>
To: Sean Christopherson <seanjc@...gle.com>, Thomas Gleixner
<tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, Borislav Petkov
<bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>, Paolo Bonzini
<pbonzini@...hat.com>, Juergen Gross <jgross@...e.com>, "K. Y. Srinivasan"
<kys@...rosoft.com>, Haiyang Zhang <haiyangz@...rosoft.com>, Wei Liu
<wei.liu@...nel.org>, Dexuan Cui <decui@...rosoft.com>, Ajay Kaher
<ajay.kaher@...adcom.com>, Jan Kiszka <jan.kiszka@...mens.com>, Andy
Lutomirski <luto@...nel.org>, Peter Zijlstra <peterz@...radead.org>, Daniel
Lezcano <daniel.lezcano@...aro.org>, John Stultz <jstultz@...gle.com>
Cc: linux-kernel@...r.kernel.org, linux-coco@...ts.linux.dev,
kvm@...r.kernel.org, virtualization@...ts.linux.dev,
linux-hyperv@...r.kernel.org, xen-devel@...ts.xenproject.org, Tom Lendacky
<thomas.lendacky@....com>, Nikunj A Dadhania <nikunj@....com>
Subject: Re: [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC

On Wed, 2025-02-26 at 18:18 -0800, Sean Christopherson wrote:
> This... snowballed a bit.
>
> The bulk of the changes are in kvmclock and TSC, but pretty much every
> hypervisor's guest-side code gets touched at some point. I am reasonably
> confident in the correctness of the KVM changes. For all other hypervisors,
> assume it's completely broken until proven otherwise.
>
> Note, I deliberately omitted:
>
> Alexey Makhalov <alexey.amakhalov@...adcom.com>
> jailhouse-dev@...glegroups.com
>
> from the To/Cc, as those emails bounced on the last version, and I have zero
> desire to get 38*2 emails telling me an email couldn't be delivered.
>
> The primary goal of this series is (or at least was, when I started) to
> fix flaws with SNP and TDX guests where a PV clock provided by the untrusted
> hypervisor is used instead of the secure/trusted TSC that is controlled by
> trusted firmware.
>
> The secondary goal is to draft off of the SNP and TDX changes to slightly
> modernize running under KVM. Currently, KVM guests will use TSC for
> clocksource, but not sched_clock. And they ignore Intel's CPUID-based TSC
> and CPU frequency enumeration, even when using the TSC instead of kvmclock.
> And if the host provides the core crystal frequency in CPUID.0x15, then KVM
> guests can use that for the APIC timer period instead of manually calibrating
> the frequency.
>
> Lots more background on the SNP/TDX motivation:
> https://lore.kernel.org/all/20250106124633.1418972-13-nikunj@amd.com

Looks good; thanks for tackling this.

I think there are still some things from my older series at
https://lore.kernel.org/all/20240522001817.619072-1-dwmw2@infradead.org/
which this doesn't address. Specifically, the accuracy and consistency
of what KVM advertises to the guest as the KVM clock. And, more to the
point, as the Xen clock, because guests generally *know* that the KVM
clock is awful, but expect better of the Xen clock.

With a sane and consistent TSC, the mul/shift factors that KVM presents
to the guest in the kvmclock structure should basically *never* change.
Not even on live update (or live migration between hosts with the same
host TSC frequency).
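To make the mul/shift point concrete, here is a minimal sketch in C of
how a guest turns a TSC reading into nanoseconds with a pvclock-style
record. It is an illustration only, not the kernel's code. The field
names follow struct pvclock_vcpu_time_info, but only a subset is shown;
the struct name pv_time_sketch and the helpers freq_to_mul_shift(),
scale_delta() and pv_clock_ns() are made up for this example, and
unsigned __int128 assumes GCC or Clang on a 64-bit target.

```c
#include <stdint.h>

/* Subset of the pvclock record; the struct name is hypothetical. */
struct pv_time_sketch {
    uint64_t tsc_timestamp;      /* guest TSC at the reference point */
    uint64_t system_time;        /* nanoseconds at the reference point */
    uint32_t tsc_to_system_mul;  /* 0.32 fixed-point multiplier */
    int8_t   tsc_shift;          /* shift applied to the TSC delta first */
};

/*
 * Hypothetical helper: derive mul/shift from the TSC frequency alone.
 * The shifted frequency is normalised into (1e9, 2e9] so that the
 * multiplier fits in 32 bits with its top bit set.
 */
static void freq_to_mul_shift(uint64_t tsc_hz, uint32_t *mul, int8_t *shift)
{
    const uint64_t ns_per_sec = 1000000000ULL;
    uint64_t hz = tsc_hz;
    int8_t s = 0;

    if (hz == 0) {               /* avoid looping forever on bad input */
        *mul = 0;
        *shift = 0;
        return;
    }
    while (hz > 2 * ns_per_sec) {
        hz >>= 1;
        s--;
    }
    while (hz <= ns_per_sec) {
        hz <<= 1;
        s++;
    }
    *shift = s;
    *mul = (uint32_t)((ns_per_sec << 32) / hz);
}

/* Scale a TSC delta to nanoseconds: (delta shifted by 'shift') * mul / 2^32. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;
    return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* Clock value in nanoseconds for a given guest TSC reading. */
static uint64_t pv_clock_ns(const struct pv_time_sketch *pv, uint64_t guest_tsc)
{
    return pv->system_time +
           scale_delta(guest_tsc - pv->tsc_timestamp,
                       pv->tsc_to_system_mul, pv->tsc_shift);
}
```

In this sketch a 2 GHz TSC gives shift 0 and mul 2^31, so a delta of
2000 ticks maps to 1000 ns. The two factors are a function of the
frequency only, which is why they have no reason to change while the
guest TSC frequency stays the same.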

Take live update as the simple case: serializing the QEMU state and
restarting it immediately, just to update QEMU with the guest
experiencing only a few milliseconds of steal time.

The guest TSC has a fixed arithmetic relationship to the host TSC. That
should *not* change across the live update; not by a single count.

I don't believe the existing KVM APIs allow userspace to get that right;
the KVM_VCPU_TSC_SCALE ioctl in patch 7 of that series resolves this:
https://lore.kernel.org/all/20240522001817.619072-8-dwmw2@infradead.org/
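The "fixed arithmetic relationship" between host and guest TSC can be
sketched as a fixed-point scale followed by an offset. The C below is
an illustration under stated assumptions, not KVM's implementation and
not the interface of the ioctl mentioned above: struct tsc_map,
make_ratio() and host_to_guest_tsc() are hypothetical names, and the 48
fractional bits are an assumption for the example (the real width is
hardware-specific). unsigned __int128 assumes GCC or Clang on a 64-bit
target.

```c
#include <stdint.h>

/* Assumption for this sketch: the ratio has 48 fractional bits. */
#define TSC_RATIO_FRAC_BITS 48

/* Hypothetical container for the two values that define the mapping. */
struct tsc_map {
    uint64_t ratio;   /* guest_hz / host_hz in fixed point */
    uint64_t offset;  /* added after scaling; wraps modulo 2^64 */
};

/* Fixed-point ratio of guest to host frequency; host_khz must be nonzero. */
static uint64_t make_ratio(uint64_t guest_khz, uint64_t host_khz)
{
    return (uint64_t)(((unsigned __int128)guest_khz << TSC_RATIO_FRAC_BITS)
                      / host_khz);
}

/* guest_tsc = ((host_tsc * ratio) >> frac_bits) + offset */
static uint64_t host_to_guest_tsc(const struct tsc_map *m, uint64_t host_tsc)
{
    unsigned __int128 scaled = (unsigned __int128)host_tsc * m->ratio;

    return (uint64_t)(scaled >> TSC_RATIO_FRAC_BITS) + m->offset;
}
```

If the exact ratio and offset are saved before the live update and
restored afterwards, the same host TSC reading produces the same guest
TSC reading on both sides. Recomputing either value from a rounded
frequency or a fresh timestamp is where a difference of a few counts
could creep in.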

And then the KVM clock should have a fixed arithmetic relationship to
the guest TSC, which should *also* not change. Not even over live
migration: userspace should ensure the guest TSC is as accurate as
possible given NTP synchronisation between the hosts, and then the KVM
clock remains a fixed function of the guest TSC (at least, if the guest
TSC is the same frequency on source and destination). The existing KVM
API doesn't allow userspace to get *that* right either, which is
addressed by Jack's patch 3 of the series:
https://lore.kernel.org/all/20240522001817.619072-4-dwmw2@infradead.org/

The rest of the series is mostly fixing a bunch of places where KVM
gratuitously recalculates the KVM clock that it advertises to the
guest, and the fact that it does so *badly* in some cases, with a loss
of precision that causes errors in the guest. You may already have
addressed some of those; I'll go over my series and see what still
applies on top of yours.

>
> v2:
> - Add struct to hold the TSC CPUID output. [Boris]
> - Don't pointlessly inline the TSC CPUID helpers. [Boris]
> - Fix a variable goof in a helper, hopefully for real this time. [Dan]
> - Collect reviews. [Nikunj]
> - Override the sched_clock save/restore hooks if and only if a PV clock
> is successfully registered.
> - During resume, restore clocksources before reading persistent time.
> - Clean up more warts created by kvmclock.
> - Fix more bugs in kvmclock's suspend/resume handling.
> - Try to harden kvmclock against future bugs.
>
> v1: https://lore.kernel.org/all/20250201021718.699411-1-seanjc@google.com
>
> Sean Christopherson (38):
> x86/tsc: Add a standalone helpers for getting TSC info from CPUID.0x15
> x86/tsc: Add standalone helper for getting CPU frequency from CPUID
> x86/tsc: Add helper to register CPU and TSC freq calibration routines
> x86/sev: Mark TSC as reliable when configuring Secure TSC
> x86/sev: Move check for SNP Secure TSC support to tsc_early_init()
> x86/tdx: Override PV calibration routines with CPUID-based calibration
> x86/acrn: Mark TSC frequency as known when using ACRN for calibration
> clocksource: hyper-v: Register sched_clock save/restore iff it's
> necessary
> clocksource: hyper-v: Drop wrappers to sched_clock save/restore
> helpers
> clocksource: hyper-v: Don't save/restore TSC offset when using HV
> sched_clock
> x86/kvmclock: Setup kvmclock for secondary CPUs iff CONFIG_SMP=y
> x86/kvm: Don't disable kvmclock on BSP in syscore_suspend()
> x86/paravirt: Move handling of unstable PV clocks into
> paravirt_set_sched_clock()
> x86/kvmclock: Move sched_clock save/restore helpers up in kvmclock.c
> x86/xen/time: Nullify x86_platform's sched_clock save/restore hooks
> x86/vmware: Nullify save/restore hooks when using VMware's sched_clock
> x86/tsc: WARN if TSC sched_clock save/restore used with PV sched_clock
> x86/paravirt: Pass sched_clock save/restore helpers during
> registration
> x86/kvmclock: Move kvm_sched_clock_init() down in kvmclock.c
> x86/xen/time: Mark xen_setup_vsyscall_time_info() as __init
> x86/pvclock: Mark setup helpers and related various as
> __init/__ro_after_init
> x86/pvclock: WARN if pvclock's valid_flags are overwritten
> x86/kvmclock: Refactor handling of PVCLOCK_TSC_STABLE_BIT during
> kvmclock_init()
> timekeeping: Resume clocksources before reading persistent clock
> x86/kvmclock: Hook clocksource.suspend/resume when kvmclock isn't
> sched_clock
> x86/kvmclock: WARN if wall clock is read while kvmclock is suspended
> x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't
> sched_clock
> x86/paravirt: Mark __paravirt_set_sched_clock() as __init
> x86/paravirt: Plumb a return code into __paravirt_set_sched_clock()
> x86/paravirt: Don't use a PV sched_clock in CoCo guests with trusted
> TSC
> x86/tsc: Pass KNOWN_FREQ and RELIABLE as params to registration
> x86/tsc: Rejects attempts to override TSC calibration with lesser
> routine
> x86/kvmclock: Mark TSC as reliable when it's constant and nonstop
> x86/kvmclock: Get CPU base frequency from CPUID when it's available
> x86/kvmclock: Get TSC frequency from CPUID when its available
> x86/kvmclock: Stuff local APIC bus period when core crystal freq comes
> from CPUID
> x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
> x86/paravirt: kvmclock: Setup kvmclock early iff it's sched_clock
>
> arch/x86/coco/sev/core.c | 9 +-
> arch/x86/coco/tdx/tdx.c | 27 ++-
> arch/x86/include/asm/kvm_para.h | 10 +-
> arch/x86/include/asm/paravirt.h | 16 +-
> arch/x86/include/asm/tdx.h | 2 +
> arch/x86/include/asm/tsc.h | 20 +++
> arch/x86/include/asm/x86_init.h | 2 -
> arch/x86/kernel/cpu/acrn.c | 5 +-
> arch/x86/kernel/cpu/mshyperv.c | 69 +-------
> arch/x86/kernel/cpu/vmware.c | 11 +-
> arch/x86/kernel/jailhouse.c | 6 +-
> arch/x86/kernel/kvm.c | 39 +++--
> arch/x86/kernel/kvmclock.c | 260 +++++++++++++++++++++--------
> arch/x86/kernel/paravirt.c | 35 +++-
> arch/x86/kernel/pvclock.c | 9 +-
> arch/x86/kernel/smpboot.c | 2 +-
> arch/x86/kernel/tsc.c | 141 ++++++++++++----
> arch/x86/kernel/x86_init.c | 1 -
> arch/x86/mm/mem_encrypt_amd.c | 3 -
> arch/x86/xen/time.c | 13 +-
> drivers/clocksource/hyperv_timer.c | 38 +++--
> include/clocksource/hyperv_timer.h | 2 -
> kernel/time/timekeeping.c | 9 +-
> 23 files changed, 487 insertions(+), 242 deletions(-)
>
>
> base-commit: a64dcfb451e254085a7daee5fe51bf22959d52d3