linux-kernel - Re: [PATCH v2] KVM: x86/xen: improve accuracy of Xen timers

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5e0598c86361570674401f43191c3f819a6b32d2.camel@infradead.org>
Date:   Wed, 08 Nov 2023 15:25:29 +0000
From:   David Woodhouse <dwmw2@...radead.org>
To:     Dongli Zhang <dongli.zhang@...cle.com>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org
Cc:     Paul Durrant <paul@....org>,
        Sean Christopherson <seanjc@...gle.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
        "H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH v2] KVM: x86/xen: improve accuracy of Xen timers

On Tue, 2023-11-07 at 17:43 -0800, Dongli Zhang wrote:
> Hi David,
> 
> On 11/7/23 15:24, David Woodhouse wrote:
> > On Tue, 2023-11-07 at 15:07 -0800, Dongli Zhang wrote:
> > > Thank you very much for the detailed explanation.
> > > 
> > > I agree it is important to resolve the "now" problem. I guess the KVM lapic
> > > deadline timer has the "now" problem as well.
> > 
> > I think so. And quite gratuitously so, since it just does:
> > 
> >         now = ktime_get();
> >         guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > 
> > 
> > Couldn't that trivially be changed to kvm_get_monotonic_and_clockread()?
> 
> The core idea is to always capture the pair of (tsc, ns) at exactly the same
> time point.
> 
> I have no idea how much accuracy it can improve, considering the extra costs to
> inject the timer interrupt into the vCPU.

Right. It's probably in the noise most of the time, unless you're
unlucky enough to get preempted between the two TSC reads which are
supposed to be happening "at the same time".
> 

> > I conveniently ducked this question in my patch by only supporting the
> > CONSTANT_TSC case, and not the case where we happen to know the
> > (potentially different) TSC frequencies on all the different pCPUs and
> > vCPUs.
> 
> This is also my question that why to support only the CONSTANT_TSC case.
> 
> For the lapic timer case:
> 
> The timer is always calculated based on the *current* vCPU's tsc virtualization,
> regardless CONSTANT_TSC or not.
> 
> For the xen timer case:
> 
> Why not always calculate the expire based on the *current* vCPU's time
> virtualization? That is, why not always use the current vCPU's hv_clock,
> regardless CONSTANT_TSC/masteclock?

The simple answer is because I wasn't sure it would work correctly in
all cases, and didn't *care* enough about the non-CONSTANT_TSC case to
prove it to myself.

Let's think about it...

In the non-CONSTANT_TSC case, each physical CPU can have a different
TSC frequency, yes? And KVM has a cpufreq notifier which triggers when
the TSC changes, and make a KVM_REQ_CLOCK_UPDATE request to any vCPU
running on the affected pCPU. With an associated IPI to ensure the vCPU
exits guest mode and will processes the update before executing any
further guest code.

If a vCPU had *previously* been running on the affected pCPU but wasn't
running when the notifier happened, then kvm_arch_vcpu_load() will make
a KVM_REQ_GLOBAL_CLOCK_UPDATE request, which will immediately update
the vCPU in question, and then trigger a deferred KVM_REQ_CLOCK_UPDATE
for the others.

So the vCPU itself, in guest mode, is always going to see *correct*
pvclock information corresponding to the pCPU it is running on at the
time.

(I *believe* the way this works is that when a vCPU runs on a pCPU
which has a TSC frequency lower than the vCPU should have, it runs in
'always catchup' mode. Where the TSC offset is bumped *every* time the
vCPU enters guest mode, so the TSC is about right on every entry, might
seem to run a little slow if the vCPU does a tight loop of rdtsc, but
will catch up again on next vmexit/entry?)

But we aren't talking about the vCPU running in guest mode. The code in
kvm_xen_start_timer() and in start_sw_tscdeadline() is running in the
host kernel. How can we be sure that it's running on the *same*
physical CPU that the vCPU was previously running on, and thus how can
we be sure that the vcpu->arch.hv_clock is valid with respect to a
rdtsc on the current pCPU? I don't know that we can know that.

As far as I can tell, the code in start_sw_tscdeadline() makes no
attempt to do the 'catchup' thing, and just converts the pCPU's TSC to
guest TSC using kvm_read_l1_tsc() — which uses a multiplier that's set
once and *never* recalculated when the host TSC frequency changes.

On the whole, now I *have* thought about it, I'm even more convinced I
was right in the first place that I didn't want to know :)

I think I stand by my original decision that the Xen timer code in the
non-CONSTANT_TSC case can just use get_kvmclock_ns(). The "now" problem
is going to be in the noise if the TSC isn't constant anyway, and we
need to fix the drift and jumps of get_kvmclock_ns() *anyway* rather
than adding a temporary special case for the Xen timers.

> That is: kvm lapic method with kvm_get_monotonic_and_clockread().
> 
> > 
> > 
> > > 
> > > E.g., according to the KVM lapic deadline timer, all values are based on (1) the
> > > tsc value, (2)on the current vCPU.
> > > 
> > > 
> > > 1949 static void start_sw_tscdeadline(struct kvm_lapic *apic)
> > > 1950 {
> > > 1951         struct kvm_timer *ktimer = &apic->lapic_timer;
> > > 1952         u64 guest_tsc, tscdeadline = ktimer->tscdeadline;
> > > 1953         u64 ns = 0;
> > > 1954         ktime_t expire;
> > > 1955         struct kvm_vcpu *vcpu = apic->vcpu;
> > > 1956         unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz;
> > > 1957         unsigned long flags;
> > > 1958         ktime_t now;
> > > 1959
> > > 1960         if (unlikely(!tscdeadline || !this_tsc_khz))
> > > 1961                 return;
> > > 1962
> > > 1963         local_irq_save(flags);
> > > 1964
> > > 1965         now = ktime_get();
> > > 1966         guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > > 1967
> > > 1968         ns = (tscdeadline - guest_tsc) * 1000000ULL;
> > > 1969         do_div(ns, this_tsc_khz);
> > > 
> > > 
> > > Sorry if I make the question very confusing. The core question is: where and
> > > from which clocksource the abs nanosecond value is from? What will happen if the
> > > Xen VM uses HPET as clocksource, while xen timer as clock event?
> > 
> > If the guest uses HPET as clocksource and Xen timer as clockevents,
> > then keeping itself in sync is the *guest's* problem. The Xen timer is
> > defined in terms of nanoseconds since guest start, as provided in the
> > pvclock information described above. Hope that helps!
> > 
> 
> The "in terms of nanoseconds since guest start" refers to *one* global value.
> Should we use wallclock when we are referring to a global value shared by all vCPUs?
> 
> 
> Based on the following piece of code, I do not think we may assume all vCPUs
> have the same pvclock at the same time point: line 104-108, when
> PVCLOCK_TSC_STABLE_BIT is not set.
> 

The *result* of calculating the pvclock should be the same on all vCPUs
at any given moment in time.

The precise *calculation* may differ, depending on the frequency of the
TSC for that particular vCPU and the last time the pvclock information
was created for that vCPU.


> 
>  67 static __always_inline
>  68 u64 __pvclock_clocksource_read(struct pvclock_vcpu_time_info *src, bool dowd)
>  69 {
>  70         unsigned version;
>  71         u64 ret;
>  72         u64 last;
>  73         u8 flags;
>  74
>  75         do {
>  76                 version = pvclock_read_begin(src);
>  77                 ret = __pvclock_read_cycles(src, rdtsc_ordered());
>  78                 flags = src->flags;
>  79         } while (pvclock_read_retry(src, version));
> ... ...
> 104         last = raw_atomic64_read(&last_value);
> 105         do {
> 106                 if (ret <= last)
> 107                         return last;
> 108         } while (!raw_atomic64_try_cmpxchg(&last_value, &last, ret));
> 109
> 110         return ret;
> 111 }
> 
> 
> That's why I appreciate a definition of the abs nanoseconds used by the xen
> timer (e.g., derived from pvclock). If it is per-vCPU, we may not use it for a
> global "in terms of nanoseconds since guest start", when PVCLOCK_TSC_STABLE_BIT
> is not set.

It is only per-vCPU if the vCPUs have *different* TSC frequencies.
That's because of the scaling; the guest calculates the nanoseconds
from the *guest* TSC of course, scaled according to the pvclock
information given to the guest by KVM.

As discussed and demonstrated by http://david.woodhou.se/tsdrift.c , if
KVM scales directly to nanoseconds from the *host* TSC at its known
frequency, that introduces a systemic drift between what the guest
calculates, and what KVM calculates — even in the CONSTANT_TSC case.

How do we reconcile the two? Well, it makes no sense for the definition
of the pvclock to be something that the guest *cannot* calculate, so
obviously KVM must do the same calculations the guest does; scale to
the guest TSC (kvm_read_l1_tsc()) and then apply the same pvclock
information from vcpu->arch.hvclock to get the nanoseconds.

In the sane world where the guest vCPUs all have the *same* TSC
frequency, that's fine. The kvmclock isn't *really* per-vCPU because
they're all the same. 

If the VMM sets up different vCPUs to have different TSC frequencies
then yes, their kvmclock will drift slightly apart over time. That
might be the *one* case where I will accept that the guest pvclock
might ever change, even in the CONSTANT_TSC environment (without host
suspend or any other nonsense).

In that patch I started typing on Monday and *still* haven't got round
to finishing because other things keep catching fire, I'm using the
*KVM-wide* guest TSC frequency as the definition for the kvmclock.



Download attachment "smime.p7s" of type "application/pkcs7-signature" (5965 bytes)