linux-kernel - Re: [PATCH v2] KVM: x86/xen: improve accuracy of Xen timers

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <33907a83-4e1a-f121-74f3-bde1e68b047c@oracle.com>
Date:   Tue, 7 Nov 2023 17:43:16 -0800
From:   Dongli Zhang <dongli.zhang@...cle.com>
To:     David Woodhouse <dwmw2@...radead.org>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org
Cc:     Paul Durrant <paul@....org>,
        Sean Christopherson <seanjc@...gle.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
        "H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH v2] KVM: x86/xen: improve accuracy of Xen timers

Hi David,

On 11/7/23 15:24, David Woodhouse wrote:
> On Tue, 2023-11-07 at 15:07 -0800, Dongli Zhang wrote:
>> Thank you very much for the detailed explanation.
>>
>> I agree it is important to resolve the "now" problem. I guess the KVM lapic
>> deadline timer has the "now" problem as well.
> 
> I think so. And quite gratuitously so, since it just does:
> 
> 	now = ktime_get();
> 	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> 
> 
> Couldn't that trivially be changed to kvm_get_monotonic_and_clockread()?

The core idea is to always capture the pair of (tsc, ns) at exactly the same
time point.

I have no idea how much accuracy it can improve, considering the extra costs to
inject the timer interrupt into the vCPU.

> 
> Thankfully, it's defined in the time domain of the guest TSC, not the
> kvmclock, so it doesn't suffer the same drift issue as the Xen timer.
> 
>> I just notice my question missed a key prerequisite:
>>
>> Would you mind helping explain the time domain of the "oneshot.timeout_abs_ns"?
>>
>> While it is the absolute nanosecond value at the VM side, on which time domain
>> it is based?
> 
> It's the kvmclock. Xen offers as Xen PV clock to its guests using
> *precisely* the same pvclock structure as KVM does. 
> 
> 
>> 1. Is oneshot.timeout_abs_ns based on the xen pvclock (freq=NSEC_PER_SEC)?
>>
>> 2. Is oneshot.timeout_abs_ns based on tsc from VM side?
>>
>> 3. Is oneshot.timeout_abs_ns based on monotonic/raw clock at VM side?
>>
>> 4. Or it is based on wallclock?
>>
>> I think the OS does not have a concept of nanoseconds. It is derived from a
>> clocksource.
> 
> It's the kvmclock.

Thank you very much!

Both xen clock and kvm clock are pvclock based on the same equations.

> 
> The guest derives it from the guest TSC using the pvclock information
> (mul/shift/offset) that KVM provides to the guest.
> 
> The kvm_setup_guest_pvclock() function is potentially called *three*
> times from kvm_guest_time_update(). Once for the KVM pv time MSR, once
> for the pvclock structure in the Xen vcpu_info, and finally for the
> pvclock structure which Xen makes available to userspace for vDSO
> timekeeping.
> 
>> If it is based on pvclock, is it based on the pvclock from a specific vCPU, as
>> both pvclock and timer are per-vCPU.
> 
> Yes, it is per-vCPU. Although in the sane case the TSCs on all vCPUs
> will match and the mul/shift/offset provided by KVM won't actually
> differ. Even in the insane case where guest TSCs are out of sync,
> surely the pvclock information will differ only in order to ensure that
> the *result* in nanoseconds does not?
> 
> I conveniently ducked this question in my patch by only supporting the
> CONSTANT_TSC case, and not the case where we happen to know the
> (potentially different) TSC frequencies on all the different pCPUs and
> vCPUs.

This is also my question that why to support only the CONSTANT_TSC case.

For the lapic timer case:

The timer is always calculated based on the *current* vCPU's tsc virtualization,
regardless CONSTANT_TSC or not.

For the xen timer case:

Why not always calculate the expire based on the *current* vCPU's time
virtualization? That is, why not always use the current vCPU's hv_clock,
regardless CONSTANT_TSC/masteclock?

That is: kvm lapic method with kvm_get_monotonic_and_clockread().

> 
> 
>>
>> E.g., according to the KVM lapic deadline timer, all values are based on (1) the
>> tsc value, (2)on the current vCPU.
>>
>>
>> 1949 static void start_sw_tscdeadline(struct kvm_lapic *apic)
>> 1950 {
>> 1951         struct kvm_timer *ktimer = &apic->lapic_timer;
>> 1952         u64 guest_tsc, tscdeadline = ktimer->tscdeadline;
>> 1953         u64 ns = 0;
>> 1954         ktime_t expire;
>> 1955         struct kvm_vcpu *vcpu = apic->vcpu;
>> 1956         unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz;
>> 1957         unsigned long flags;
>> 1958         ktime_t now;
>> 1959
>> 1960         if (unlikely(!tscdeadline || !this_tsc_khz))
>> 1961                 return;
>> 1962
>> 1963         local_irq_save(flags);
>> 1964
>> 1965         now = ktime_get();
>> 1966         guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>> 1967
>> 1968         ns = (tscdeadline - guest_tsc) * 1000000ULL;
>> 1969         do_div(ns, this_tsc_khz);
>>
>>
>> Sorry if I make the question very confusing. The core question is: where and
>> from which clocksource the abs nanosecond value is from? What will happen if the
>> Xen VM uses HPET as clocksource, while xen timer as clock event?
> 
> If the guest uses HPET as clocksource and Xen timer as clockevents,
> then keeping itself in sync is the *guest's* problem. The Xen timer is
> defined in terms of nanoseconds since guest start, as provided in the
> pvclock information described above. Hope that helps!
>

The "in terms of nanoseconds since guest start" refers to *one* global value.
Should we use wallclock when we are referring to a global value shared by all vCPUs?


Based on the following piece of code, I do not think we may assume all vCPUs
have the same pvclock at the same time point: line 104-108, when
PVCLOCK_TSC_STABLE_BIT is not set.


 67 static __always_inline
 68 u64 __pvclock_clocksource_read(struct pvclock_vcpu_time_info *src, bool dowd)
 69 {
 70         unsigned version;
 71         u64 ret;
 72         u64 last;
 73         u8 flags;
 74
 75         do {
 76                 version = pvclock_read_begin(src);
 77                 ret = __pvclock_read_cycles(src, rdtsc_ordered());
 78                 flags = src->flags;
 79         } while (pvclock_read_retry(src, version));
... ...
104         last = raw_atomic64_read(&last_value);
105         do {
106                 if (ret <= last)
107                         return last;
108         } while (!raw_atomic64_try_cmpxchg(&last_value, &last, ret));
109
110         return ret;
111 }


That's why I appreciate a definition of the abs nanoseconds used by the xen
timer (e.g., derived from pvclock). If it is per-vCPU, we may not use it for a
global "in terms of nanoseconds since guest start", when PVCLOCK_TSC_STABLE_BIT
is not set.


Thank you very much!

Dongli Zhang