linux-kernel - Re: [PATCH v2] KVM: x86/xen: improve accuracy of Xen timers

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c6cfae1a-e687-ae7d-12ae-38fb97923082@oracle.com>
Date:   Wed, 8 Nov 2023 10:22:51 -0800
From:   Dongli Zhang <dongli.zhang@...cle.com>
To:     David Woodhouse <dwmw2@...radead.org>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org
Cc:     Paul Durrant <paul@....org>,
        Sean Christopherson <seanjc@...gle.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
        "H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH v2] KVM: x86/xen: improve accuracy of Xen timers

Hi David,

On 11/8/23 07:25, David Woodhouse wrote:
> On Tue, 2023-11-07 at 17:43 -0800, Dongli Zhang wrote:
>> Hi David,
>>
>> On 11/7/23 15:24, David Woodhouse wrote:
>>> On Tue, 2023-11-07 at 15:07 -0800, Dongli Zhang wrote:
>>>> Thank you very much for the detailed explanation.
>>>>
>>>> I agree it is important to resolve the "now" problem. I guess the KVM lapic
>>>> deadline timer has the "now" problem as well.
>>>
>>> I think so. And quite gratuitously so, since it just does:
>>>
>>>         now = ktime_get();
>>>         guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>>>
>>>
>>> Couldn't that trivially be changed to kvm_get_monotonic_and_clockread()?
>>
>> The core idea is to always capture the pair of (tsc, ns) at exactly the same
>> time point.
>>
>> I have no idea how much accuracy it can improve, considering the extra costs to
>> inject the timer interrupt into the vCPU.
> 
> Right. It's probably in the noise most of the time, unless you're
> unlucky enough to get preempted between the two TSC reads which are
> supposed to be happening "at the same time".
>>
> 
>>> I conveniently ducked this question in my patch by only supporting the
>>> CONSTANT_TSC case, and not the case where we happen to know the
>>> (potentially different) TSC frequencies on all the different pCPUs and
>>> vCPUs.
>>
>> This is also my question that why to support only the CONSTANT_TSC case.
>>
>> For the lapic timer case:
>>
>> The timer is always calculated based on the *current* vCPU's tsc virtualization,
>> regardless CONSTANT_TSC or not.
>>
>> For the xen timer case:
>>
>> Why not always calculate the expire based on the *current* vCPU's time
>> virtualization? That is, why not always use the current vCPU's hv_clock,
>> regardless CONSTANT_TSC/masteclock?
> 
> The simple answer is because I wasn't sure it would work correctly in
> all cases, and didn't *care* enough about the non-CONSTANT_TSC case to
> prove it to myself.
> 
> Let's think about it...
> 
> In the non-CONSTANT_TSC case, each physical CPU can have a different
> TSC frequency, yes? And KVM has a cpufreq notifier which triggers when
> the TSC changes, and make a KVM_REQ_CLOCK_UPDATE request to any vCPU
> running on the affected pCPU. With an associated IPI to ensure the vCPU
> exits guest mode and will processes the update before executing any
> further guest code.
> 
> If a vCPU had *previously* been running on the affected pCPU but wasn't
> running when the notifier happened, then kvm_arch_vcpu_load() will make
> a KVM_REQ_GLOBAL_CLOCK_UPDATE request, which will immediately update
> the vCPU in question, and then trigger a deferred KVM_REQ_CLOCK_UPDATE
> for the others.
> 
> So the vCPU itself, in guest mode, is always going to see *correct*
> pvclock information corresponding to the pCPU it is running on at the
> time.
> 
> (I *believe* the way this works is that when a vCPU runs on a pCPU
> which has a TSC frequency lower than the vCPU should have, it runs in
> 'always catchup' mode. Where the TSC offset is bumped *every* time the
> vCPU enters guest mode, so the TSC is about right on every entry, might
> seem to run a little slow if the vCPU does a tight loop of rdtsc, but
> will catch up again on next vmexit/entry?)
> 
> But we aren't talking about the vCPU running in guest mode. The code in
> kvm_xen_start_timer() and in start_sw_tscdeadline() is running in the
> host kernel. How can we be sure that it's running on the *same*
> physical CPU that the vCPU was previously running on, and thus how can
> we be sure that the vcpu->arch.hv_clock is valid with respect to a
> rdtsc on the current pCPU? I don't know that we can know that.
> 
> As far as I can tell, the code in start_sw_tscdeadline() makes no
> attempt to do the 'catchup' thing, and just converts the pCPU's TSC to
> guest TSC using kvm_read_l1_tsc() — which uses a multiplier that's set
> once and *never* recalculated when the host TSC frequency changes.
> 
> On the whole, now I *have* thought about it, I'm even more convinced I
> was right in the first place that I didn't want to know :)
> 
> I think I stand by my original decision that the Xen timer code in the
> non-CONSTANT_TSC case can just use get_kvmclock_ns(). The "now" problem
> is going to be in the noise if the TSC isn't constant anyway, and we
> need to fix the drift and jumps of get_kvmclock_ns() *anyway* rather
> than adding a temporary special case for the Xen timers.
> 
>> That is: kvm lapic method with kvm_get_monotonic_and_clockread().
>>
>>>
>>>
>>>>
>>>> E.g., according to the KVM lapic deadline timer, all values are based on (1) the
>>>> tsc value, (2)on the current vCPU.
>>>>
>>>>
>>>> 1949 static void start_sw_tscdeadline(struct kvm_lapic *apic)
>>>> 1950 {
>>>> 1951         struct kvm_timer *ktimer = &apic->lapic_timer;
>>>> 1952         u64 guest_tsc, tscdeadline = ktimer->tscdeadline;
>>>> 1953         u64 ns = 0;
>>>> 1954         ktime_t expire;
>>>> 1955         struct kvm_vcpu *vcpu = apic->vcpu;
>>>> 1956         unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz;
>>>> 1957         unsigned long flags;
>>>> 1958         ktime_t now;
>>>> 1959
>>>> 1960         if (unlikely(!tscdeadline || !this_tsc_khz))
>>>> 1961                 return;
>>>> 1962
>>>> 1963         local_irq_save(flags);
>>>> 1964
>>>> 1965         now = ktime_get();
>>>> 1966         guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>>>> 1967
>>>> 1968         ns = (tscdeadline - guest_tsc) * 1000000ULL;
>>>> 1969         do_div(ns, this_tsc_khz);
>>>>
>>>>
>>>> Sorry if I make the question very confusing. The core question is: where and
>>>> from which clocksource the abs nanosecond value is from? What will happen if the
>>>> Xen VM uses HPET as clocksource, while xen timer as clock event?
>>>
>>> If the guest uses HPET as clocksource and Xen timer as clockevents,
>>> then keeping itself in sync is the *guest's* problem. The Xen timer is
>>> defined in terms of nanoseconds since guest start, as provided in the
>>> pvclock information described above. Hope that helps!
>>>
>>
>> The "in terms of nanoseconds since guest start" refers to *one* global value.
>> Should we use wallclock when we are referring to a global value shared by all vCPUs?
>>
>>
>> Based on the following piece of code, I do not think we may assume all vCPUs
>> have the same pvclock at the same time point: line 104-108, when
>> PVCLOCK_TSC_STABLE_BIT is not set.
>>
> 
> The *result* of calculating the pvclock should be the same on all vCPUs
> at any given moment in time.
> 
> The precise *calculation* may differ, depending on the frequency of the
> TSC for that particular vCPU and the last time the pvclock information
> was created for that vCPU.
> 
> 
>>
>>  67 static __always_inline
>>  68 u64 __pvclock_clocksource_read(struct pvclock_vcpu_time_info *src, bool dowd)
>>  69 {
>>  70         unsigned version;
>>  71         u64 ret;
>>  72         u64 last;
>>  73         u8 flags;
>>  74
>>  75         do {
>>  76                 version = pvclock_read_begin(src);
>>  77                 ret = __pvclock_read_cycles(src, rdtsc_ordered());
>>  78                 flags = src->flags;
>>  79         } while (pvclock_read_retry(src, version));
>> ... ...
>> 104         last = raw_atomic64_read(&last_value);
>> 105         do {
>> 106                 if (ret <= last)
>> 107                         return last;
>> 108         } while (!raw_atomic64_try_cmpxchg(&last_value, &last, ret));
>> 109
>> 110         return ret;
>> 111 }
>>
>>
>> That's why I appreciate a definition of the abs nanoseconds used by the xen
>> timer (e.g., derived from pvclock). If it is per-vCPU, we may not use it for a
>> global "in terms of nanoseconds since guest start", when PVCLOCK_TSC_STABLE_BIT
>> is not set.
> 
> It is only per-vCPU if the vCPUs have *different* TSC frequencies.
> That's because of the scaling; the guest calculates the nanoseconds
> from the *guest* TSC of course, scaled according to the pvclock
> information given to the guest by KVM.
> 
> As discussed and demonstrated by http://david.woodhou.se/tsdrift.c , if
> KVM scales directly to nanoseconds from the *host* TSC at its known
> frequency, that introduces a systemic drift between what the guest
> calculates, and what KVM calculates — even in the CONSTANT_TSC case.
> 
> How do we reconcile the two? Well, it makes no sense for the definition
> of the pvclock to be something that the guest *cannot* calculate, so
> obviously KVM must do the same calculations the guest does; scale to
> the guest TSC (kvm_read_l1_tsc()) and then apply the same pvclock
> information from vcpu->arch.hvclock to get the nanoseconds.
> 
> In the sane world where the guest vCPUs all have the *same* TSC
> frequency, that's fine. The kvmclock isn't *really* per-vCPU because
> they're all the same. 
> 
> If the VMM sets up different vCPUs to have different TSC frequencies
> then yes, their kvmclock will drift slightly apart over time. That
> might be the *one* case where I will accept that the guest pvclock
> might ever change, even in the CONSTANT_TSC environment (without host
> suspend or any other nonsense).
> 
> In that patch I started typing on Monday and *still* haven't got round
> to finishing because other things keep catching fire, I'm using the
> *KVM-wide* guest TSC frequency as the definition for the kvmclock.
> 
> 

Thank you very much for the explanation.

I understand you may use different methods to obtain the 'expire' under
different cases.

Maybe add some comments in the KVM code of xen timer emulation? E.g.:

- When the TSC is reliable, follow the standard/protocol that xen timer is
per-vCPU pvclock based: that is, to always scale host_tsc with kvm_read_l1_tsc().

- However, sometimes TSC is not reliable. Use the legacy method get_kvmclock_ns().

This may help developers understand the standard/protocol used by xen timer. The
core idea will be: the implementation is trying to following the xen timer
nanoseconds definition (per-vCPU pvclock), and it may use other legacy solution
under special case, in order to improve the accuracy.

TBH, I never think about what the definition of nanosecond is in xen timer (even
I used to and I am still working on some xen issue).

Thank you very much!

Dongli Zhang