linux-kernel - Re: [PATCH v2] KVM: x86/xen: improve accuracy of Xen timers

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <19f8de0a-17f7-1a25-f2e9-adbf00ecb035@oracle.com>
Date:   Tue, 7 Nov 2023 15:07:44 -0800
From:   Dongli Zhang <dongli.zhang@...cle.com>
To:     David Woodhouse <dwmw2@...radead.org>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org
Cc:     Paul Durrant <paul@....org>,
        Sean Christopherson <seanjc@...gle.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
        "H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH v2] KVM: x86/xen: improve accuracy of Xen timers

Hi David,

On 11/7/23 00:17, David Woodhouse wrote:
> On Mon, 2023-11-06 at 17:44 -0800, Dongli Zhang wrote:
>>> +       if (vcpu->arch.hv_clock.version && vcpu->kvm->arch.use_master_clock &&
>>> +           static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
>>
>> If there any reason to use both vcpu->kvm->arch.use_master_clock and
>> X86_FEATURE_CONSTANT_TSC?
> 
> Er, paranoia? I'll recheck.
> 
>> I think even __get_kvmclock() would not require both cases at the same time?
>>
>>  3071         if (ka->use_master_clock &&
>>  3072             (static_cpu_has(X86_FEATURE_CONSTANT_TSC) || __this_cpu_read(cpu_tsc_khz))) {
>>
> 
> But it does. That requires ka->use_master_clock (line 3071) AND that we
> know the current CPU's TSC frequency (line 3072).
> 
> My code insists on the CONSTANT_TSC form of "knowing the current CPU's
> TSC frequency" because even with a get_cpu(), it's not clear the guest
> *was* running on this vCPU when it did its calculations. So I don't
> want to go anywhere near the !CONSTANT_TSC case; it can use the
> fallback.
> 
> 
>>> +       } else {
>>> +               /*
>>> +                * Without CONSTANT_TSC, get_kvmclock_ns() is the only option.
>>> +                *
>>> +                * Also if the guest PV clock hasn't been set up yet, as is
>>> +                * likely to be the case during migration when the vCPU has
>>> +                * not been run yet. It would be possible to calculate the
>>> +                * scaling factors properly in that case but there's not much
>>> +                * point in doing so. The get_kvmclock_ns() drift accumulates
>>> +                * over time, so it's OK to use it at startup. Besides, on
>>> +                * migration there's going to be a little bit of skew in the
>>> +                * precise moment at which timers fire anyway. Often they'll
>>> +                * be in the "past" by the time the VM is running again after
>>> +                * migration.
>>> +                */
>>> +               guest_now = get_kvmclock_ns(vcpu->kvm);
>>> +               kernel_now = ktime_get();
>>
>> 1. Can I assume the issue is still there if we fall into the "else" case? That
>> is, the increasing inaccuracy as the VM has been up for longer and longer time?
>>
>> If that is the case, which may be better?
>>
>> (1) get_kvmclock_ns(), or
>> (2) always get_kvmclock_base_ns() + ka->kvmclock_offset, when pvclock is not
>> enabled, regardless whether master clock is used. At least, the inaccurary is
>> not going to increase over guest time?
> 
> No, those are both wrong, and drifting further away over time. They are
> each *differently* wrong, which is why periodically clamping (1) to (2)
> is also broken, as previously discussed. I know you've got a patch to
> do that clamping more *often* which would slightly reduce the pain
> because the kvmclock wouldn't jump backwards so *far* each time... but
> it's still wrong to do so at all (in either direction).
> 
>>
>> 2. I see 3 scenarios here:
>>
>> (1) vcpu->arch.hv_clock.version and master clock is used.
>>
>> In this case, the bugfix looks good.
>>
>> (2) The master clock is used. However, pv clock is not enabled.
>>
>> In this case, the bug is not resolved? ... even the master clock is used.
> 
> Under Xen the PV clock is always enabled. It's in the vcpu_info
> structure which is required for any kind of event channel setup.
> 
>>
>> (3) The master clock is not used.
>>
>> We fall into get_kvmclock_base_ns() + ka->kvmclock_offset. The behavior is not
>> changed. This looks good.
>>
>>
>> Just from my own point: as this patch involves relatively complex changes, I
>> would suggest resolve the issue, but not use a temporary solution :)
>>
> 
> This is the conversation I had with Paul on Tuesday, when he wanted me
> to fix up this "(3) / behaviour is not changed" case. And yes, I argued
> that we *don't* want a temporary solution for this case. Because yes:
> 
>> (I see you mentioned that you will be back with get_kvmclock_ns())
> 
> We need to fix get_kvmclock_ns() anyway. The systemic drift *and* the
> fact that we periodically clamp it to a different clock and make it
> jump. I was working on the former and have something half-done but was
> preempted by the realisation that the QEMU soft freeze is today, and I
> needed to flush my QEMU patch queue.
> 
> But even once we fix get_kvmclock_ns(), *this* patch stands. Because it
> *also* addresses the "now" problem, where we get the time by one clock
> ... and then some time passes ... and we get the time by another clock,
> and subtract one from the other as if they were the *same* time.
> 
> Using kvm_get_monotonic_and_clockread() gives us a single TSC read
> corresponding to the CLOCK_MONOTONIC time, from which we can calculate
> the kvmclock time. We just *happen* to calculate it correctly here,
> unlike anywhere else in KVM.
> 
>> Based on your bug fix, I see the below cases:
>>
>> If master clock is not used:
>>     get_kvmclock_base_ns() + ka->kvmclock_offset
>>
>> If master clock is used:
>>     If pvclock is enabled:
>>         use the &vcpu->arch.hv_clock to get current guest time
>>     Else
>>         create a temporary hv_clock, based on masterclock.
> 
> I don't want to do that last 'else' clause yet, because that feels like
> a temporary workaround. It should be enough to call get_kvmclock_ns(),
> once we fix it.
> 
> 
> 

Thank you very much for the detailed explanation.

I agree it is important to resolve the "now" problem. I guess the KVM lapic
deadline timer has the "now" problem as well.


I just notice my question missed a key prerequisite:

Would you mind helping explain the time domain of the "oneshot.timeout_abs_ns"?

While it is the absolute nanosecond value at the VM side, on which time domain
it is based?

1. Is oneshot.timeout_abs_ns based on the xen pvclock (freq=NSEC_PER_SEC)?

2. Is oneshot.timeout_abs_ns based on tsc from VM side?

3. Is oneshot.timeout_abs_ns based on monotonic/raw clock at VM side?

4. Or it is based on wallclock?

I think the OS does not have a concept of nanoseconds. It is derived from a
clocksource.



If it is based on pvclock, is it based on the pvclock from a specific vCPU, as
both pvclock and timer are per-vCPU.


E.g., according to the KVM lapic deadline timer, all values are based on (1) the
tsc value, (2)on the current vCPU.


1949 static void start_sw_tscdeadline(struct kvm_lapic *apic)
1950 {
1951         struct kvm_timer *ktimer = &apic->lapic_timer;
1952         u64 guest_tsc, tscdeadline = ktimer->tscdeadline;
1953         u64 ns = 0;
1954         ktime_t expire;
1955         struct kvm_vcpu *vcpu = apic->vcpu;
1956         unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz;
1957         unsigned long flags;
1958         ktime_t now;
1959
1960         if (unlikely(!tscdeadline || !this_tsc_khz))
1961                 return;
1962
1963         local_irq_save(flags);
1964
1965         now = ktime_get();
1966         guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
1967
1968         ns = (tscdeadline - guest_tsc) * 1000000ULL;
1969         do_div(ns, this_tsc_khz);


Sorry if I make the question very confusing. The core question is: where and
from which clocksource the abs nanosecond value is from? What will happen if the
Xen VM uses HPET as clocksource, while xen timer as clock event?

Regardless the clocksource, KVM VM may always use current vCPU's tsc in the
lapic deadline timer.

Thank you very much!

Dongli Zhang