[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a8479764-34a1-334f-3865-c01325d772d9@oracle.com>
Date: Mon, 2 Oct 2023 10:17:13 -0700
From: Dongli Zhang <dongli.zhang@...cle.com>
To: Sean Christopherson <seanjc@...gle.com>,
David Woodhouse <dwmw2@...radead.org>
Cc: Joe Jin <joe.jin@...cle.com>, x86@...nel.org, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, pbonzini@...hat.com,
tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com
Subject: Re: [PATCH RFC 1/1] KVM: x86: add param to update master clock
periodically
Hi Sean and David,
On 10/2/23 09:37, Sean Christopherson wrote:
> On Mon, Oct 02, 2023, David Woodhouse wrote:
>> On Fri, 2023-09-29 at 13:15 -0700, Dongli Zhang wrote:
>>>
>>>
>>> We want more frequent KVM_REQ_MASTERCLOCK_UPDATE.
>>>
>>> This is because:
>>>
>>> 1. The vcpu->hv_clock (kvmclock) is based on its own mult/shift/equation.
>>>
>>> 2. The raw monotonic (tsc_clocksource) uses different mult/shift/equation.
>>>
>>> 3. As a result, given the same rdtsc, kvmclock and raw monotonic may return
>>> different results (this is expected because they have different
>>> mult/shift/equation).
>>>
>>> 4. However, the base inĀ kvmclock calculation (tsc_timestamp and system_time)
>>> are derived from raw monotonic clock (master clock)
>>
>> That just seems wrong. I don't mean that you're incorrect; it seems
>> *morally* wrong.
>>
>> In a system with X86_FEATURE_CONSTANT_TSC, why would KVM choose to use
>> a *different* mult/shift/equation (your #1) to convert TSC ticks to
>> nanoseconds than the host CLOCK_MONOTONIC_RAW does (your #2).
>>
>> I understand that KVM can't track the host's CLOCK_MONOTONIC, as it's
>> adjusted by NTP. But CLOCK_MONOTONIC_RAW is supposed to be consistent.
>>
>> Fix that, and the whole problem goes away, doesn't it?
>>
>> What am I missing here, that means we can't do that?
>
> I believe the answer is that "struct pvclock_vcpu_time_info" and its math are
> ABI between KVM and KVM guests.
>
> Like many of the older bits of KVM, my guess is that KVM's behavior is the product
> of making things kinda sorta work with old hardware, i.e. was probably the least
> awful solution in the days before constant TSCs, but is completely nonsensical on
> modern hardware.
>
>> Alternatively... with X86_FEATURE_CONSTANT_TSC, why do the sync at all?
>> If KVM wants to decide that the TSC runs at a different frequency to
>> the frequency that the host uses for CLOCK_MONOTONIC_RAW, why can't KVM
>> just *stick* to that?
>
> Yeah, bouncing around guest time when the TSC is constant seems counterproductive.
>
> However, why does any of this matter if the host has a constant TSC? If that's
> the case, a sane setup will expose a constant TSC to the guest and the guest will
> use the TSC instead of kvmclock for the guest clocksource.
>
> Dongli, is this for long-lived "legacy" guests that were created on hosts without
> a constant TSC? If not, then why is kvmclock being used? Or heaven forbid, are
> you running on hardware without a constant TSC? :-)
This is for test guests, and the host has all of below:
tsc, rdtscp, constant_tsc, nonstop_tsc, tsc_deadline_timer, tsc_adjust
A clocksource is used for two things.
1. current_clocksource. Yes, I agree we should always use tsc on modern hardware.
Do we need to update the documentation to always suggest TSC when it is
constant, as I believe many users still prefer pv clock than tsc?
Thanks to tsc ratio scaling, the live migration will not impact tsc.
>From the source code, the rating of kvm-clock is still higher than tsc.
BTW., how about to decrease the rating if guest detects constant tsc?
166 struct clocksource kvm_clock = {
167 .name = "kvm-clock",
168 .read = kvm_clock_get_cycles,
169 .rating = 400,
170 .mask = CLOCKSOURCE_MASK(64),
171 .flags = CLOCK_SOURCE_IS_CONTINUOUS,
172 .enable = kvm_cs_enable,
173 };
1196 static struct clocksource clocksource_tsc = {
1197 .name = "tsc",
1198 .rating = 300,
1199 .read = read_tsc,
2. The sched_clock.
The scheduling is impacted if there is big drift.
Fortunately, according to my general test (without RT requirement), the impact
is trivial unless the two masterclock *updates* are between a super long period.
In the past, the scheduling was impacted a lot when the masterclock was still
based on monotonic (not raw).
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=53fafdbb8b21fa99dfd8376ca056bffde8cafc11
Unfortunately, the "no-kvmclock" kernel parameter disables all pv clock
operations (not only sched_clock), e.g., after line 300.
296 void __init kvmclock_init(void)
297 {
298 u8 flags;
299
300 if (!kvm_para_available() || !kvmclock)
301 return;
302
303 if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
304 msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
305 msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
306 } else if (!kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
307 return;
308 }
309
310 if (cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "kvmclock:setup_percpu",
311 kvmclock_setup_percpu, NULL) < 0) {
312 return;
313 }
314
315 pr_info("kvm-clock: Using msrs %x and %x",
316 msr_kvm_system_time, msr_kvm_wall_clock);
317
318 this_cpu_write(hv_clock_per_cpu, &hv_clock_boot[0]);
319 kvm_register_clock("primary cpu clock");
320 pvclock_set_pvti_cpu0_va(hv_clock_boot);
321
322 if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
323 pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
324
325 flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
326 kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
327
328 x86_platform.calibrate_tsc = kvm_get_tsc_khz;
329 x86_platform.calibrate_cpu = kvm_get_tsc_khz;
330 x86_platform.get_wallclock = kvm_get_wallclock;
331 x86_platform.set_wallclock = kvm_set_wallclock;
332 #ifdef CONFIG_X86_LOCAL_APIC
333 x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
334 #endif
335 x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
336 x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
337 kvm_get_preset_lpj();
Should I introduce a new param to disable no-kvm-sched-clock only, or to
introduce a new param to allow the selection of sched_clock?
Those may not resolve the issue in another thread. (I guess there is a chance
that to refresh the masterclock may reduce the drift in that case, although
never tried)
https://lore.kernel.org/all/00fba193-238e-49dc-fdc4-0b93f20569ec@oracle.com/
Thank you very much!
Dongli Zhang
>
> Not saying we shouldn't sanitize the kvmclock behavior, but knowing the exact
> problematic configuration(s) will help us make a better decision on how to fix
> the mess.
Powered by blists - more mailing lists