linux-kernel - Re: [PATCH RFC 1/1] KVM: x86: add param to update master clock periodically

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a8479764-34a1-334f-3865-c01325d772d9@oracle.com>
Date:   Mon, 2 Oct 2023 10:17:13 -0700
From:   Dongli Zhang <dongli.zhang@...cle.com>
To:     Sean Christopherson <seanjc@...gle.com>,
        David Woodhouse <dwmw2@...radead.org>
Cc:     Joe Jin <joe.jin@...cle.com>, x86@...nel.org, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, pbonzini@...hat.com,
        tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
        dave.hansen@...ux.intel.com
Subject: Re: [PATCH RFC 1/1] KVM: x86: add param to update master clock
 periodically

Hi Sean and David,

On 10/2/23 09:37, Sean Christopherson wrote:
> On Mon, Oct 02, 2023, David Woodhouse wrote:
>> On Fri, 2023-09-29 at 13:15 -0700, Dongli Zhang wrote:
>>>
>>>
>>> We want more frequent KVM_REQ_MASTERCLOCK_UPDATE.
>>>
>>> This is because:
>>>
>>> 1. The vcpu->hv_clock (kvmclock) is based on its own mult/shift/equation.
>>>
>>> 2. The raw monotonic (tsc_clocksource) uses different mult/shift/equation.
>>>
>>> 3. As a result, given the same rdtsc, kvmclock and raw monotonic may return
>>> different results (this is expected because they have different
>>> mult/shift/equation).
>>>
>>> 4. However, the base in  kvmclock calculation (tsc_timestamp and system_time)
>>> are derived from raw monotonic clock (master clock)
>>
>> That just seems wrong. I don't mean that you're incorrect; it seems
>> *morally* wrong.
>>
>> In a system with X86_FEATURE_CONSTANT_TSC, why would KVM choose to use
>> a *different* mult/shift/equation (your #1) to convert TSC ticks to
>> nanoseconds than the host CLOCK_MONOTONIC_RAW does (your #2).
>>
>> I understand that KVM can't track the host's CLOCK_MONOTONIC, as it's
>> adjusted by NTP. But CLOCK_MONOTONIC_RAW is supposed to be consistent.
>>
>> Fix that, and the whole problem goes away, doesn't it?
>>
>> What am I missing here, that means we can't do that?
> 
> I believe the answer is that "struct pvclock_vcpu_time_info" and its math are
> ABI between KVM and KVM guests.
> 
> Like many of the older bits of KVM, my guess is that KVM's behavior is the product
> of making things kinda sorta work with old hardware, i.e. was probably the least
> awful solution in the days before constant TSCs, but is completely nonsensical on
> modern hardware.
> 
>> Alternatively... with X86_FEATURE_CONSTANT_TSC, why do the sync at all?
>> If KVM wants to decide that the TSC runs at a different frequency to
>> the frequency that the host uses for CLOCK_MONOTONIC_RAW, why can't KVM
>> just *stick* to that?
> 
> Yeah, bouncing around guest time when the TSC is constant seems counterproductive.
> 
> However, why does any of this matter if the host has a constant TSC?  If that's
> the case, a sane setup will expose a constant TSC to the guest and the guest will
> use the TSC instead of kvmclock for the guest clocksource.
> 
> Dongli, is this for long-lived "legacy" guests that were created on hosts without
> a constant TSC?  If not, then why is kvmclock being used?  Or heaven forbid, are
> you running on hardware without a constant TSC? :-)

This is for test guests, and the host has all of below:

tsc, rdtscp, constant_tsc, nonstop_tsc, tsc_deadline_timer, tsc_adjust


A clocksource is used for two things.


1. current_clocksource. Yes, I agree we should always use tsc on modern hardware.

Do we need to update the documentation to always suggest TSC when it is
constant, as I believe many users still prefer pv clock than tsc?

Thanks to tsc ratio scaling, the live migration will not impact tsc.

>From the source code, the rating of kvm-clock is still higher than tsc.

BTW., how about to decrease the rating if guest detects constant tsc?

166 struct clocksource kvm_clock = {
167         .name   = "kvm-clock",
168         .read   = kvm_clock_get_cycles,
169         .rating = 400,
170         .mask   = CLOCKSOURCE_MASK(64),
171         .flags  = CLOCK_SOURCE_IS_CONTINUOUS,
172         .enable = kvm_cs_enable,
173 };

1196 static struct clocksource clocksource_tsc = {
1197         .name                   = "tsc",
1198         .rating                 = 300,
1199         .read                   = read_tsc,


2. The sched_clock.

The scheduling is impacted if there is big drift.

Fortunately, according to my general test (without RT requirement), the impact
is trivial unless the two masterclock *updates* are between a super long period.

In the past, the scheduling was impacted a lot when the masterclock was still
based on monotonic (not raw).

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=53fafdbb8b21fa99dfd8376ca056bffde8cafc11


Unfortunately, the "no-kvmclock" kernel parameter disables all pv clock
operations (not only sched_clock), e.g., after line 300.

296 void __init kvmclock_init(void)
297 {
298         u8 flags;
299
300         if (!kvm_para_available() || !kvmclock)
301                 return;
302
303         if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
304                 msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
305                 msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
306         } else if (!kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
307                 return;
308         }
309
310         if (cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "kvmclock:setup_percpu",
311                               kvmclock_setup_percpu, NULL) < 0) {
312                 return;
313         }
314
315         pr_info("kvm-clock: Using msrs %x and %x",
316                 msr_kvm_system_time, msr_kvm_wall_clock);
317
318         this_cpu_write(hv_clock_per_cpu, &hv_clock_boot[0]);
319         kvm_register_clock("primary cpu clock");
320         pvclock_set_pvti_cpu0_va(hv_clock_boot);
321
322         if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
323                 pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
324
325         flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
326         kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
327
328         x86_platform.calibrate_tsc = kvm_get_tsc_khz;
329         x86_platform.calibrate_cpu = kvm_get_tsc_khz;
330         x86_platform.get_wallclock = kvm_get_wallclock;
331         x86_platform.set_wallclock = kvm_set_wallclock;
332 #ifdef CONFIG_X86_LOCAL_APIC
333         x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
334 #endif
335         x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
336         x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
337         kvm_get_preset_lpj();


Should I introduce a new param to disable no-kvm-sched-clock only, or to
introduce a new param to allow the selection of sched_clock?


Those may not resolve the issue in another thread. (I guess there is a chance
that to refresh the masterclock may reduce the drift in that case, although
never tried)

https://lore.kernel.org/all/00fba193-238e-49dc-fdc4-0b93f20569ec@oracle.com/

Thank you very much!

Dongli Zhang

> 
> Not saying we shouldn't sanitize the kvmclock behavior, but knowing the exact
> problematic configuration(s) will help us make a better decision on how to fix
> the mess.