linux-kernel - Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7846c5452644a3c029d27361759cb82c2b8d4f2e.camel@redhat.com>
Date:   Thu, 10 Dec 2020 16:25:57 +0200
From:   Maxim Levitsky <mlevitsk@...hat.com>
To:     Paolo Bonzini <pbonzini@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Oliver Upton <oupton@...gle.com>
Cc:     kvm list <kvm@...r.kernel.org>, "H. Peter Anvin" <hpa@...or.com>,
        Jonathan Corbet <corbet@....net>,
        Jim Mattson <jmattson@...gle.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        "open list:KERNEL SELFTEST FRAMEWORK" 
        <linux-kselftest@...r.kernel.org>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Marcelo Tosatti <mtosatti@...hat.com>,
        Sean Christopherson <sean.j.christopherson@...el.com>,
        open list <linux-kernel@...r.kernel.org>,
        Ingo Molnar <mingo@...hat.com>,
        "maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)" <x86@...nel.org>,
        Joerg Roedel <joro@...tes.org>, Borislav Petkov <bp@...en8.de>,
        Shuah Khan <shuah@...nel.org>,
        Andrew Jones <drjones@...hat.com>,
        "open list:DOCUMENTATION" <linux-doc@...r.kernel.org>
Subject: Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

On Thu, 2020-12-10 at 12:48 +0100, Paolo Bonzini wrote:
> On 08/12/20 18:08, Maxim Levitsky wrote:
> > > Even if you support TSCADJUST and let the guest write to it does not
> > > change the per guest offset at all. TSCADJUST is per [v]CPU and adds on
> > > top:
> > > 
> > >      tscvcpu = tsc_host + guest_offset + TSC_ADJUST
> > > 
> > > Scaling is just orthogonal and does not change any of this.
> > 
> > I agree with this, and I think that this is what we will end up doing.
> > Paulo, what do you think about this?
> 
> Yes, you can have a VM ioctl that saves/restores cur_tsc_nsec and 
> cur_tsc_write.  The restore side would loop on all vcpus.

Why would the restore need to run on all VCPUs though?

The picture that I have in mind is that we we will restore the 
cur_tsc_nsec/cur_tsc_write pair once per VM, and then restore the 
TSC_ADJUST value on all vCPUs as if the guest wrote it (with the quirk disabled), 
which will restore all the horrible things that the guest did to its TSC.

Since TSC adjust is enabled by default, if the user disables it in the CPUID, 
we might as well just disable the whole thing, 
make TSC readonly or something similar.

This way we don't need to worry about TSC writes as 
those will be reflected in the TSC_ADJUST.

> 
> However, it is not so easy: 1) it would have to be usable only if 
> KVM_X86_QUIRK_TSC_HOST_ACCESS is false, 2) it would fail if 
> kvm->arch.nr_vcpus_matched_tsc == kvm->online_vcpus (which basically 
> means that userspace didn't mess up the TSC configuration).  If not, it 
> would return -EINVAL.

Note though that we don't track the guest tsc/tsc_adjust writes anymore
via the tsc sync code, and with the quirk disabled we don't track them
even for the host writes, thus the (2) condition will always be true
(the only call left to kvm_synchronize_tsc is in the kvm_arch_vcpu_postcreate)

However I indeed see no reason to allow new per VM API when the masterclock is
disabled, and therefore using cur_tsc_nsec/cur_tsc_write is reasonable.

The cur_tsc_nsec should IMHO be converted to absolute time (CLOCK_REALTIME
or similiar) or should the conversion be done in the userspace? 
I don't know yet if I can convert between different POSIX clocks
in race/error free manner. (e.g get the offset between them).


> 
> Also, while at it let's burn and pour salt on the support for 
> KVM_SET_TSC_KHZ unless TSC scaling is supported, together with 
> vcpu->tsc_catchup and all the "tolerance" crap that is in 
> kvm_set_tsc_khz.  And initialize vcpu->arch.virtual_tsc_khz to 
> kvm->arch.last_tsc_khz before calling kvm_synchronize_tsc.

The tsc_catchup is enabled when host TSC is unstable, so that guest
TSC at least roughtly follows real time (which host kernel gets
by other means).

We push the guest tsc forward roughtly each time VM entry from userspace
happens:

(On each kvm_arch_vcpu_load, we raise KVM_REQ_GLOBAL_CLOCK_UPDATE
request which (with small delay) makes all vcpus go through 
kvm_guest_time_update which pushes the guest tsc forward)

However we also have the 'tsc_always_catchup' mode which is indeed 
something I would like to remove.

It is a poor man simulation of the TSC scaling which is enabled 
when the host doesn't support TSC scaling, but we were asked 
to run at frequency faster than host TSC frequency is.

This way guest tsc runs slower than it should, but on each VM exit,
we punt the guest tsc offset forward so on average the guest tsc
doesn't lag behind.

Unless backward compatibility is an issue, I have no objections
to remove that code.

The tolerance thing might be needed. On many systems 
(including mine new 3970X), the hardware doesn't have means to report 
the unscaled host TSC frequency, thus the kernel has to 
measure it, and since the measurement is not 100% accurate, a slightly
different value is used every time the host boots.

Thus without a small tolerance, we will always have to use TSC scaling,
while migrating even between two identical systems.
I don't know if this is an issue.


Best regards,
	Maxim Levitsky


> 
> Paolo
>