Date:   Fri, 11 Dec 2020 22:04:52 +0100
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Marcelo Tosatti <mtosatti@...hat.com>
Cc:     Maxim Levitsky <mlevitsk@...hat.com>, kvm@...r.kernel.org,
        "H. Peter Anvin" <hpa@...or.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Jonathan Corbet <corbet@....net>,
        Jim Mattson <jmattson@...gle.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest@...r.kernel.org>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Sean Christopherson <sean.j.christopherson@...el.com>,
        open list <linux-kernel@...r.kernel.org>,
        Ingo Molnar <mingo@...hat.com>,
        "maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)" <x86@...nel.org>,
        Joerg Roedel <joro@...tes.org>, Borislav Petkov <bp@...en8.de>,
        Shuah Khan <shuah@...nel.org>,
        Andrew Jones <drjones@...hat.com>,
        Oliver Upton <oupton@...gle.com>,
        "open list:DOCUMENTATION" <linux-doc@...r.kernel.org>
Subject: Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

On Fri, Dec 11 2020 at 11:18, Marcelo Tosatti wrote:
> On Fri, Dec 11, 2020 at 02:30:34PM +0100, Thomas Gleixner wrote:
> Unless you notify applications to invalidate their time reads,
> I can't see a way to fix this.

This is just wrong. Suspend/resume handles that fine and the system is
guaranteed to come back with time which is very close to reality.

And for suspend/resume everything from kernel to userspace can have a
notification before suspend and post resume. So applications or those
parts of the kernel which are e.g. time sensitive can prepare upfront
for the disruption and mop up on resume.
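
On the kernel side that notification is just the existing PM notifier
chain; userspace gets the equivalent via e.g. systemd-logind's
PrepareForSleep signal. A minimal sketch, where the two helpers are
made-up placeholders for whatever a time-sensitive subsystem actually
does around the disruption:

/*
 * Minimal sketch of a kernel-side hook via the existing PM notifier
 * chain (include/linux/suspend.h). quiesce_timestamps() and
 * revalidate_timestamps() are placeholders, not real kernel APIs.
 */
#include <linux/init.h>
#include <linux/notifier.h>
#include <linux/suspend.h>

static void quiesce_timestamps(void)    { /* placeholder */ }
static void revalidate_timestamps(void) { /* placeholder */ }

static int ts_pm_notify(struct notifier_block *nb, unsigned long action,
			void *data)
{
	switch (action) {
	case PM_SUSPEND_PREPARE:	/* prepare before the disruption */
		quiesce_timestamps();
		break;
	case PM_POST_SUSPEND:		/* mop up on resume */
		revalidate_timestamps();
		break;
	}
	return NOTIFY_OK;
}

static struct notifier_block ts_pm_nb = {
	.notifier_call = ts_pm_notify,
};

static int __init ts_pm_init(void)
{
	return register_pm_notifier(&ts_pm_nb);
}
late_initcall(ts_pm_init);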

> Therefore if you use VM migration in the first place, a certain amount of
> timestamp accuracy error must be tolerated.

That's just because it was never designed in the right way. And you
simply declared that all applications have to deal with that.

Again, where is this documented? VMs are subject to migration whether
the customer who pays for it wants it or not. None of the virt tool docs
mentions that pausing a VM for a long time makes timekeeping go
south.

I still have no sensible explanation WHY time should not advance across
a migration. All you told me is that customers complained. Which
customers? The ones running the hosts or the ones paying for the VM?

It's all just decided by some folks to "fix" a problem with the pause/
migration mechanism they designed instead of fixing the design failure.

>> How can you even assume that this is correct?
>
> As noted above, even without a window of unsynchronized time (due to
> delay for NTP to sync time), time reads can be stale.

So with suspend/resume we have:

app:
   t = clock_gettime()
        <---------------- tsuspend
        <---------------- tresume
        So t is now behind reality by tresume - tsuspend

   packet -> check timestamp .... ooops, recheck
   t = clock_gettime()
   and t and timestamp are in the same ballpark again

Now with your thing:

app:
   t = clock_gettime()
        <---------------- tpause
        <---------------- tresume
        So t is now behind reality by tresume - tpause

   packet -> check timestamp .... ooops, recheck
   t = clock_gettime()
   and t and timestamp are still apart by ~(tresume - tpause)

This persists until NTP kicks in, which happens if and only if NTP is
running.

Can you spot the difference?
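
To spell the recheck pattern out, a minimal userspace sketch;
STALE_LIMIT_NS, cached_t and the helpers are made up for illustration.
The recheck only helps in the suspend/resume case, because only there
is the clock itself back in sync with reality after resume:

#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define STALE_LIMIT_NS	INT64_C(1000000000)	/* 1s, arbitrary example */

static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_REALTIME, &ts);
	return (int64_t)ts.tv_sec * INT64_C(1000000000) + ts.tv_nsec;
}

static int64_t delta_ns(int64_t a, int64_t b)
{
	return a > b ? a - b : b - a;
}

static int64_t cached_t;	/* t = clock_gettime() from some earlier point */

static bool timestamp_in_ballpark(int64_t pkt_ts_ns)
{
	if (delta_ns(pkt_ts_ns, cached_t) > STALE_LIMIT_NS) {
		/*
		 * ooops -> recheck. After suspend/resume the fresh read
		 * is close to reality again. After a paused migration it
		 * is not: CLOCK_REALTIME itself is still behind by
		 * ~(tresume - tpause) until NTP steps it.
		 */
		cached_t = now_ns();
	}
	return delta_ns(pkt_ts_ns, cached_t) <= STALE_LIMIT_NS;
}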

>> It is exactly the same problem as we had many years ago with hardware
>> clocks suddenly stopping to tick which caused quite some stuff to go
>> belly up.
>
> Customers complained when it was 5 seconds off, now it's 0.1ms (and
> people seem happy).

And because customers complained you decided to create a scenario which
is completely different from all other scenarios and which, from a
timekeeping POV, makes no sense at all.

>> In a proper suspend/resume scenario CLOCK_REALTIME/TAI are advanced
>> (with a certain degree of accuracy) to compensate for the sleep time, so
>> the other end of a communication is at least in the same ballpark, but
>> not 50 seconds off.
>
> It's 100ms off with migration, and can be reduced further (customers
> complained about 5 seconds but seem happy with 0.1ms).

What is 100ms? Guaranteed maximum migration time?

CLOCK_REALTIME and CLOCK_TAI are off by the time the VM is paused and
this state persists up to the point where NTP corrects it with a time
jump.

So if migration takes 5 seconds then CLOCK_REALTIME is not off by
100ms, it's off by 5 seconds.

CLOCK_MONOTONIC/BOOTTIME might be off by 100ms between pause and resume.

> OK, makes sense, then reducing the 0.1ms window even further
> is a useful thing to do. What would be an acceptable 
> CLOCK_REALTIME accuracy error, on migration?

Can you please explain how you can achieve 0.1ms accuracy when the
migration time is more than that and the guest TSC is just restored to
the value at which it was stopped?

Then ALL clocks, including CLOCK_REALTIME and CLOCK_TAI, continue from
the point at which they were stopped. Ergo:

      t(CLOCK_REALTIME) = t(REALITY) - t(STOPPED)

CLOCK_REALTIME and CLOCK_TAI are global clocks and they have rules which
have to be respected in order to make stuff work.

CLOCK_MONOTONIC and CLOCK_BOOTTIME are local to a system (host, guests),
so manipulating them is a completely different story, albeit the kernel
has explicit guarantees about the relationship between CLOCK_MONOTONIC,
CLOCK_BOOTTIME and CLOCK_REALTIME/TAI.
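
Those guarantees are trivially observable from userspace. A minimal
sketch (CLOCK_BOOTTIME and CLOCK_TAI are Linux specific; the four reads
are not atomic, so the deltas are approximations):

#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>

static double read_clock(clockid_t id)
{
	struct timespec ts;

	clock_gettime(id, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	double mono = read_clock(CLOCK_MONOTONIC);
	double boot = read_clock(CLOCK_BOOTTIME);
	double real = read_clock(CLOCK_REALTIME);
	double tai  = read_clock(CLOCK_TAI);

	/* BOOTTIME - MONOTONIC == accumulated suspend time */
	printf("time spent suspended: %.3f s\n", boot - mono);
	/* TAI - REALTIME == TAI offset (leap seconds), if NTP has set it */
	printf("TAI offset:           %.1f s\n", tai - real);
	return 0;
}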

If you could guarantee t(STOPPED) < 100ms and therefore

   t(REALITY) - t(CLOCK_REALTIME) < 100ms

under _all_ circumstances then we would not even be having this
discussion.

Even < 1000ms might be acceptable. That's the margin of error which
also occurs across bare metal suspend/resume when the sleep time has to
be read from the RTC because the TSC stops across suspend.

But suspend/resume is still substantially different because everything
from kernel to userspace can have a notification before suspend. So
applications or those parts of the kernel which are e.g. time sensitive
can prepare upfront for the disruption. In your case, not at all. They
just have to cope with the fallout. Brilliant.

Your design, or the lack of it, just decides that everything has to
cope with what you decided is the right thing to do and for however
long it takes.

Yes it "works" by some definition of works, but that does not mean it
works correctly.

Thanks,

        tglx
