linux-kernel - Re: recalibrating x86 TSC during suspend/resume

Message-ID: <alpine.DEB.2.21.1902221212190.1777@nanos.tec.linutronix.de>
Date:   Fri, 22 Feb 2019 12:44:39 +0100 (CET)
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Olaf Hering <olaf@...fle.de>
cc:     John Stultz <john.stultz@...aro.org>,
        Stephen Boyd <sboyd@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>, x86@...nel.org,
        Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: recalibrating x86 TSC during suspend/resume

On Fri, 22 Feb 2019, Olaf Hering wrote:
> Is there a way to recalibrate the x86 TSC during a suspend/resume cycle?

No.

> While the frequency will remain the same on a laptop, it may (or rather:
> it definitely will) differ if a VM is migrated from one host to another.
> The hypervisor may choose to emulate the expected TSC frequency on the
> destination host, but this emulation comes with a significant
> performance cost. Therefore it would be good if the kernel evaluates the
> environment during resume.
> 
> The specific use case I have is a workload within VMs that makes heavy
> use of TSC. The kernel is booted with 'clocksource=tsc highres=off nohz=off'
> because only this clocksource gives enough granularity. The default
> paravirtualized clock will return the same values via
> clock_gettime(CLOCK_MONOTONIC) if the timespan between two calls is too
> short. This does not happen with 'clocksource=tsc'.
> 
> Right now it is not possible to migrate VMs to hosts with different CPU
> speeds. This leads to "islands" of identical hardware, and makes
> maintenance of hosts harder than it needs to be. If the VM kernel were
> able to cope with CPU/TSC frequency changes, the pool of potential
> destination hosts would become significantly larger.

The problem with recalibrating TSC on resume is that it would have to be

    1) quick

    2) accurate, so NTP does not get utterly unhappy.

Newer Intels support TSC scaling for VMX, which could solve the problem. It
affects TSC readout by:

	TSC = (read(HWTSC) * multiplier) >> 48

So you can standardize on a TSC frequency across a fleet. Not sure when
that was introduced and no idea whether it's available on AMD.
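
As a minimal sketch of the readout formula above, the following C models how
a multiplier with 48 fractional bits could map a host TSC onto a fleet-wide
guest frequency. The function names and the way the multiplier is derived
from two frequencies are assumptions made for illustration; only the
multiply-and-shift-by-48 step comes from the formula itself. The 128-bit
intermediate relies on the GCC/Clang `unsigned __int128` extension.

```c
#include <assert.h>
#include <stdint.h>

#define TSC_SCALE_FRAC_BITS 48	/* shift taken from the readout formula */

/*
 * Hypothetical helper: fixed-point ratio guest_hz / host_hz with 48
 * fractional bits, truncated. A ratio of 1.0 is represented as 1 << 48.
 */
static uint64_t tsc_scale_multiplier(uint64_t guest_hz, uint64_t host_hz)
{
	assert(host_hz != 0);
	return (uint64_t)(((unsigned __int128)guest_hz << TSC_SCALE_FRAC_BITS)
			  / host_hz);
}

/*
 * TSC = (read(HWTSC) * multiplier) >> 48
 * The product is formed in 128 bits so that it cannot overflow.
 */
static uint64_t tsc_scale(uint64_t hw_tsc, uint64_t multiplier)
{
	return (uint64_t)(((unsigned __int128)hw_tsc * multiplier)
			  >> TSC_SCALE_FRAC_BITS);
}
```

With a 3 GHz host and a 2 GHz target, one second of host cycles scales to
roughly 2,000,000,000 guest cycles; because the multiplier is truncated, the
result can be short by a cycle.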

For a software solution we could try the following:

 1) Provide the raw TSC frequency of the host to the guest in some magic
    software defined MSR or CPUID. If there is an existing mechanism, use
    that.

 2) On resume check whether the MSR/CPUID is available and if so read out
    that information and check whether the frequency is the same as
    before. If not it is trivial enough to adjust the guest mult/shift
    values for both raw and NTP adjusted clocks before they are used again,
    i.e. before timekeeping_resume(). Need to look what's the best place,
    but probably the clocksource resume callback. Plus if TSC deadline
    timer is used, we'd need the same adjustment there.

    That's backward compatible, because if the MSR/CPUID is not there, then
    the recalibration is not tried.
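
The resume-time check in step 2 could look roughly like the sketch below. It
is a userspace model, not kernel code: `struct guest_tsc_clock` and all
function names are hypothetical, the host frequency is passed in as a plain
argument (0 standing for "MSR/CPUID not available") because no such interface
is defined yet, and the mult/shift computation is a simplified stand-in for
what the kernel's clocksource code does. It only covers the
cycles-to-nanoseconds conversion, not the NTP-adjusted values or the TSC
deadline timer.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical guest-side state; not the kernel's struct clocksource. */
struct guest_tsc_clock {
	uint64_t freq_hz;	/* frequency mult/shift were computed for */
	uint32_t mult;
	uint32_t shift;
};

/* Pick mult so that ns = (cycles * mult) >> shift holds for freq_hz. */
static void guest_tsc_set_freq(struct guest_tsc_clock *clk,
			       uint64_t freq_hz, uint32_t shift)
{
	uint64_t mult;

	assert(freq_hz != 0 && shift < 32);
	/* Round to nearest to keep the conversion error small. */
	mult = ((NSEC_PER_SEC << shift) + freq_hz / 2) / freq_hz;
	assert(mult != 0 && mult <= UINT32_MAX);

	clk->freq_hz = freq_hz;
	clk->mult = (uint32_t)mult;
	clk->shift = shift;
}

static uint64_t guest_tsc_cycles_to_ns(const struct guest_tsc_clock *clk,
				       uint64_t cycles)
{
	/* 128-bit product (GCC/Clang extension) avoids overflow. */
	return (uint64_t)(((unsigned __int128)cycles * clk->mult)
			  >> clk->shift);
}

/*
 * Resume hook: returns 1 if mult was adjusted, 0 if nothing was done.
 * host_freq_hz == 0 models the backward compatible case in which the
 * host does not advertise its frequency, so no recalibration is tried.
 */
static int guest_tsc_resume(struct guest_tsc_clock *clk,
			    uint64_t host_freq_hz)
{
	if (host_freq_hz == 0 || host_freq_hz == clk->freq_hz)
		return 0;

	guest_tsc_set_freq(clk, host_freq_hz, clk->shift);
	return 1;
}
```

Whether a single rounded mult is accurate enough is exactly the open NTP
question: for frequencies that do not divide evenly, the rounding alone
leaves a small constant rate error.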

Whether that is accurate enough or not to make NTP happy, I can't tell, but
it's definitely worth a try.

Thanks,

	tglx