lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date:   Thu, 17 Sep 2020 14:07:22 +0300
From:   Maxim Levitsky <mlevitsk@...hat.com>
To:     kvm@...r.kernel.org
Cc:     linux-kernel@...r.kernel.org,
        x86@...nel.org (maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)),
        Jim Mattson <jmattson@...gle.com>,
        Sean Christopherson <sean.j.christopherson@...el.com>,
        Borislav Petkov <bp@...en8.de>, Joerg Roedel <joro@...tes.org>,
        "H. Peter Anvin" <hpa@...or.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Ingo Molnar <mingo@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Maxim Levitsky <mlevitsk@...hat.com>
Subject: [PATCH 0/1] KVM: correctly restore the TSC value on nested migration

This patch is a result of a long investigation I made to understand
why the nested migration more often than not makes the nested guest hang.
Sometimes the nested guest recovers and sometimes it hangs forever.

The root cause of this is that reading MSR_IA32_TSC while nested guest is
running returns its TSC value, that is (assuming no tsc scaling)
host tsc + L1 tsc offset + L2 tsc offset.

This is correct but it is a result of a nice curiosity of X86 VMX
(and apparently SVM too, according to my tests) implementation:

As a rule MSR reads done by the guest should either trap to host, or just
return host value, and therefore kvm_get_msr and friends, should basically
always return the L1 value of any msr.

Well, MSR_IA32_TSC is an exception. Intel's PRM states that when you disable
its interception, then in guest mode the host adds the TSC offset to
the read value.

I haven't found anything like that in AMD's PRM but according to the few
tests I made, it behaves the same.

However, there is no such exception when writing MSR_IA32_TSC, and this
poses a problem for nested migration.

When MSR_IA32_TSC is read, we read L2 value (smaller since L2 is started
after L1), and when we restore it after migration, the value is interpreted
as L1 value, thus resulting in huge TSC jump backward which the guest usually
really doesn't like, especially on AMD with APIC deadline timer, which
usually just doesn't fire afterward sending the guest into endless wait for it.

The proposed patch fixes this by making read of MSR_IA32_TSC depend on
'msr_info->host_initiated'

If guest reads the MSR, we add the TSC offset, but when host's qemu reads
the msr we skip that silly emulation of TSC offset, and return the real value
for the L1 guest which is host tsc + L1 offset.

This patch was tested on both SVM and VMX, and on both it fixes hangs.
On VMX since it uses VMX preemption timer for APIC deadline, the guest seems
to recover after a while without that patch.

To make sure that the nested migration happens I usually used
-overcommit cpu_pm=on but I reproduced this with just running an endless loop
in L2.

This is tested both with and without -invtsc,tsc-frequency=...

The migration was done by saving the migration stream to a file, and then
loading the qemu with '-incoming'

Maxim Levitsky (1):
  KVM: x86: fix MSR_IA32_TSC read for nested migration

 arch/x86/kvm/x86.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

-- 
2.26.2


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ