Message-ID: <43980946-7bbf-dcef-7e40-af904c456250@linux.microsoft.com>
Date:   Fri, 10 Feb 2023 19:17:05 +0100
From:   Jeremi Piotrowski <jpiotrowski@...ux.microsoft.com>
To:     Paolo Bonzini <pbonzini@...hat.com>,
        Sean Christopherson <seanjc@...gle.com>
Cc:     kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
        Tianyu Lan <ltykernel@...il.com>,
        "Michael Kelley (LINUX)" <mikelley@...rosoft.com>
Subject: "KVM: x86/mmu: Overhaul TDP MMU zapping and flushing" breaks SVM on
 Hyper-V

Hi Paolo/Sean,

We've noticed that the changes introduced in "KVM: x86/mmu: Overhaul TDP MMU zapping and flushing"
conflict with a nested Hyper-V enlightenment that is always enabled on AMD CPUs
(HV_X64_NESTED_ENLIGHTENED_TLB). The affected scenario is L0 Hyper-V + L1 KVM on AMD.

L2 VMs fail to boot due to stale data being seen on the L1/L2 side; it looks
like the NPT is not in sync with L0. I can reproduce this on any kernel >= 5.18.
The easiest way is to launch qemu in a loop with a debug OVMF build: you will observe
various #GP faults, assert failures, or the guest just suddenly dying. You can try it
for yourself in Azure by launching an Ubuntu 22.10 image on an AMD SKU with nested
virtualization (Da_v5).

While investigating, I found that any one of these 3 things allows L2 guests to boot again:
* force tdp_mmu=N when loading kvm
* recompile L1 kernel to force disable HV_X64_NESTED_ENLIGHTENED_TLB
* revert both of these commits (found through bisecting):
bb95dfb9e2dfbe6b3f5eb5e8a20e0259dadbe906 "KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages"
efd995dae5eba57c5d28d6886a85298b390a4f07 "KVM: x86/mmu: Zap defunct roots via asynchronous worker"

I'll paste our understanding of what is happening (thanks Tianyu):
"""
Hyper-V provides the HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE
and HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST hypercalls so that the L1
hypervisor can notify Hyper-V after it changes the L2 GPA <-> L1 GPA address
translation tables (EPT on Intel, NPT on AMD). This may help Hyper-V avoid
marking the L1 hypervisor's address translation tables as write-protected,
and so avoid the vmexits triggered when the L1 hypervisor changes those tables.

The commits above defer these two hypercalls when there are changes in the L1
hypervisor's address translation tables. Because of the delay, Hyper-V can't sync/shadow
the L1 address space tables right away, and this may cause a mismatch between the shadow
page tables in Hyper-V and the L1 address translation tables. IIRC, the KVM side always uses
write-protected translation tables for shadowing and so doesn't hit this issue with these commits.
"""

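To make the explanation above concrete, here is a toy Python model of the
suspected failure mode. It is only a sketch under assumptions: the class and
function names are hypothetical, and it is not Hyper-V or KVM code. It assumes
that L0 keeps its own copy of L1's NPT and refreshes that copy only when L1
issues a flush hypercall, so a deferred flush leaves a window in which L0
still translates through a mapping L1 has already zapped.

```python
# Toy model only: names and behavior are illustrative assumptions,
# not Hyper-V or KVM code.

class ToyL0:
    """L0 keeps a cached copy of L1's NPT and resyncs only when told to."""

    def __init__(self, l1_npt):
        self.l1_npt = l1_npt        # live table owned by L1: L2 GPA -> L1 GPA
        self.shadow = dict(l1_npt)  # L0's copy, actually used to run L2

    def flush_hypercall(self):
        # Stand-in for HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE:
        # L0 rebuilds its copy from L1's current table.
        self.shadow = dict(self.l1_npt)

    def translate(self, l2_gpa):
        return self.shadow.get(l2_gpa)


def stale_after_zap(flush_before_reuse):
    """Return True if L0 still translates a mapping that L1 already zapped."""
    l1_npt = {0x1000: 0xA000}
    l0 = ToyL0(l1_npt)

    old_target = l1_npt.pop(0x1000)   # L1 zaps the mapping
    if flush_before_reuse:
        l0.flush_hypercall()          # L1 notifies L0 right away
    # At this point L1 may reuse the page at old_target for something else.
    return l0.translate(0x1000) == old_target


print(stale_after_zap(flush_before_reuse=False))
print(stale_after_zap(flush_before_reuse=True))
```

In this model the first call prints True (L2 can still reach the old page
through L0's stale copy) and the second prints False. The real interaction is
of course more involved; this only illustrates why the ordering of the zap,
the flush, and the page reuse matters.
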
Let me know if either of you has any ideas on how to approach fixing this.
I'm not familiar enough with the TDP MMU code to be able to contribute a fix directly,
but I'm happy to help in any way I can.

Jeremi
