linux-kernel - Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

Message-ID: <234c9998-c314-44bb-ad96-6af2cece7465@intel.com>
Date: Thu, 28 Mar 2024 21:21:37 +0800
From: Xiaoyao Li <xiaoyao.li@...el.com>
To: Chao Gao <chao.gao@...el.com>
Cc: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
 "Yamahata, Isaku" <isaku.yamahata@...el.com>,
 "Zhang, Tina" <tina.zhang@...el.com>, "seanjc@...gle.com"
 <seanjc@...gle.com>, "Huang, Kai" <kai.huang@...el.com>,
 "Chen, Bo2" <chen.bo@...el.com>, "sagis@...gle.com" <sagis@...gle.com>,
 "isaku.yamahata@...il.com" <isaku.yamahata@...il.com>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 "Aktas, Erdem" <erdemaktas@...gle.com>,
 "isaku.yamahata@...ux.intel.com" <isaku.yamahata@...ux.intel.com>,
 "pbonzini@...hat.com" <pbonzini@...hat.com>, "Yuan, Hang"
 <hang.yuan@...el.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for
 unsupported cases

On 3/28/2024 6:17 PM, Chao Gao wrote:
> On Thu, Mar 28, 2024 at 11:40:27AM +0800, Xiaoyao Li wrote:
>> On 3/28/2024 11:04 AM, Edgecombe, Rick P wrote:
>>> On Thu, 2024-03-28 at 09:30 +0800, Xiaoyao Li wrote:
>>>>> The current ABI of KVM_EXIT_X86_RDMSR when TDs are created is nothing. So I don't see how this is
>>>>> any kind of ABI break. If you agree we shouldn't try to support MTRRs, do you have a different exit
>>>>> reason or behavior in mind?
>>>>
>>>> Just return an error for the RDMSR/WRMSR TDVMCALL when the TD accesses MTRR MSRs.
>>>
>>> MTRR appears to be configured to be type "Fixed" in the TDX module. So the guest could expect to be
>>> able to use it and be surprised by a #GP.
>>>
>>>           {
>>>             "MSB": "12",
>>>             "LSB": "12",
>>>             "Field Size": "1",
>>>             "Field Name": "MTRR",
>>>             "Configuration Details": null,
>>>             "Bit or Field Virtualization Type": "Fixed",
>>>             "Virtualization Details": "0x1"
>>>           },
>>>
>>> If KVM does not support MTRRs in TDX, then it has to return the error somewhere or pretend to
>>> support it (do nothing but not return an error). Returning an error to the guest would be making up
>>> arch behavior, and to a lesser degree so would ignoring the WRMSR.
>>
>> The root cause is that making MTRR fixed1 is a bad TDX design. When the
>> guest reads the MTRR CPUID bit as 1 yet gets #VE on MTRR MSRs, that already
>> breaks the architectural behavior. (MCA faces a similar issue: MCA is fixed1 as
> 
> I wouldn't say #VE on MTRR MSRs breaks anything. Writes to other MSRs (e.g.
> the TSC_DEADLINE MSR) also lead to #VE. If KVM can emulate the MSR accesses, #VE
> should be fine.
> 
> The problem is that the MTRR CPUID feature bit is fixed to 1 while KVM/QEMU
> don't know how to virtualize MTRRs, especially given that KVM cannot control
> the memory type in secure-EPT entries.

Yes, I partly agree that #VE on MTRR MSRs doesn't by itself break anything.
The #VE is not the problem; the problem is whether the #VE is opt-in or
unconditional.

For the TSC_DEADLINE MSR, the #VE is actually opt-in.
CPUID(1).ECX[24] (TSC_DEADLINE) is configurable by the VMM. Only when the VMM
configures the bit to 1 will the TD guest get #VE. If the VMM configures it
to 0, the TD guest just gets #GP. This is a reasonable design.
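
To make the distinction concrete, here is a minimal sketch. It is a toy model
of the behavior described in this thread, not TDX module or KVM code; the enum,
macros, and function names are all hypothetical and exist only for this
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for this sketch only; not TDX module or KVM identifiers. */
enum td_fault { TD_FAULT_GP, TD_FAULT_VE };

#define SKETCH_CPUID_1_ECX_TSC_DEADLINE (1u << 24) /* CPUID(1).ECX[24] */
#define SKETCH_CPUID_1_EDX_MTRR         (1u << 12) /* CPUID(1).EDX[12] */

/*
 * Opt-in case: the VMM chooses the CPUID bit, and the fault the TD guest
 * sees on a TSC_DEADLINE MSR access follows that choice.
 */
static enum td_fault tsc_deadline_msr_fault(uint32_t vmm_cpuid_1_ecx)
{
	if (vmm_cpuid_1_ecx & SKETCH_CPUID_1_ECX_TSC_DEADLINE)
		return TD_FAULT_VE; /* feature exposed: VMM is asked to emulate */
	return TD_FAULT_GP;         /* feature hidden: architectural #GP */
}

/*
 * Fixed1 case: whatever the VMM requests, the guest sees the MTRR bit set.
 */
static uint32_t mtrr_effective_cpuid_1_edx(uint32_t vmm_cpuid_1_edx)
{
	return vmm_cpuid_1_edx | SKETCH_CPUID_1_EDX_MTRR;
}

/*
 * ...and MTRR MSR accesses get #VE unconditionally; the VMM's CPUID
 * request has no influence on the result.
 */
static enum td_fault mtrr_msr_fault(uint32_t vmm_cpuid_1_edx)
{
	(void)vmm_cpuid_1_edx;
	return TD_FAULT_VE;
}
```

In this model the VMM can avoid the TSC_DEADLINE #VE by clearing the CPUID bit,
but it has no equivalent knob for MTRR.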

>> well, while accessing MCA-related MSRs gets #VE. This is why TDX is going to
>> fix them by introducing a new feature that makes them configurable.)
>>
>>> So that is why I lean towards
>>> returning to userspace and giving the VMM the option to ignore it, return an error to the guest or
>>> show an error to the user.
>>
>> "Show an error to the user" doesn't help at all, because the user cannot fix
>> it, and neither can QEMU.
> 
> The key point isn't who can fix/emulate MTRR MSRs. It is just that KVM doesn't
> know how to handle this situation and asks userspace for help.
> 
> Whether or how userspace can handle the MSR writes isn't KVM's problem. It would
> be better if KVM could tell userspace exactly in which cases KVM will exit to
> userspace, but there is no such infrastructure.
> 
> An example: in the KVM CET series, we found it complex for the KVM instruction
> emulator to emulate control-flow instructions when CET is enabled. The
> suggestion there was also to punt to userspace (w/o any indication to userspace
> that KVM would do this).

Could you point me to the decision on CET? I'm interested in how userspace can
help with that.

>>
>>> If KVM can't support the behavior, better to get an actual error in
>>> userspace than a mysterious guest hang, right?
>> What behavior do you mean?
>>
>>> Outside of what kind of exit it is, do you object to the general plan to punt to userspace?
>>>
>>> Since this is a TDX specific limitation, I guess there is KVM_EXIT_TDX_VMCALL as a general category
>>> of TDVMCALLs that cannot be handled by KVM.
> 
> Using KVM_EXIT_TDX_VMCALL looks fine.
> 
> We need to explain why MTRR MSRs are handled in this way unlike other MSRs.
> 
> It would be better if KVM could tell userspace that MTRR virtualization isn't
> supported by KVM for TDs. Then userspace would have to resolve the conflict
> between KVM and the TDX module on MTRR. But to report MTRR as unsupported, we
> would need to make GET_SUPPORTED_CPUID a vm-scope ioctl. I am not sure if it is
> worth the effort.

My memory is that Sean disliked the vm-scope GET_SUPPORTED_CPUID for TDX
when he was at Intel.

Anyway, we can provide a TDX-specific interface to report SUPPORTED_CPUID
in KVM_TDX_CAPABILITIES, if we really need it.

> 
>>
>> I just don't see any difference between handling it in KVM and handling it in
>> userspace: either a) return an error to the guest or b) ignore the WRMSR.
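
The options discussed in this thread can be sketched as a small decision
function. This is only a toy model under stated assumptions: every enum and
function name is hypothetical, and nothing here corresponds to actual KVM or
QEMU code. It treats an exit to userspace as simply delegating the choice to
the VMM.

```c
#include <assert.h>

/* All names below are hypothetical, for illustration only. */
enum sketch_outcome {
	SKETCH_GUEST_GETS_ERROR, /* a) return an error to the guest */
	SKETCH_WRITE_IGNORED,    /* b) ignore the WRMSR and report success */
	SKETCH_VM_STOPPED,       /* VMM shows an error to the user and stops */
};

enum sketch_kvm_policy {
	SKETCH_KVM_RETURN_ERROR,
	SKETCH_KVM_IGNORE,
	SKETCH_KVM_EXIT_TO_USERSPACE,
};

/*
 * Outcome of a TD's WRMSR to an MTRR MSR. When KVM handles the access
 * itself, only a) or b) is possible. When KVM punts to userspace, the
 * outcome is whatever the VMM picks (userspace_choice).
 */
static enum sketch_outcome
sketch_mtrr_wrmsr_outcome(enum sketch_kvm_policy kvm,
			  enum sketch_outcome userspace_choice)
{
	switch (kvm) {
	case SKETCH_KVM_RETURN_ERROR:
		return SKETCH_GUEST_GETS_ERROR;
	case SKETCH_KVM_IGNORE:
		return SKETCH_WRITE_IGNORED;
	case SKETCH_KVM_EXIT_TO_USERSPACE:
	default:
		return userspace_choice;
	}
}
```

In this model the only outcome that an exit to userspace adds over in-kernel
handling is stopping the VM with an error shown to the user.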