Message-ID: <50e09676-4dfc-473f-8b34-7f7a98ab5228@intel.com>
Date: Tue, 14 May 2024 14:01:12 +1200
From: "Huang, Kai" <kai.huang@...el.com>
To: Sean Christopherson <seanjc@...gle.com>, Isaku Yamahata
<isaku.yamahata@...el.com>
CC: "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"isaku.yamahata@...il.com" <isaku.yamahata@...il.com>, Paolo Bonzini
<pbonzini@...hat.com>, Erdem Aktas <erdemaktas@...gle.com>, Sagi Shahar
<sagis@...gle.com>, Bo2 Chen <chen.bo@...el.com>, Hang Yuan
<hang.yuan@...el.com>, Tina Zhang <tina.zhang@...el.com>,
<isaku.yamahata@...ux.intel.com>
Subject: Re: [PATCH v19 037/130] KVM: TDX: Make KVM_CAP_MAX_VCPUS backend
specific
On 11/05/2024 2:04 am, Sean Christopherson wrote:
> On Thu, May 09, 2024, Isaku Yamahata wrote:
>> On Fri, May 10, 2024 at 11:19:44AM +1200, Kai Huang <kai.huang@...el.com> wrote:
>>> On 10/05/2024 10:52 am, Sean Christopherson wrote:
>>>> On Fri, May 10, 2024, Kai Huang wrote:
>>>>> On 10/05/2024 4:35 am, Sean Christopherson wrote:
>>>>>> KVM x86 limits KVM_MAX_VCPUS to 4096:
>>>>>>
>>>>>> config KVM_MAX_NR_VCPUS
>>>>>> 	int "Maximum number of vCPUs per KVM guest"
>>>>>> 	depends on KVM
>>>>>> 	range 1024 4096
>>>>>> 	default 4096 if MAXSMP
>>>>>> 	default 1024
>>>>>> 	help
>>>>>>
>>>>>> whereas the limitation from TDX is apparently simply due to TD_PARAMS taking
>>>>>> a 16-bit unsigned value:
>>>>>>
>>>>>> #define TDX_MAX_VCPUS (~(u16)0)
>>>>>>
>>>>>> i.e. it will likely be _years_ before TDX's limitation matters, if it ever does.
>>>>>> And _if_ it becomes a problem, we don't necessarily need to have a different
>>>>>> _runtime_ limit for TDX, e.g. TDX support could be conditioned on KVM_MAX_NR_VCPUS
>>>>>> being <= 64k.
>>>>>
>>>>> Actually, in later versions of the TDX module (starting from 1.5 AFAICT),
>>>>> the module has a metadata field to report the maximum number of vCPUs that
>>>>> it can support for all TDX guests.
>>>>
>>>> My quick glance at the 1.5 source shows that the limit is still effectively
>>>> 0xffff, so again, who cares?  Assert on 0xffff at compile time and on the
>>>> reported max at runtime, and simply refuse to use a TDX module that has
>>>> dropped the minimum below 0xffff.
>>>
>>> I need to double check why this metadata field was added.  My concern is that
>>> in future module versions they may just lower the value.
>>
>> TD partitioning would reduce it considerably.
>
> That's still not a reason to plumb in what is effectively dead code.  Either
> partitioning is opt-in, at which point I suspect KVM will need yet more uAPI to
> express the limitations to userspace, or the TDX module is potentially breaking
> existing use cases.
The 'max_vcpus_per_td' global metadata field is static for a given TDX
module.  If the module supports TD partitioning, it just reports a
smaller value, regardless of whether we opt in to TD partitioning or
not.

I think the point is that 'max_vcpus_per_td' is a TDX architectural
thing, and the kernel should not make any assumption about its value.
The architectural behaviour is:

  If the module reports 'max_vcpus_per_td', software should read and
  use it; otherwise software should treat the limit as U16_MAX.
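A minimal sketch of that behaviour (the metadata-read helper and the
field ID below are made-up placeholders, not the real TDX module
interface):

	/*
	 * Hypothetical helper: read the global metadata field if the
	 * module has it, otherwise fall back to the architectural
	 * default of U16_MAX.
	 */
	static u16 tdx_get_max_vcpus_per_td(void)
	{
		u64 val;

		/* Field absent in older modules -> architectural default. */
		if (tdx_read_global_metadata(MD_FIELD_ID_MAX_VCPUS_PER_TD, &val))
			return U16_MAX;

		return (u16)val;
	}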
Thus I don't think we will need a new uAPI (presumably TDX-specific)
just for TD partitioning, and this doesn't break existing use cases.
In fact, this doesn't prevent us from making the KVM_CAP_MAX_VCPUS code
generic.  E.g., we can do the below (see the sketch after step 2):
1) In tdx_vm_init() (called via KVM_CREATE_VM -> vt_vm_init()), we do:

	kvm->max_vcpus = min(kvm->max_vcpus, tdx_info->max_vcpus_per_td);
2) In kvm_vm_ioctl_enable_cap_generic(), we add support for
   KVM_CAP_MAX_VCPUS and have the generic code do:

	if (new_max_vcpus > kvm->max_vcpus)
		return -EINVAL;

	kvm->max_vcpus = new_max_vcpus;
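Putting the two together, a rough sketch (the function and field names
follow this series, but the exact code is illustrative only):

	/* 1) Clamp the per-VM default at VM creation. */
	static int tdx_vm_init(struct kvm *kvm)
	{
		kvm->max_vcpus = min(kvm->max_vcpus,
				     tdx_info->max_vcpus_per_td);
		return 0;
	}

	/* 2) In kvm_vm_ioctl_enable_cap_generic(): */
	case KVM_CAP_MAX_VCPUS: {
		int r = -EINVAL;

		mutex_lock(&kvm->lock);
		/* Only allow lowering the current limit. */
		if (cap->args[0] && cap->args[0] <= kvm->max_vcpus) {
			kvm->max_vcpus = cap->args[0];
			r = 0;
		}
		mutex_unlock(&kvm->lock);
		return r;
	}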
However, this means we only allow "lowering" kvm->max_vcpus via
kvm_vm_ioctl_enable_cap_generic(KVM_CAP_MAX_VCPUS), but I think this is
acceptable?
If that is a concern, alternatively we can add a new
'kvm->hard_max_vcpus' (or whatever makes sense) and set it in
kvm_create_vm() right after kvm_arch_init_vm():
	r = kvm_arch_init_vm(kvm, type);
	if (r)
		goto out_err_no_arch_destroy_vm;

	kvm->hard_max_vcpus = kvm->max_vcpus;
So it always contains "the max_vcpus limited by the ARCH
hardware/firmware etc".

And in kvm_vm_ioctl_enable_cap_generic(), we check against
kvm->hard_max_vcpus instead, to get rid of the limitation of only
allowing kvm->max_vcpus to be lowered.
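E.g., the generic handling would then become something like (again,
sketch only):

	case KVM_CAP_MAX_VCPUS: {
		int r = -EINVAL;

		mutex_lock(&kvm->lock);
		/*
		 * Check against the ARCH-imposed hard limit, so userspace
		 * can both lower and raise kvm->max_vcpus within it.
		 */
		if (cap->args[0] && cap->args[0] <= kvm->hard_max_vcpus) {
			kvm->max_vcpus = cap->args[0];
			r = 0;
		}
		mutex_unlock(&kvm->lock);
		return r;
	}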
But I don't think this is necessary at this stage.