linux-kernel - Re: [PATCH v19 037/130] KVM: TDX: Make KVM_CAP_MAX

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5bb2d7fc-cfe9-4abd-a291-7ad56db234b3@intel.com>
Date: Tue, 18 Jun 2024 11:44:07 +1200
From: "Huang, Kai" <kai.huang@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Tina Zhang <tina.zhang@...el.com>, Hang Yuan <hang.yuan@...el.com>,
	"pbonzini@...hat.com" <pbonzini@...hat.com>, Bo2 Chen <chen.bo@...el.com>,
	"sagis@...gle.com" <sagis@...gle.com>, "isaku.yamahata@...ux.intel.com"
	<isaku.yamahata@...ux.intel.com>, Erdem Aktas <erdemaktas@...gle.com>,
	"isaku.yamahata@...il.com" <isaku.yamahata@...il.com>, "kvm@...r.kernel.org"
	<kvm@...r.kernel.org>, Isaku Yamahata <isaku.yamahata@...el.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v19 037/130] KVM: TDX: Make KVM_CAP_MAX_VCPUS backend
 specific



On 15/06/2024 12:04 pm, Sean Christopherson wrote:
> On Fri, Jun 14, 2024, Kai Huang wrote:
>> On Tue, 2024-06-04 at 10:48 +0000, Huang, Kai wrote:
>>> On Thu, 2024-05-30 at 16:12 -0700, Sean Christopherson wrote:
>>>> On Thu, May 30, 2024, Kai Huang wrote:
>>>>> On Wed, 2024-05-29 at 16:15 -0700, Sean Christopherson wrote:
>>>>>> In the unlikely event there is a legitimate reason for max_vcpus_per_td being
>>>>>> less than KVM's minimum, then we can update KVM's minimum as needed.  But AFAICT,
>>>>>> that's purely theoretical at this point, i.e. this is all much ado about nothing.
>>>>>
>>>>> I am afraid we already have a legitimate case: TD partitioning.  Isaku
>>>>> told me the 'max_vcpus_per_td' is lowed to 512 for the modules with TD
>>>>> partitioning supported.  And again this is static, i.e., doesn't require
>>>>> TD partitioning to be opt-in to low to 512.
>>>>
>>>> So what's Intel's plan for use cases that creates TDs with >512 vCPUs?
>>>
>>> I checked with TDX module guys.  Turns out the 'max_vcpus_per_td' wasn't
>>> introduced because of TD partitioning, and they are not actually related.
>>>
>>> They introduced this to support "topology virtualization", which requires
>>> a table to record the X2APIC IDs for all vcpus for each TD.  In practice,
>>> given a TDX module, the 'max_vcpus_per_td', a.k.a, the X2APIC ID table
>>> size reflects the physical logical cpus that *ALL* platforms that the
>>> module supports can possibly have.
>>>
>>> The reason of this design is TDX guys don't believe there's sense in
>>> supporting the case where the 'max_vcpus' for one single TD needs to
>>> exceed the physical logical cpus.
>>>
>>> So in short:
>>>
>>> - The "max_vcpus_per_td" can be different depending on module versions. In
>>> practice it reflects the maximum physical logical cpus that all the
>>> platforms (that the module supports) can possibly have.
>>>
>>> - Before CSPs deploy/migrate TD on a TDX machine, they must be aware of
>>> the "max_vcpus_per_td" the module supports, and only deploy/migrate TD to
>>> it when it can support.
>>>
>>> - For TDX 1.5.xx modules, the value is 576 (the previous number 512 isn't
>>> correct); For TDX 2.0.xx modules, the value is larger (>1000).  For future
>>> module versions, it could have a smaller number, depending on what
>>> platforms that module needs to support.  Also, if TDX ever gets supported
>>> on client platforms, we can image the number could be much smaller due to
>>> the "vcpus per td no need to exceed physical logical cpus".
>>>
>>> We may ask them to support the case where 'max_vcpus' for single TD
>>> exceeds the physical logical cpus, or at least not to low down the value
>>> any further for future modules (> 2.0.xx modules).  We may also ask them
>>> to give promise to not low the number to below some certain value for any
>>> future modules.  But I am not sure there's any concrete reason to do so?
>>>
>>> What's your thinking?
> 
> It's a reasonable restriction, e.g. KVM_CAP_NR_VCPUS is already capped at number
> of online CPUs, although userspace is obviously allowed to create oversubscribed
> VMs.
> 
> I think the sane thing to do is document that TDX VMs are restricted to the number
> of logical CPUs in the system, have KVM_CAP_MAX_VCPUS enumerate exactly that, and
> then sanity check that max_vcpus_per_td is greater than or equal to what KVM
> reports for KVM_CAP_MAX_VCPUS. >
> Stating that the maximum number of vCPUs depends on the whims TDX module doesn't
> provide a predictable ABI for KVM, i.e. I don't want to simply forward TDX's
> max_vcpus_per_td to userspace.


This sounds good to me.  I think it should be also OK for client too, if 
TDX ever gets supported for client.

IIUC we can consult the @nr_cpu_ids or num_possible_cpus() to get the 
"number of logical CPUs in the system".  And we can reject to use the 
TDX module if 'max_vcpus_per_td' turns to be smaller.

I think the relevant question is is whether we should still report 
"number of logical CPUs in the system" via KVM_CAP_MAX_VCPUS?  Because 
if doing so, this still means the userspace will need to check 
KVM_CAP_MAX_VCPUS vm extention on per-vm basis.

And if it does, then from userspace's perspective, it actually doesn't 
matter whether underneath the per-vm KVM_CAP_MAX_VCPUS is limited by TDX 
or the system cpus (also see below).

The userspace cannot tell the difference anyway.  It just needs to 
change to query KVM_CAP_MAX_VCPUS to per-vm basis.

Or, we could limit this to TDX guest ONLY:

The KVM_CAP_MAX_VCPUS is still global.  However for TDX specifically, 
the userspace should use other way to query the number of LPs the system 
supports (I assume there should be existing ABI for this?).

But looks this isn't something nice?