Message-ID: <f1097d63-2266-4e2d-996a-a387f497b4c7@intel.com>
Date: Thu, 18 Apr 2024 13:09:31 +1200
From: "Huang, Kai" <kai.huang@...el.com>
To: Isaku Yamahata <isaku.yamahata@...el.com>
CC: Sean Christopherson <seanjc@...gle.com>, "Chatre, Reinette"
<reinette.chatre@...el.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"isaku.yamahata@...il.com" <isaku.yamahata@...il.com>, Paolo Bonzini
<pbonzini@...hat.com>, "Aktas, Erdem" <erdemaktas@...gle.com>, Sagi Shahar
<sagis@...gle.com>, "Chen, Bo2" <chen.bo@...el.com>, "Yuan, Hang"
<hang.yuan@...el.com>, "Zhang, Tina" <tina.zhang@...el.com>,
"isaku.yamahata@...ux.intel.com" <isaku.yamahata@...ux.intel.com>
Subject: Re: [PATCH v19 087/130] KVM: TDX: handle vcpu migration over logical
processor
On 17/04/2024 4:44 am, Isaku Yamahata wrote:
> On Tue, Apr 16, 2024 at 12:05:31PM +1200,
> "Huang, Kai" <kai.huang@...el.com> wrote:
>
>>
>>
>> On 16/04/2024 10:48 am, Yamahata, Isaku wrote:
>>> On Mon, Apr 15, 2024 at 06:49:35AM -0700,
>>> Sean Christopherson <seanjc@...gle.com> wrote:
>>>
>>>> On Fri, Apr 12, 2024, Isaku Yamahata wrote:
>>>>> On Fri, Apr 12, 2024 at 03:46:05PM -0700,
>>>>> Sean Christopherson <seanjc@...gle.com> wrote:
>>>>>
>>>>>> On Fri, Apr 12, 2024, Isaku Yamahata wrote:
>>>>>>> On Fri, Apr 12, 2024 at 09:15:29AM -0700, Reinette Chatre <reinette.chatre@...el.com> wrote:
>>>>>>>>> +void tdx_mmu_release_hkid(struct kvm *kvm)
>>>>>>>>> +{
>>>>>>>>> + while (__tdx_mmu_release_hkid(kvm) == -EBUSY)
>>>>>>>>> + ;
>>>>>>>>> }
>>>>>>>>
>>>>>>>> As I understand, __tdx_mmu_release_hkid() returns -EBUSY
>>>>>>>> after TDH.VP.FLUSH has been sent for every vCPU followed by
>>>>>>>> TDH.MNG.VPFLUSHDONE, which returns TDX_FLUSHVP_NOT_DONE.
>>>>>>>>
>>>>>>>> Considering earlier comment that a retry of TDH.VP.FLUSH is not
>>>>>>>> needed, why is this while() loop here that sends the
>>>>>>>> TDH.VP.FLUSH again to all vCPUs instead of just a loop within
>>>>>>>> __tdx_mmu_release_hkid() to _just_ resend TDH.MNG.VPFLUSHDONE?
>>>>>>>>
>>>>>>>> Could it be possible for a vCPU to appear during this time, thus
>>>>>>>> be missed in one TDH.VP.FLUSH cycle, to require a new cycle of
>>>>>>>> TDH.VP.FLUSH?
>>>>>>>
>>>>>>> Yes. There is a race between closing the KVM vCPU fd and the MMU notifier release hook.
>>>>>>> When the KVM vCPU fd is closed, the vCPU context can be loaded again.
>>>>>>
>>>>>> But why is _loading_ a vCPU context problematic?
>>>>>
>>>>> Nothing is problematic per se. It just becomes a bit harder to understand why
>>>>> tdx_mmu_release_hkid() issues an IPI on each loop iteration. I think it's reasonable
>>>>> to keep the normal path simple and to complicate/penalize the destruction path.
>>>>> I probably should have added a comment on the function.
>>>>
>>>> By "problematic", I meant, why can that result in a "missed in one TDH.VP.FLUSH
>>>> cycle"? AFAICT, loading a vCPU shouldn't cause that vCPU to be associated from
>>>> the TDX module's perspective, and thus shouldn't trigger TDX_FLUSHVP_NOT_DONE.
>>>>
>>>> I.e. looping should be unnecessary, no?
>>>
>>> The loop is unnecessary with the current code.
>>>
>>> The possible future optimization is to reduce the destruction time of the Secure-EPT
>>> somehow. One possible option is to release the HKID while vCPUs are still alive and
>>> destroy the Secure-EPT with multiple vCPU contexts. Because that's a future
>>> optimization, we can ignore it at this phase.
>>
>> I'm kind of lost here.
>>
>> I thought in the current v19 code, you have already implemented this
>> optimization?
>>
>> Or is this optimization totally different from what we discussed in an
>> earlier patch?
>>
>> https://lore.kernel.org/lkml/8feaba8f8ef249950b629f3a8300ddfb4fbcf11c.camel@intel.com/
>
> That's only the first step. We can optimize it further with multiple vCPU
> contexts.
OK. Let's put aside how important these optimizations are and whether
they should be done in the initial TDX support; I think the right way to
organize the patches is to bring in the functionality first and add the
performance optimizations later.

That makes both writing code and code review easier. And, more
importantly, the "performance optimization" can then be discussed
_separately_.
For example, as mentioned in the link above, I think the optimization of
"releasing the HKID in the MMU notifier release hook to improve TD
teardown latency" complicates things a lot, not only in the TD
creation/teardown sequence but also here -- without it we wouldn't even
need to consider the race between vCPU load and MMU notifier release:

https://lore.kernel.org/kvm/20240412214201.GO3039520@ls.amr.corp.intel.com/
So I think we should start by implementing these sequences the "normal
way" first, and then do the optimization(s) later.

To me, the "normal way" for TD creation/destruction is just:

1) Use the normal SEPT sequence to tear down the private EPT page table;
2) Do TDH.VP.FLUSH when a vCPU is destroyed;
3) Do TDH.MNG.VPFLUSHDONE after all vCPUs are destroyed;
4) Release the HKID at the last stage of destroying the VM.
For vCPU migration, you do TDH.VP.FLUSH on the old pCPU before loading
the vCPU on the new pCPU, as shown in this patch.
Then you don't need the silly code change around tdx_mmu_release_hkid()
in this patch.
Am I missing anything?