Message-ID: <71e6fa91-065c-4b28-ac99-fa71dfd499b9@linux.intel.com>
Date: Fri, 29 Mar 2024 15:25:47 +0800
From: Binbin Wu <binbin.wu@...ux.intel.com>
To: Isaku Yamahata <isaku.yamahata@...el.com>
Cc: "Huang, Kai" <kai.huang@...el.com>, "Zhang, Tina" <tina.zhang@...el.com>,
"isaku.yamahata@...ux.intel.com" <isaku.yamahata@...ux.intel.com>,
"seanjc@...gle.com" <seanjc@...gle.com>, "Yuan, Hang" <hang.yuan@...el.com>,
"sean.j.christopherson@...el.com" <sean.j.christopherson@...el.com>,
"Chen, Bo2" <chen.bo@...el.com>, "sagis@...gle.com" <sagis@...gle.com>,
"isaku.yamahata@...il.com" <isaku.yamahata@...il.com>,
"Aktas, Erdem" <erdemaktas@...gle.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"pbonzini@...hat.com" <pbonzini@...hat.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure
On 3/29/2024 4:39 AM, Isaku Yamahata wrote:
[...]
>>>>> How about this?
>>>>>
>>>>> /*
>>>>> * We need three SEAMCALLs, TDH.MNG.VPFLUSHDONE(), TDH.PHYMEM.CACHE.WB(), and
>>>>> * TDH.MNG.KEY.FREEID() to free the HKID.
>>>>> * Other threads can remove pages from the TD. While the HKID is assigned, we
>>>>> * need to use TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE();
>>>>> * TDH.PHYMEM.PAGE.RECLAIM() is needed once the HKID is freed. Take the lock
>>>>> * so that no transient state of the HKID is exposed.
>>>>> */
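For illustration only, here is a minimal, self-contained sketch of the
locking idea described in the comment above. Every name below is a
placeholder invented for this sketch (user-space stubs, pthread lock), not
the actual KVM/TDX code:

#include <pthread.h>

struct tdx_vm {
        pthread_mutex_t hkid_lock;
        int hkid;                       /* -1 once the HKID has been freed */
};

/* Stubs standing in for the three SEAMCALL wrappers. */
static void tdh_mng_vpflushdone(struct tdx_vm *vm) { (void)vm; }
static void tdh_phymem_cache_wb(void) { }
static void tdh_mng_key_freeid(struct tdx_vm *vm) { (void)vm; }

static void tdx_release_hkid_sketch(struct tdx_vm *vm)
{
        /*
         * Hold the lock across all three SEAMCALLs so that other threads
         * never observe a transient, half-freed HKID state.
         */
        pthread_mutex_lock(&vm->hkid_lock);
        tdh_mng_vpflushdone(vm);        /* all vcpu state has been flushed  */
        tdh_phymem_cache_wb();          /* write back/invalidate the caches */
        tdh_mng_key_freeid(vm);         /* finally release the HKID         */
        vm->hkid = -1;
        pthread_mutex_unlock(&vm->hkid_lock);
}

A thread that checks vm->hkid under the same lock then sees the HKID either
fully assigned or fully freed, never an in-between state.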
>>>> Could you elaborate on why it is still possible to have other threads removing
>>>> pages from the TD?
>>>>
>>>> I am probably missing something, but the thing I don't understand is why
>>>> this function is triggered by MMU release. None of the things done in this
>>>> function seem to be related to the MMU at all.
>>> KVM releases EPT pages on MMU notifier release; kvm_mmu_zap_all() does that. If
>>> we follow that path, kvm_mmu_zap_all() zaps all the Secure-EPT pages with
>>> TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE(). Because
>>> TDH.MEM.{SEPT, PAGE}.REMOVE() is slow, we free the HKID before kvm_mmu_zap_all()
>>> so that we can use TDH.PHYMEM.PAGE.RECLAIM() instead.
>> Can you elaborate why TDH.MEM.{SEPT,PAGE}.REMOVE is slower than
>> TDH.PHYMEM.PAGE.RECLAIM()?
>>
>> And does the difference matter in practice, i.e., did you see the former cause
>> a noticeable performance degradation?
> Yes. With the HKID still assigned, we have to assume that vcpus can still run,
> which means TLB shootdowns. The difference is 2 extra SEAMCALLs + IPI
> synchronization for each guest private page. If the guest has hundreds of GB of
> memory, the difference can be tens of minutes.
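(For a rough sense of scale, purely as a back-of-the-envelope estimate on my
side: 512 GB of guest memory at 4 KB per private page is ~128 million pages,
so even ~10 us of extra SEAMCALL + IPI-sync cost per page, which is only an
assumed figure, already adds up to roughly 20 minutes, matching the "tens of
minutes" above.)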
>
> With the HKID alive, we need to assume vcpus are alive.
> - TDH.MEM.PAGE.REMOVE()
> - TDH.PHYMEM.PAGE.WBINVD()
> - TLB shoot down
>   - TDH.MEM.TRACK()
>   - IPI to other vcpus
>   - wait for other vcpus to exit
Do we have a way to batch the TLB shootdowns?
IIUC, in the current implementation, a TLB shootdown needs to be done for
each page removal, right?
>
> After freeing HKID
> - TDH.PHYMEM.PAGE.RECLAIM()
> We already flushed TLBs and memory cache.
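To make the per-page cost difference concrete, here is a rough,
self-contained sketch of the two paths described above. Every identifier is
a placeholder of mine (stubbed out so the snippet compiles on its own), not
the real SEAMCALL wrappers:

/* Placeholder type and stubs, standing in for the real wrappers. */
struct tdx_vm;
static void tdh_mem_page_remove(struct tdx_vm *vm, unsigned long gfn) { (void)vm; (void)gfn; }
static void tdh_phymem_page_wbinvd(unsigned long gfn) { (void)gfn; }
static void tdh_mem_track(struct tdx_vm *vm) { (void)vm; }
static void kick_all_vcpus_and_wait(struct tdx_vm *vm) { (void)vm; }
static void tdh_phymem_page_reclaim(struct tdx_vm *vm, unsigned long gfn) { (void)vm; (void)gfn; }

/* Path 1: HKID still assigned -- 2 extra SEAMCALLs + IPI sync per page. */
static void remove_private_page_with_hkid(struct tdx_vm *vm, unsigned long gfn)
{
        tdh_mem_page_remove(vm, gfn);   /* unmap from the Secure-EPT            */
        tdh_phymem_page_wbinvd(gfn);    /* extra SEAMCALL #1                    */
        tdh_mem_track(vm);              /* extra SEAMCALL #2: advance TLB epoch */
        kick_all_vcpus_and_wait(vm);    /* IPI to vcpus + wait for their exits  */
}

/* Path 2: HKID already freed -- a single SEAMCALL per page. */
static void reclaim_page_without_hkid(struct tdx_vm *vm, unsigned long gfn)
{
        /* TLBs and caches were already flushed when the HKID was freed. */
        tdh_phymem_page_reclaim(vm, gfn);
}

The IPI round trip in path 1 is what dominates when it is repeated for every
guest private page; path 2 has none of that.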
>
>
>>>> Freeing vcpus is done in
>>>> kvm_arch_destroy_vm(), which is _after_ mmu_notifier->release(), in which
>>>> this tdx_mmu_release_keyid() is called?
>>> guest memfd complicates things. The race is between guest memfd release and mmu
>>> notifier release. kvm_arch_destroy_vm() is called after closing all kvm fds
>>> including guest memfd.
>>>
>>> Here is an example. Let's say we have fds for vhost, guest_memfd, kvm vcpu,
>>> and kvm vm, and the process is exiting. Please note that vhost takes a
>>> reference on the mm in order to access guest (shared) memory.
>>>
>>> exit_mmap():
>>> Usually the mmu notifier release fires here, but not yet because of vhost.
>>>
>>> exit_files()
>>> Close the vhost fd. vhost starts a timer to issue mmput().
>> Why does it need to start a timer to issue mmput(), but not call mmput()
>> directly?
> That's how vhost implements it; it's out of KVM's control. Other components, or
> user space in other threads, can take references to the mm or to the fds and
> keep/release them as they like.
>
>
>>> Close guest_memfd. kvm_gmem_release() calls kvm_mmu_unmap_gfn_range(), which
>>> eventually calls TDH.MEM.SEPT.REMOVE() and TDH.MEM.PAGE.REMOVE(). This takes
>>> time because it processes the whole guest memory. Finally it calls
>>> kvm_put_kvm().
>>>
>>> While the unmapping on behalf of guest_memfd is in progress, the vhost timer
>>> fires and calls mmput(), which triggers the mmu notifier release.
>>>
>>> Close the kvm vcpu/vm fds; they call kvm_put_kvm(). The last reference
>>> triggers kvm_destroy_vm().
>>>
>>> It's ideal to free the HKID first for efficiency, but KVM has no control over
>>> the order in which the fds are closed.
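To illustrate why KVM cannot rely on the close order, here is a tiny,
self-contained toy model of the reference counting involved. Nothing below
is real KVM code; it only shows that the teardown runs in whichever context
happens to drop the last reference:

#include <stdatomic.h>

struct vm_ref {
        atomic_int refcount;    /* held by vcpu fds, vm fd, guest_memfd, vhost, ... */
};

static void vm_teardown(struct vm_ref *vm)
{
        /* Runs in whichever context dropped the last reference. */
        (void)vm;
}

static void vm_put(struct vm_ref *vm)
{
        if (atomic_fetch_sub(&vm->refcount, 1) == 1)
                vm_teardown(vm);
}

Since user space (and deferred work such as vhost's delayed mmput()) decides
when each holder drops its reference, KVM cannot ensure the "free the HKID
first" ordering from fd close order alone.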
>> Firstly, what kinda performance efficiency gain are we talking about?
> 2 extra SEAMCALLs + an IPI sync for each guest private page. If the guest memory
> is hundreds of GB, the difference can be tens of minutes.
>
>
>> We cannot really tell whether using two different methods to tear down SEPT
>> pages can be justified because of this.
>>
>> Even if it's worth doing, it is an optimization, which can/should be done later,
>> after you have put all the building blocks together.
>>
>> That being said, you are putting too much logic in this patch, i.e., it just
>> doesn't make sense to release the TDX keyID in the MMU code path in _this_ patch.
> I agree that this patch is too huge, and that we should break it into smaller
> patches.
>
>
>>>> But here we are depending on vcpus being freed before tdx_mmu_release_hkid()?
>>> Not necessarily.
>> I am wondering when TDH.VP.FLUSH is done. Supposedly it should be called when
>> we free vcpus? But again this means you need to call TDH.MNG.VPFLUSHDONE
>> _after_ freeing vcpus, and that looks conflicting if you make
>> tdx_mmu_release_keyid() be called from the MMU notifier.
> tdx_mmu_release_keyid() calls it explicitly for all vcpus.
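A minimal sketch of what that could look like, again with placeholder names
and a toy vcpu array rather than the real KVM structures:

/* Placeholder types and stubs, not the real KVM/TDX code. */
struct tdx_vcpu { int id; };
struct tdx_td {
        int nr_vcpus;
        struct tdx_vcpu *vcpus;
};
static void tdh_vp_flush(struct tdx_vcpu *vcpu) { (void)vcpu; }
static void tdh_mng_vpflushdone(struct tdx_td *td) { (void)td; }

static void tdx_flush_all_vcpus_sketch(struct tdx_td *td)
{
        /* Flush every vcpu's TD state, then confirm with TDH.MNG.VPFLUSHDONE(). */
        for (int i = 0; i < td->nr_vcpus; i++)
                tdh_vp_flush(&td->vcpus[i]);

        tdh_mng_vpflushdone(td);
}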