Message-ID: <71e6fa91-065c-4b28-ac99-fa71dfd499b9@linux.intel.com>
Date: Fri, 29 Mar 2024 15:25:47 +0800
From: Binbin Wu <binbin.wu@...ux.intel.com>
To: Isaku Yamahata <isaku.yamahata@...el.com>
Cc: "Huang, Kai" <kai.huang@...el.com>, "Zhang, Tina" <tina.zhang@...el.com>,
"isaku.yamahata@...ux.intel.com" <isaku.yamahata@...ux.intel.com>,
"seanjc@...gle.com" <seanjc@...gle.com>, "Yuan, Hang" <hang.yuan@...el.com>,
"sean.j.christopherson@...el.com" <sean.j.christopherson@...el.com>,
"Chen, Bo2" <chen.bo@...el.com>, "sagis@...gle.com" <sagis@...gle.com>,
"isaku.yamahata@...il.com" <isaku.yamahata@...il.com>,
"Aktas, Erdem" <erdemaktas@...gle.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"pbonzini@...hat.com" <pbonzini@...hat.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure
On 3/29/2024 4:39 AM, Isaku Yamahata wrote:
[...]
>>>>> How about this?
>>>>>
>>>>> /*
>>>>> * We need three SEAMCALLs, TDH.MNG.VPFLUSHDONE(), TDH.PHYMEM.CACHE.WB(), and
>>>>> * TDH.MNG.KEY.FREEID() to free the HKID.
>>>>> * Other threads can remove pages from the TD. While the HKID is assigned, we
>>>>> * need to use TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE();
>>>>> * TDH.PHYMEM.PAGE.RECLAIM() is needed once the HKID is freed. Take the lock
>>>>> * so that no transient state of the HKID is exposed.
>>>>> */
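For illustration only, here is a minimal, self-contained sketch of the
locking idea described in the comment above. Every name below is a
placeholder invented for this sketch (user-space stubs, pthread lock), not
the actual KVM/TDX code:

#include <pthread.h>

struct tdx_vm {
        pthread_mutex_t hkid_lock;
        int hkid;                       /* -1 once the HKID has been freed */
};

/* Stubs standing in for the three SEAMCALL wrappers. */
static void tdh_mng_vpflushdone(struct tdx_vm *vm) { (void)vm; }
static void tdh_phymem_cache_wb(void) { }
static void tdh_mng_key_freeid(struct tdx_vm *vm) { (void)vm; }

static void tdx_release_hkid_sketch(struct tdx_vm *vm)
{
        /*
         * Hold the lock across all three SEAMCALLs so that other threads
         * never observe a transient, half-freed HKID state.
         */
        pthread_mutex_lock(&vm->hkid_lock);
        tdh_mng_vpflushdone(vm);        /* all vcpu state has been flushed  */
        tdh_phymem_cache_wb();          /* write back/invalidate the caches */
        tdh_mng_key_freeid(vm);         /* finally release the HKID         */
        vm->hkid = -1;
        pthread_mutex_unlock(&vm->hkid_lock);
}

A thread that checks vm->hkid under the same lock then sees the HKID either
fully assigned or fully freed, never an in-between state.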
>>>> Could you elaborate on why it is still possible to have other threads removing
>>>> pages from the TD?
>>>>
>>>> I am probably missing something, but the thing I don't understand is why
>>>> this function is triggered by MMU release. None of the things done in this
>>>> function seem to be related to the MMU at all.
>>> KVM releases EPT pages on MMU notifier release; kvm_mmu_zap_all() does that. If
>>> we follow that path, kvm_mmu_zap_all() zaps all the Secure-EPT pages with
>>> TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE(). Because
>>> TDH.MEM.{SEPT, PAGE}.REMOVE() is slow, we free the HKID before kvm_mmu_zap_all()
>>> so that we can use TDH.PHYMEM.PAGE.RECLAIM() instead.
>> Can you elaborate why TDH.MEM.{SEPT,PAGE}.REMOVE is slower than
>> TDH.PHYMEM.PAGE.RECLAIM()?
>>
>> And does the difference matter in practice, i.e., did you see the former cause
>> a noticeable performance degradation?
> Yes. With the HKID still assigned, we have to assume that vcpus can still run,
> which means TLB shootdowns. The difference is 2 extra SEAMCALLs + IPI
> synchronization for each guest private page. If the guest has hundreds of GB of
> memory, the difference can be tens of minutes.
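(For a rough sense of scale, purely as a back-of-the-envelope estimate on my
side: 512 GB of guest memory at 4 KB per private page is ~128 million pages,
so even ~10 us of extra SEAMCALL + IPI-sync cost per page, which is only an
assumed figure, already adds up to roughly 20 minutes, matching the "tens of
minutes" above.)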
>
> With the HKID alive, we need to assume vcpus are alive.
> - TDH.MEM.PAGE.REMOVE()
> - TDH.PHYMEM.PAGE.WBINVD()
> - TLB shoot down
>   - TDH.MEM.TRACK()
>   - IPI to other vcpus
>   - wait for other vcpus to exit
Do we have a way to batch the TLB shootdowns?
IIUC, in the current implementation, a TLB shootdown needs to be done for
each page removal, right?
>
> After freeing HKID
> - TDH.PHYMEM.PAGE.RECLAIM()
> We already flushed TLBs and memory cache.
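To make the per-page cost difference concrete, here is a rough,
self-contained sketch of the two paths described above. Every identifier is
a placeholder of mine (stubbed out so the snippet compiles on its own), not
the real SEAMCALL wrappers:

/* Placeholder type and stubs, standing in for the real wrappers. */
struct tdx_vm;
static void tdh_mem_page_remove(struct tdx_vm *vm, unsigned long gfn) { (void)vm; (void)gfn; }
static void tdh_phymem_page_wbinvd(unsigned long gfn) { (void)gfn; }
static void tdh_mem_track(struct tdx_vm *vm) { (void)vm; }
static void kick_all_vcpus_and_wait(struct tdx_vm *vm) { (void)vm; }
static void tdh_phymem_page_reclaim(struct tdx_vm *vm, unsigned long gfn) { (void)vm; (void)gfn; }

/* Path 1: HKID still assigned -- 2 extra SEAMCALLs + IPI sync per page. */
static void remove_private_page_with_hkid(struct tdx_vm *vm, unsigned long gfn)
{
        tdh_mem_page_remove(vm, gfn);   /* unmap from the Secure-EPT            */
        tdh_phymem_page_wbinvd(gfn);    /* extra SEAMCALL #1                    */
        tdh_mem_track(vm);              /* extra SEAMCALL #2: advance TLB epoch */
        kick_all_vcpus_and_wait(vm);    /* IPI to vcpus + wait for their exits  */
}

/* Path 2: HKID already freed -- a single SEAMCALL per page. */
static void reclaim_page_without_hkid(struct tdx_vm *vm, unsigned long gfn)
{
        /* TLBs and caches were already flushed when the HKID was freed. */
        tdh_phymem_page_reclaim(vm, gfn);
}

The IPI round trip in path 1 is what dominates when it is repeated for every
guest private page; path 2 has none of that.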
>
>
>>>> Freeing vcpus is done in
>>>> kvm_arch_destroy_vm(), which is _after_ mmu_notifier->release(), in which
>>>> this tdx_mmu_release_keyid() is called?
>>> guest memfd complicates things. The race is between guest memfd release and mmu
>>> notifier release. kvm_arch_destroy_vm() is called after closing all kvm fds
>>> including guest memfd.
>>>
>>> Here is an example. Let's say we have fds for vhost, guest_memfd, kvm vcpu,
>>> and kvm vm, and the process is exiting. Please note that vhost takes a
>>> reference on the mm in order to access guest (shared) memory.
>>>
>>> exit_mmap():
>>> Usually the mmu notifier release fires here, but not yet because of vhost.
>>>
>>> exit_files()
>>> Close the vhost fd. vhost starts a timer to issue mmput().
>> Why does it need to start a timer to issue mmput(), but not call mmput()
>> directly?
> That's how vhost implements it; it's out of KVM's control. Other components, or
> user space in other threads, can take references to the mm or to the fds and
> keep/release them as they like.
>
>
>>> Close guest_memfd. kvm_gmem_release() calls kvm_mmu_unmap_gfn_range(), which
>>> eventually calls TDH.MEM.SEPT.REMOVE() and TDH.MEM.PAGE.REMOVE(). This takes
>>> time because it processes the whole guest memory. Finally it calls
>>> kvm_put_kvm().
>>>
>>> While the unmapping on behalf of guest_memfd is in progress, the vhost timer
>>> fires and calls mmput(), which triggers the mmu notifier release.
>>>
>>> Close the kvm vcpu/vm fds; they call kvm_put_kvm(). The last reference
>>> triggers kvm_destroy_vm().
>>>
>>> It's ideal to free the HKID first for efficiency, but KVM has no control over
>>> the order in which the fds are closed.
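To illustrate why KVM cannot rely on the close order, here is a tiny,
self-contained toy model of the reference counting involved. Nothing below
is real KVM code; it only shows that the teardown runs in whichever context
happens to drop the last reference:

#include <stdatomic.h>

struct vm_ref {
        atomic_int refcount;    /* held by vcpu fds, vm fd, guest_memfd, vhost, ... */
};

static void vm_teardown(struct vm_ref *vm)
{
        /* Runs in whichever context dropped the last reference. */
        (void)vm;
}

static void vm_put(struct vm_ref *vm)
{
        if (atomic_fetch_sub(&vm->refcount, 1) == 1)
                vm_teardown(vm);
}

Since user space (and deferred work such as vhost's delayed mmput()) decides
when each holder drops its reference, KVM cannot ensure the "free the HKID
first" ordering from fd close order alone.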
>> Firstly, what kinda performance efficiency gain are we talking about?
> 2 extra SEAMCALLs + an IPI sync for each guest private page. If the guest memory
> is hundreds of GB, the difference can be tens of minutes.
>
>
>> We cannot really tell whether using two different methods to tear down SEPT
>> pages can be justified because of this.
>>
>> Even if it's worth doing, it is an optimization, which can/should be done later,
>> after you have put all the building blocks together.
>>
>> That being said, you are putting too much logic in this patch, i.e., it just
>> doesn't make sense to release the TDX keyID in the MMU code path in _this_ patch.
> I agree that this patch is too huge, and that we should break it into smaller
> patches.
>
>
>>>> But here we are depending on vcpus being freed before tdx_mmu_release_hkid()?
>>> Not necessarily.
>> I am wondering when TDH.VP.FLUSH is done. Supposedly it should be called when
>> we free vcpus? But again this means you need to call TDH.MNG.VPFLUSHDONE
>> _after_ freeing vcpus, and that looks conflicting if you make
>> tdx_mmu_release_keyid() be called from the MMU notifier.
> tdx_mmu_release_keyid() calls it explicitly for all vcpus.
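A minimal sketch of what that could look like, again with placeholder names
and a toy vcpu array rather than the real KVM structures:

/* Placeholder types and stubs, not the real KVM/TDX code. */
struct tdx_vcpu { int id; };
struct tdx_td {
        int nr_vcpus;
        struct tdx_vcpu *vcpus;
};
static void tdh_vp_flush(struct tdx_vcpu *vcpu) { (void)vcpu; }
static void tdh_mng_vpflushdone(struct tdx_td *td) { (void)td; }

static void tdx_flush_all_vcpus_sketch(struct tdx_td *td)
{
        /* Flush every vcpu's TD state, then confirm with TDH.MNG.VPFLUSHDONE(). */
        for (int i = 0; i < td->nr_vcpus; i++)
                tdh_vp_flush(&td->vcpus[i]);

        tdh_mng_vpflushdone(td);
}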