Message-ID: <8d0a4585-9e48-4e8d-8acb-7cb99142654c@intel.com>
Date: Thu, 27 Mar 2025 12:10:05 +0200
From: Adrian Hunter <adrian.hunter@...el.com>
To: Vishal Annapurve <vannapurve@...gle.com>
CC: <pbonzini@...hat.com>, <seanjc@...gle.com>, <kvm@...r.kernel.org>,
	<rick.p.edgecombe@...el.com>, <kirill.shutemov@...ux.intel.com>,
	<kai.huang@...el.com>, <reinette.chatre@...el.com>, <xiaoyao.li@...el.com>,
	<tony.lindgren@...ux.intel.com>, <binbin.wu@...ux.intel.com>,
	<isaku.yamahata@...el.com>, <linux-kernel@...r.kernel.org>,
	<yan.y.zhao@...el.com>, <chao.gao@...el.com>
Subject: Re: [PATCH RFC] KVM: TDX: Defer guest memory removal to decrease
 shutdown time

On 27/03/25 10:14, Vishal Annapurve wrote:
> On Thu, Mar 13, 2025 at 11:17 AM Adrian Hunter <adrian.hunter@...el.com> wrote:
>> ...
>> == Problem ==
>>
>> Currently, Dynamic Page Removal is used when the TD is being shut
>> down, for the sake of simpler initial code.
>>
>> This happens when guest_memfds are closed, refer kvm_gmem_release().
>> guest_memfds hold a reference to struct kvm, so VM destruction cannot
>> happen until after they are released.
>>
>> Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
>> reclaim time.  For example:
>>
>>         VCPUs   Size (GB)       Before (secs)   After (secs)
>>          4       18              72              24
>>         32      107             517             134
> 
> If the time for reclaim grows linearly with memory size, then this is
> still a significant amount of time for TD cleanup (~21 minutes for a
> 1 TB VM, extrapolating from the "After" column).
> 
>>
>> Note, the V19 patch set:
>>
>>         https://lore.kernel.org/all/cover.1708933498.git.isaku.yamahata@intel.com/
>>
>> did not have this issue because the HKID was released early, something that
>> Sean effectively NAK'ed:
>>
>>         "No, the right answer is to not release the HKID until the VM is
>>         destroyed."
>>
>>         https://lore.kernel.org/all/ZN+1QHGa6ltpQxZn@google.com/
> 
> IIUC, Sean is suggesting treating S-EPT page removal and page reclaim
> separately. Under his proposal:

Thanks for looking at this!

It seems I am using the term "reclaim" wrongly.  Sorry!

I am talking about taking private memory away from the guest, not
what happens to it subsequently.  When the TDX VM is in the "Runnable"
state, taking private memory away is slow (slow S-EPT removal).  When
the TDX VM is in the "Teardown" state, taking private memory away is
faster (via the TDX SEAMCALL TDH.PHYMEM.PAGE.RECLAIM, which is where
I picked up the term "reclaim").
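
To make the distinction concrete, below is a minimal C sketch of the
two paths.  The enum, struct and helper names are hypothetical
placeholders rather than the actual KVM/TDX symbols; it only
illustrates choosing the removal path based on the TD lifecycle state.

enum td_state { TD_RUNNABLE, TD_TEARDOWN };

struct td_page { unsigned long hpa; };

/*
 * Slow path: dynamic S-EPT removal while the TD is still runnable,
 * i.e. the TDH.MEM.RANGE.BLOCK / TDH.MEM.TRACK / TDH.MEM.PAGE.REMOVE
 * sequence, which is expensive per page.
 */
static void remove_page_runnable(struct td_page *page)
{
}

/* Fast path: TDH.PHYMEM.PAGE.RECLAIM once the TD is in teardown. */
static void remove_page_teardown(struct td_page *page)
{
}

static void take_page_from_guest(enum td_state state, struct td_page *page)
{
	if (state == TD_TEARDOWN)
		remove_page_teardown(page);
	else
		remove_page_runnable(page);
	/*
	 * Either way, the page now belongs to KVM again; no separate
	 * "reclaim" step is needed before it can be freed.
	 */
}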

Once guest memory is removed from the S-EPT, no further action is
needed to reclaim it.  It belongs to KVM at that point.

guest_memfd memory can be added directly to the S-EPT.  No intermediate
state or step is used.  Any guest_memfd memory not given to the
MMU (S-EPT) can be freed directly if userspace/KVM wants to.  Again,
there is no intermediate state or (reclaim) step.

> 1) If userspace drops last reference on gmem inode before/after
> dropping the VM reference
>     -> slow S-EPT removal and slow page reclaim

Currently slow S-EPT removal happens when the file is released.

> 2) If memslots are removed before closing the gmem and dropping the VM reference
>     -> slow S-EPT page removal and no page reclaim while the gmem is still around.
> 
> Reclaim should ideally happen when the host wants to use that memory,
> i.e. for the following scenarios:
> 1) Truncation of private guest_memfd ranges
> 2) Conversion of private guest_memfd ranges to shared when supporting
> in-place conversion (could be deferred until the memory is faulted in
> as shared as well).
> 
> Would it be possible for you to provide the split of the time spent in
> slow S-EPT page removal vs page reclaim?

Based on what I wrote above, all the time is spent removing pages
from the S-EPT.  Greater than 99% of shutdown time is spent in
kvm_gmem_release().
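
A rough way to confirm that split would be to time the release path
itself.  A minimal sketch, assuming instrumentation can be added
around the guest_memfd release work (the function name and placement
below are placeholders, not the real kvm_gmem_release()):

#include <linux/fs.h>
#include <linux/ktime.h>
#include <linux/printk.h>

/* Hypothetical instrumentation, not the real release callback. */
static void timed_gmem_release(struct file *file)
{
	ktime_t start = ktime_get();

	/* ... existing guest_memfd release work (S-EPT removal) ... */

	pr_info("gmem release took %lld ms\n",
		ktime_ms_delta(ktime_get(), start));
}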

> 
> It might be worth exploring the possibility of parallelizing or giving
> userspace the flexibility to parallelize both these operations to
> bring the cleanup time down (to be comparable with non-confidential VM
> cleanup time for example).
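
One way that suggestion could be prototyped is to split the private
range into chunks and push each chunk onto an unbound workqueue.  The
sketch below is hypothetical: the structures and names are not
existing KVM code, and locking around the S-EPT is ignored.

#include <linux/minmax.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct removal_work {
	struct work_struct work;
	unsigned long first_pfn;
	unsigned long nr_pages;
};

static void remove_chunk_fn(struct work_struct *work)
{
	struct removal_work *rw = container_of(work, struct removal_work, work);

	/* ... remove rw->nr_pages pages starting at rw->first_pfn ... */
	kfree(rw);
}

static void remove_range_parallel(unsigned long first_pfn,
				  unsigned long nr_pages,
				  unsigned long chunk)
{
	struct workqueue_struct *wq = alloc_workqueue("gmem_remove", WQ_UNBOUND, 0);
	unsigned long pfn;

	if (!wq)
		return;

	for (pfn = first_pfn; pfn < first_pfn + nr_pages; pfn += chunk) {
		struct removal_work *rw = kzalloc(sizeof(*rw), GFP_KERNEL);

		if (!rw)
			break;
		rw->first_pfn = pfn;
		rw->nr_pages = min(chunk, first_pfn + nr_pages - pfn);
		INIT_WORK(&rw->work, remove_chunk_fn);
		queue_work(wq, &rw->work);
	}

	flush_workqueue(wq);
	destroy_workqueue(wq);
}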

