Message-ID: <CAGtprH_o_Vbvk=jONSep64wRhAJ+Y51uZfX7-DDS28vh=ALQOA@mail.gmail.com>
Date: Thu, 27 Mar 2025 01:14:29 -0700
From: Vishal Annapurve <vannapurve@...gle.com>
To: Adrian Hunter <adrian.hunter@...el.com>
Cc: pbonzini@...hat.com, seanjc@...gle.com, kvm@...r.kernel.org, 
	rick.p.edgecombe@...el.com, kirill.shutemov@...ux.intel.com, 
	kai.huang@...el.com, reinette.chatre@...el.com, xiaoyao.li@...el.com, 
	tony.lindgren@...ux.intel.com, binbin.wu@...ux.intel.com, 
	isaku.yamahata@...el.com, linux-kernel@...r.kernel.org, yan.y.zhao@...el.com, 
	chao.gao@...el.com
Subject: Re: [PATCH RFC] KVM: TDX: Defer guest memory removal to decrease
 shutdown time

On Thu, Mar 13, 2025 at 11:17 AM Adrian Hunter <adrian.hunter@...el.com> wrote:
> ...
> == Problem ==
>
> Currently, Dynamic Page Removal is being used when the TD is being
> shutdown for the sake of having simpler initial code.
>
> This happens when guest_memfds are closed, refer kvm_gmem_release().
> guest_memfds hold a reference to struct kvm, so that VM destruction cannot
> happen until after they are released, refer kvm_gmem_release().
>
> Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
> reclaim time.  For example:
>
>         VCPUs   Size (GB)       Before (secs)   After (secs)
>          4       18              72              24
>         32      107             517             134

If the time for reclaim grows linearly with memory size, this is still
a significant amount of time for TD cleanup even with this change:
roughly 21 minutes for a 1 TB TD (quick arithmetic below).
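
For reference, the back-of-the-envelope arithmetic, assuming linear
scaling of the "After" column (134 s for 107 GB); numbers are taken
from the table in the cover letter, not new measurements:

#include <stdio.h>

int main(void)
{
	double secs_per_gb = 134.0 / 107.0;	/* ~1.25 s per GB */
	double est_secs = secs_per_gb * 1024.0;	/* ~1282 s */

	printf("Estimated 1 TB teardown: %.0f s (~%.0f min)\n",
	       est_secs, est_secs / 60.0);	/* ~21 min */
	return 0;
}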

>
> Note, the V19 patch set:
>
>         https://lore.kernel.org/all/cover.1708933498.git.isaku.yamahata@intel.com/
>
> did not have this issue because the HKID was released early, something that
> Sean effectively NAK'ed:
>
>         "No, the right answer is to not release the HKID until the VM is
>         destroyed."
>
>         https://lore.kernel.org/all/ZN+1QHGa6ltpQxZn@google.com/

IIUC, Sean is suggesting treating S-EPT page removal and page reclaim
separately. Under his proposal:
1) If userspace drops the last reference on the gmem inode before/after
dropping the VM reference
    -> slow S-EPT removal and slow page reclaim
2) If memslots are removed before closing the gmem and dropping the VM reference
    -> slow S-EPT page removal and no page reclaim while the gmem is
still around.

Reclaim should ideally happen when the host wants to use that memory,
i.e. in the following scenarios (see the sketch below):
1) Truncation of private guest_memfd ranges
2) Conversion of private guest_memfd ranges to shared when supporting
in-place conversion (this could also be deferred until the range is
faulted in as shared).
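
For scenario 1), the userspace-visible operation would be a PUNCH_HOLE
on the private range. A minimal sketch of what I have in mind (helper
name and parameters are mine, error handling trimmed):

#define _GNU_SOURCE
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>
#include <linux/falloc.h>

/* Truncate (punch a hole in) a private guest_memfd range; KVM should
 * unmap and reclaim the backing pages for exactly this range, rather
 * than leaving all of it to VM teardown. */
static int punch_gmem_range(int gmem_fd, off_t offset, off_t len)
{
	if (fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      offset, len) < 0) {
		perror("fallocate(PUNCH_HOLE)");
		return -errno;
	}
	return 0;
}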

Would it be possible for you to provide a breakdown of the time spent
in slow S-EPT page removal vs page reclaim?

It might be worth exploring parallelizing these operations, or giving
userspace the flexibility to parallelize them, to bring the cleanup
time down (e.g. to be comparable with non-confidential VM cleanup
time). A rough sketch of the userspace side is below.
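
Purely as an illustration of the userspace side (whether concurrent
PUNCH_HOLE on one gmem actually scales depends on the locking in the
gmem/S-EPT paths; names and chunking are mine):

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <linux/falloc.h>

struct punch_work {
	int fd;
	off_t offset;
	off_t len;
};

static void *punch_worker(void *arg)
{
	struct punch_work *w = arg;

	/* Each worker truncates its own chunk of the private range. */
	fallocate(w->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  w->offset, w->len);
	return NULL;
}

/* Punch the whole [0, size) range of the gmem fd with nr_threads
 * workers before closing it, instead of leaving the entire removal
 * plus reclaim to the final fput()/kvm_gmem_release(). */
static void punch_parallel(int gmem_fd, off_t size, int nr_threads)
{
	pthread_t tids[nr_threads];
	struct punch_work work[nr_threads];
	off_t chunk = size / nr_threads;

	for (int i = 0; i < nr_threads; i++) {
		work[i].fd = gmem_fd;
		work[i].offset = (off_t)i * chunk;
		work[i].len = (i == nr_threads - 1) ?
			      size - work[i].offset : chunk;
		pthread_create(&tids[i], NULL, punch_worker, &work[i]);
	}
	for (int i = 0; i < nr_threads; i++)
		pthread_join(tids[i], NULL);
}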
