Message-ID: <CAGtprH_SKJ4hbQ4aSxxybcsD=eSQraP7a4AkQ3SKuMm2=Oyp+A@mail.gmail.com>
Date: Mon, 16 Jun 2025 20:51:41 -0700
From: Vishal Annapurve <vannapurve@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: Ackerley Tng <ackerleytng@...gle.com>, pbonzini@...hat.com, seanjc@...gle.com,
linux-kernel@...r.kernel.org, kvm@...r.kernel.org, x86@...nel.org,
rick.p.edgecombe@...el.com, dave.hansen@...el.com, kirill.shutemov@...el.com,
tabba@...gle.com, quic_eberman@...cinc.com, michael.roth@....com,
david@...hat.com, vbabka@...e.cz, jroedel@...e.de, thomas.lendacky@....com,
pgonda@...gle.com, zhiquan1.li@...el.com, fan.du@...el.com,
jun.miao@...el.com, ira.weiny@...el.com, isaku.yamahata@...el.com,
xiaoyao.li@...el.com, binbin.wu@...ux.intel.com, chao.p.peng@...el.com
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
On Mon, Jun 16, 2025 at 3:02 AM Yan Zhao <yan.y.zhao@...el.com> wrote:
>
> On Wed, Jun 11, 2025 at 07:30:10AM -0700, Vishal Annapurve wrote:
> > On Wed, Jun 4, 2025 at 7:45 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > >
> > > We need to restore the previous state (which includes the host page table)
> > > if the conversion can't be done.
> > > That said, in my view, a better flow would be:
> > >
> > > 1. guest_memfd sends a pre-invalidation request to users (users here meaning
> > >    the in-kernel consumers of memory allocated from guest_memfd).
> > >
> > > 2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
> > > proceed. For example, in the case of TDX, this might involve memory
> > > allocation and page splitting.
> > >
> > > 3. Based on the pre-check results, guest_memfd either aborts the invalidation or
> > > proceeds by sending the actual invalidation request.
> > >
> > > 4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
> > >    TDX, the unmap must succeed unless there are bugs in KVM or the TDX module.
> > >    In such cases, TDX can call back into guest_memfd to report the poison
> > >    status of the page or elevate the page reference count.
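
(Purely to visualize steps 1-4 above: a minimal sketch of such a two-phase
invalidation. All names here, the ops struct, the users list and
gmem_convert_range(), are made up for illustration and are not from this
series.)

struct gmem_user_ops {
	/* Step 2: pre-checks only (e.g. allocation, page splitting); no teardown. */
	int (*pre_invalidate)(struct gmem_user *user, pgoff_t start, pgoff_t end);
	/* Step 4: the actual unmap; must not fail absent KVM/TDX module bugs. */
	void (*invalidate)(struct gmem_user *user, pgoff_t start, pgoff_t end);
};

static int gmem_convert_range(struct gmem_obj *gmem, pgoff_t start, pgoff_t end)
{
	struct gmem_user *user;
	int ret;

	/* Steps 1+2: every consumer may veto the conversion up front. */
	list_for_each_entry(user, &gmem->users, node) {
		ret = user->ops->pre_invalidate(user, start, end);
		if (ret)
			return ret;	/* Step 3: abort, previous state intact. */
	}

	/* Steps 3+4: commit; the unmap is expected to always succeed. */
	list_for_each_entry(user, &gmem->users, node)
		user->ops->invalidate(user, start, end);

	return 0;
}
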
> >
> > Few questions here:
> > 1) It sounds like the failure to remove entries from SEPT could only
> > be due to bugs in the KVM/TDX module,
> Yes.
>
> > how reliable would it be to
> > continue executing TDX VMs on the host once such bugs are hit?
> The TDX VMs will be killed. However, the private pages are still mapped in the
> SEPT (after the unmapping failure).
> The teardown flow for TDX VM is:
>
> do_exit
>  |->exit_files
>     |->kvm_gmem_release ==> (1) Unmap guest pages
>     |->release kvmfd
>        |->kvm_destroy_vm (2) Reclaiming resources
>           |->kvm_arch_pre_destroy_vm ==> Release hkid
>           |->kvm_arch_destroy_vm ==> Reclaim SEPT page table pages
>
> Without holding a page reference after (1) fails, the guest pages may have been
> re-assigned by the host OS while they are still tracked in the TDX module.
What happens to the page table memory holding the SEPT entry? Is that
also supposed to be leaked?
>
>
> > 2) Is it reliable to continue executing the host kernel and other
> > normal VMs once such bugs are hit?
> With TDX holding the page ref count, the impact of an unmapping failure of
> guest pages is just that those pages are leaked.
>
> > 3) Can the memory be reclaimed reliably if the VM is marked as dead
> > and cleaned up right away?
> As in the above flow, TDX needs to hold the page reference on unmapping failure
> until reclaiming succeeds. That said, reclaiming itself can fail as well.
>
> So, below is my proposal, shown as simple POC code based on
> https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2.
>
> Patch 1: TDX increases page ref count on unmap failure.
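
(For readers skimming the POC, patch 1 boils down to something like the
sketch below in the SEPT-removal error path; the helper name and callsite
are illustrative, not the actual code:)

/*
 * Illustrative only: on SEPT removal failure, elevate the folio refcount
 * so the host never reuses a page the TDX module still tracks. Would be
 * called from the error path of tdx_sept_remove_private_spte().
 */
static void tdx_hold_page_on_unmap_failure(kvm_pfn_t pfn, enum pg_level level)
{
	folio_ref_add(pfn_folio(pfn), KVM_PAGES_PER_HPAGE(level));
}
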
This will not work. As Ackerley pointed out earlier [1], it would be
impossible to differentiate between transient refcounts on private pages
and extra refcounts held on private memory due to a TDX unmap failure.
[1] https://lore.kernel.org/lkml/diqzfrgfp95d.fsf@ackerleytng-ctop.c.googlers.com/
> Patch 2: Bail out private-to-shared conversion if splitting fails.
> Patch 3: Make kvm_gmem_zap() return void.
>
> ...
>
> If the above changes are agreeable, we could consider a more ambitious approach:
> introducing an interface like:
>
> int guest_memfd_add_page_ref_count(gfn_t gfn, int nr);
> int guest_memfd_dec_page_ref_count(gfn_t gfn, int nr);
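
(To make that concrete, TDX's error and reclaim paths could then use the
pair roughly like this; the callsites below are guesses, only the two
prototypes above come from the proposal:)

/* On SEPT removal failure: pin the range via guest_memfd. */
if (guest_memfd_add_page_ref_count(gfn, KVM_PAGES_PER_HPAGE(level)))
	pr_err("gfn 0x%llx: range is still tracked by the TDX module\n", gfn);

/* Much later, only once reclaim of the page finally succeeds: */
guest_memfd_dec_page_ref_count(gfn, KVM_PAGES_PER_HPAGE(level));
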
I don't see any reason to introduce full tracking of gfn mapping
status in SEPTs just to handle very rare scenarios that KVM/TDX are
taking utmost care to avoid.

That being said, I see value in letting guest_memfd know the exact
ranges still in use by the TDX module due to unmapping failures, so
that guest_memfd can take the right action instead of relying on
refcounts.
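
(For instance, that could be as narrow as a one-way error-reporting hook;
the prototype below is made up purely for illustration:)

/*
 * Hypothetical: called by KVM/TDX only when unmapping a private range
 * fails, so guest_memfd can mark the range unusable (never reallocate or
 * convert it) instead of inferring that from elevated refcounts.
 */
void kvm_gmem_report_unmap_error(struct inode *inode, pgoff_t start, pgoff_t nr);
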
Does KVM continue unmapping the full range even after TDX SEPT
management fails to unmap a subrange?