linux-kernel - Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGtprH_ypohFy9TOJ8Emm_roT4XbQUtLKZNFcM6Fr+fhTFkE0Q@mail.gmail.com>
Date: Tue, 17 Jun 2025 01:09:05 -0700
From: Vishal Annapurve <vannapurve@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: Ackerley Tng <ackerleytng@...gle.com>, pbonzini@...hat.com, seanjc@...gle.com, 
	linux-kernel@...r.kernel.org, kvm@...r.kernel.org, x86@...nel.org, 
	rick.p.edgecombe@...el.com, dave.hansen@...el.com, kirill.shutemov@...el.com, 
	tabba@...gle.com, quic_eberman@...cinc.com, michael.roth@....com, 
	david@...hat.com, vbabka@...e.cz, jroedel@...e.de, thomas.lendacky@....com, 
	pgonda@...gle.com, zhiquan1.li@...el.com, fan.du@...el.com, 
	jun.miao@...el.com, ira.weiny@...el.com, isaku.yamahata@...el.com, 
	xiaoyao.li@...el.com, binbin.wu@...ux.intel.com, chao.p.peng@...el.com
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

On Mon, Jun 16, 2025 at 11:55 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
>
> On Mon, Jun 16, 2025 at 08:51:41PM -0700, Vishal Annapurve wrote:
> > On Mon, Jun 16, 2025 at 3:02 AM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > >
> > > On Wed, Jun 11, 2025 at 07:30:10AM -0700, Vishal Annapurve wrote:
> > > > On Wed, Jun 4, 2025 at 7:45 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > > > >
> > > > > We need to restore to the previous status (which includes the host page table)
> > > > > if conversion can't be done.
> > > > > That said, in my view, a better flow would be:
> > > > >
> > > > > 1. guest_memfd sends a pre-invalidation request to users (users here means the
> > > > >    consumers in kernel of memory allocated from guest_memfd).
> > > > >
> > > > > 2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
> > > > >    proceed. For example, in the case of TDX, this might involve memory
> > > > >    allocation and page splitting.
> > > > >
> > > > > 3. Based on the pre-check results, guest_memfd either aborts the invalidation or
> > > > >    proceeds by sending the actual invalidation request.
> > > > >
> > > > > 4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
> > > > >    TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
> > > > >    In such cases, TDX can callback guest_memfd to inform the poison-status of
> > > > >    the page or elevate the page reference count.
> > > >
> > > > Few questions here:
> > > > 1) It sounds like the failure to remove entries from SEPT could only
> > > > be due to bugs in the KVM/TDX module,
> > > Yes.
> > >
> > > > how reliable would it be to
> > > > continue executing TDX VMs on the host once such bugs are hit?
> > > The TDX VMs will be killed. However, the private pages are still mapped in the
> > > SEPT (after the unmapping failure).
> > > The teardown flow for TDX VM is:
> > >
> > > do_exit
> > >   |->exit_files
> > >      |->kvm_gmem_release ==> (1) Unmap guest pages
> > >      |->release kvmfd
> > >         |->kvm_destroy_vm  (2) Reclaiming resources
> > >            |->kvm_arch_pre_destroy_vm  ==> Release hkid
> > >            |->kvm_arch_destroy_vm  ==> Reclaim SEPT page table pages
> > >
> > > Without holding page reference after (1) fails, the guest pages may have been
> > > re-assigned by the host OS while they are still still tracked in the TDX module.
> >
> > What happens to the pagetable memory holding the SEPT entry? Is that
> > also supposed to be leaked?
> It depends on if the reclaiming of the page table pages holding the SEPT entry
> fails. If it is, it will be also leaked.
> But the page to hold TDR is for sure to be leaked as the reclaiming of TDR page
> will fail after (1) fails.
>

Ok. Few questions that I would like to touch base briefly on:
i) If (1) fails and then VM is marked as bugged, will the TDX module
actually access that page in context of the same VM again?
ii) What all resources should remain unreclaimed if (1) fails?
     * page backing SEPT entry
     * page backing PAMT entry
     * TDMR
    If TDMR is the only one that fails to reclaim, will the TDX module
actually access the physical memory ever after the VM is cleaned up?
Otherwise, should all of these be made unreclaimable?
iii) Will it be safe for the host to use that memory by proper
WBINVD/memory clearing sequence if TDX module/TD is not going to use
that memory?

>
>
> > > > 2) Is it reliable to continue executing the host kernel and other
> > > > normal VMs once such bugs are hit?
> > > If with TDX holding the page ref count, the impact of unmapping failure of guest
> > > pages is just to leak those pages.
> > >
> > > > 3) Can the memory be reclaimed reliably if the VM is marked as dead
> > > > and cleaned up right away?
> > > As in the above flow, TDX needs to hold the page reference on unmapping failure
> > > until after reclaiming is successful. Well, reclaiming itself is possible to
> > > fail either.
> > >
> > > So, below is my proposal. Showed in the simple POC code based on
> > > https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2.
> > >
> > > Patch 1: TDX increases page ref count on unmap failure.
> >
> > This will not work as Ackerley pointed out earlier [1], it will be
> > impossible to differentiate between transient refcounts on private
> > pages and extra refcounts of private memory due to TDX unmap failure.
> Hmm. why are there transient refcounts on private pages?
> And why should we differentiate the two?

Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].

Speculative/transient refcounts came up a few times In the context of
guest_memfd discussions, some examples include: pagetable walkers,
page migration, speculative pagecache lookups, GUP-fast etc. David H
can provide more context here as needed.

Effectively some core-mm features that are present today or might land
in the future can cause folio refcounts to be grabbed for short
durations without actual access to underlying physical memory. These
scenarios are unlikely to happen for private memory but can't be
discounted completely.

Another reason to avoid relying on refcounts is to not block usage of
raw physical memory unmanaged by kernel (without page structs) to back
guest private memory as we had discussed previously. This will help
simplify merge/split operations during conversions and help usecases
like guest memory persistence [2] and non-confidential VMs.

[1] https://lore.kernel.org/lkml/diqz7c2lr6wg.fsf@ackerleytng-ctop.c.googlers.com/
[2] https://lore.kernel.org/lkml/20240805093245.889357-1-jgowans@amazon.com/

>
>
> > [1] https://lore.kernel.org/lkml/diqzfrgfp95d.fsf@ackerleytng-ctop.c.googlers.com/
> >
> > > Patch 2: Bail out private-to-shared conversion if splitting fails.
> > > Patch 3: Make kvm_gmem_zap() return void.
> > >
> > > ...
> > >         /*
> > >
> > >
> > > If the above changes are agreeable, we could consider a more ambitious approach:
> > > introducing an interface like:
> > >
> > > int guest_memfd_add_page_ref_count(gfn_t gfn, int nr);
> > > int guest_memfd_dec_page_ref_count(gfn_t gfn, int nr);
> >
> > I don't see any reason to introduce full tracking of gfn mapping
> > status in SEPTs just to handle very rare scenarios which KVM/TDX are
> > taking utmost care to avoid.
> >
> > That being said, I see value in letting guest_memfd know exact ranges
> > still being under use by the TDX module due to unmapping failures.
> > guest_memfd can take the right action instead of relying on refcounts.
> >
> > Does KVM continue unmapping the full range even after TDX SEPT
> > management fails to unmap a subrange?
> Yes, if there's no bug in KVM, it will continue unmapping the full ranges.