linux-kernel - Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aFE8H0J7w+JEnxSq@yzhao56-desk.sh.intel.com>
Date: Tue, 17 Jun 2025 17:57:51 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Vishal Annapurve <vannapurve@...gle.com>
CC: Ackerley Tng <ackerleytng@...gle.com>, <pbonzini@...hat.com>,
	<seanjc@...gle.com>, <linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>,
	<x86@...nel.org>, <rick.p.edgecombe@...el.com>, <dave.hansen@...el.com>,
	<kirill.shutemov@...el.com>, <tabba@...gle.com>, <quic_eberman@...cinc.com>,
	<michael.roth@....com>, <david@...hat.com>, <vbabka@...e.cz>,
	<jroedel@...e.de>, <thomas.lendacky@....com>, <pgonda@...gle.com>,
	<zhiquan1.li@...el.com>, <fan.du@...el.com>, <jun.miao@...el.com>,
	<ira.weiny@...el.com>, <isaku.yamahata@...el.com>, <xiaoyao.li@...el.com>,
	<binbin.wu@...ux.intel.com>, <chao.p.peng@...el.com>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge
 pages

On Tue, Jun 17, 2025 at 01:09:05AM -0700, Vishal Annapurve wrote:
> On Mon, Jun 16, 2025 at 11:55 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> >
> > On Mon, Jun 16, 2025 at 08:51:41PM -0700, Vishal Annapurve wrote:
> > > On Mon, Jun 16, 2025 at 3:02 AM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > > >
> > > > On Wed, Jun 11, 2025 at 07:30:10AM -0700, Vishal Annapurve wrote:
> > > > > On Wed, Jun 4, 2025 at 7:45 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > > > > >
> > > > > > We need to restore to the previous status (which includes the host page table)
> > > > > > if conversion can't be done.
> > > > > > That said, in my view, a better flow would be:
> > > > > >
> > > > > > 1. guest_memfd sends a pre-invalidation request to users (users here means the
> > > > > >    consumers in kernel of memory allocated from guest_memfd).
> > > > > >
> > > > > > 2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
> > > > > >    proceed. For example, in the case of TDX, this might involve memory
> > > > > >    allocation and page splitting.
> > > > > >
> > > > > > 3. Based on the pre-check results, guest_memfd either aborts the invalidation or
> > > > > >    proceeds by sending the actual invalidation request.
> > > > > >
> > > > > > 4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
> > > > > >    TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
> > > > > >    In such cases, TDX can callback guest_memfd to inform the poison-status of
> > > > > >    the page or elevate the page reference count.
> > > > >
> > > > > Few questions here:
> > > > > 1) It sounds like the failure to remove entries from SEPT could only
> > > > > be due to bugs in the KVM/TDX module,
> > > > Yes.
> > > >
> > > > > how reliable would it be to
> > > > > continue executing TDX VMs on the host once such bugs are hit?
> > > > The TDX VMs will be killed. However, the private pages are still mapped in the
> > > > SEPT (after the unmapping failure).
> > > > The teardown flow for TDX VM is:
> > > >
> > > > do_exit
> > > >   |->exit_files
> > > >      |->kvm_gmem_release ==> (1) Unmap guest pages
> > > >      |->release kvmfd
> > > >         |->kvm_destroy_vm  (2) Reclaiming resources
> > > >            |->kvm_arch_pre_destroy_vm  ==> Release hkid
> > > >            |->kvm_arch_destroy_vm  ==> Reclaim SEPT page table pages
> > > >
> > > > Without holding page reference after (1) fails, the guest pages may have been
> > > > re-assigned by the host OS while they are still still tracked in the TDX module.
> > >
> > > What happens to the pagetable memory holding the SEPT entry? Is that
> > > also supposed to be leaked?
> > It depends on if the reclaiming of the page table pages holding the SEPT entry
> > fails. If it is, it will be also leaked.
> > But the page to hold TDR is for sure to be leaked as the reclaiming of TDR page
> > will fail after (1) fails.
> >
> 
> Ok. Few questions that I would like to touch base briefly on:
> i) If (1) fails and then VM is marked as bugged, will the TDX module
> actually access that page in context of the same VM again?
In TDX module, the TD is marked as TD_TEARDOWN after step (2) when hkid is
released successfully.
Before that, TD is able to access the pages even if it is marked as buggy by KVM.

After TD is marked as TD_TEARDOWN, since (1) fails, the problematic guest
private pages are still tracked in the PAMT entries.
So, re-assignment the same PFN to other TDs will fail.

> ii) What all resources should remain unreclaimed if (1) fails?
>      * page backing SEPT entry
>      * page backing PAMT entry
>      * TDMR
>     If TDMR is the only one that fails to reclaim, will the TDX module
> actually access the physical memory ever after the VM is cleaned up?
> Otherwise, should all of these be made unreclaimable?
>From my understanding, they are
- guest private pages
- TDR page
- PAMT entries for guest private pages and TDR page


> iii) Will it be safe for the host to use that memory by proper
> WBINVD/memory clearing sequence if TDX module/TD is not going to use
> that memory?
I'm not sure. But it should be impossible for host to re-assign the pages to
other TDs as long as PAMT entries are not updated.


> > > > > 2) Is it reliable to continue executing the host kernel and other
> > > > > normal VMs once such bugs are hit?
> > > > If with TDX holding the page ref count, the impact of unmapping failure of guest
> > > > pages is just to leak those pages.
> > > >
> > > > > 3) Can the memory be reclaimed reliably if the VM is marked as dead
> > > > > and cleaned up right away?
> > > > As in the above flow, TDX needs to hold the page reference on unmapping failure
> > > > until after reclaiming is successful. Well, reclaiming itself is possible to
> > > > fail either.
> > > >
> > > > So, below is my proposal. Showed in the simple POC code based on
> > > > https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2.
> > > >
> > > > Patch 1: TDX increases page ref count on unmap failure.
> > >
> > > This will not work as Ackerley pointed out earlier [1], it will be
> > > impossible to differentiate between transient refcounts on private
> > > pages and extra refcounts of private memory due to TDX unmap failure.
> > Hmm. why are there transient refcounts on private pages?
> > And why should we differentiate the two?
> 
> Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].
> 
> Speculative/transient refcounts came up a few times In the context of
> guest_memfd discussions, some examples include: pagetable walkers,
> page migration, speculative pagecache lookups, GUP-fast etc. David H
> can provide more context here as needed.
GUP-fast only walks page tables for shared memory?
Can other walkers get a private folio by walking shared mappings?

On those speculative/transient refcounts came up, can't the 
kvm_gmem_convert_should_proceed() wait in an interruptible way before returning
failure?

The wait will anyway happen after the conversion is started, i.e.,
in filemap_remove_folio_for_restructuring().
       while (!folio_ref_freeze(folio, filemap_refcount)) {
                /*
                 * At this point only filemap refcounts are expected, hence okay
                 * to spin until speculative refcounts go away.
                 */
                WARN_ONCE(1, "Spinning on folio=%p refcount=%d", folio, folio_ref_count(folio));
        }


BTW, I noticed that there's no filemap_invalidate_lock_shared() in
kvm_gmem_fault_shared() in 
https://lore.kernel.org/all/20250611133330.1514028-9-tabba@google.com/.

Do you know why?

> Effectively some core-mm features that are present today or might land
> in the future can cause folio refcounts to be grabbed for short
> durations without actual access to underlying physical memory. These
> scenarios are unlikely to happen for private memory but can't be
> discounted completely.
> 
> Another reason to avoid relying on refcounts is to not block usage of
> raw physical memory unmanaged by kernel (without page structs) to back
> guest private memory as we had discussed previously. This will help
> simplify merge/split operations during conversions and help usecases
> like guest memory persistence [2] and non-confidential VMs.
Ok.
Currently, "letting guest_memfd know exact ranges still being under use by the
TDX module due to unmapping failures" is good enough for TDX, though full
tracking of each GFN is even better.


> [1] https://lore.kernel.org/lkml/diqz7c2lr6wg.fsf@ackerleytng-ctop.c.googlers.com/
> [2] https://lore.kernel.org/lkml/20240805093245.889357-1-jgowans@amazon.com/
> 
> >
> >
> > > [1] https://lore.kernel.org/lkml/diqzfrgfp95d.fsf@ackerleytng-ctop.c.googlers.com/
> > >
> > > > Patch 2: Bail out private-to-shared conversion if splitting fails.
> > > > Patch 3: Make kvm_gmem_zap() return void.
> > > >
> > > > ...
> > > >         /*
> > > >
> > > >
> > > > If the above changes are agreeable, we could consider a more ambitious approach:
> > > > introducing an interface like:
> > > >
> > > > int guest_memfd_add_page_ref_count(gfn_t gfn, int nr);
> > > > int guest_memfd_dec_page_ref_count(gfn_t gfn, int nr);
> > >
> > > I don't see any reason to introduce full tracking of gfn mapping
> > > status in SEPTs just to handle very rare scenarios which KVM/TDX are
> > > taking utmost care to avoid.
> > >
> > > That being said, I see value in letting guest_memfd know exact ranges
> > > still being under use by the TDX module due to unmapping failures.
> > > guest_memfd can take the right action instead of relying on refcounts.
> > >
> > > Does KVM continue unmapping the full range even after TDX SEPT
> > > management fails to unmap a subrange?
> > Yes, if there's no bug in KVM, it will continue unmapping the full ranges.