Message-ID: <aVzl4Pa34I/uo1pU@yzhao56-desk.sh.intel.com>
Date: Tue, 6 Jan 2026 18:37:20 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Vishal Annapurve <vannapurve@...gle.com>
CC: <pbonzini@...hat.com>, <seanjc@...gle.com>,
	<linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>, <x86@...nel.org>,
	<rick.p.edgecombe@...el.com>, <dave.hansen@...el.com>, <kas@...nel.org>,
	<tabba@...gle.com>, <ackerleytng@...gle.com>, <quic_eberman@...cinc.com>,
	<michael.roth@....com>, <david@...hat.com>, <vbabka@...e.cz>,
	<thomas.lendacky@....com>, <pgonda@...gle.com>, <zhiquan1.li@...el.com>,
	<fan.du@...el.com>, <jun.miao@...el.com>, <ira.weiny@...el.com>,
	<isaku.yamahata@...el.com>, <xiaoyao.li@...el.com>,
	<binbin.wu@...ux.intel.com>, <chao.p.peng@...el.com>
Subject: Re: [RFC PATCH v2 03/23] x86/tdx: Enhance
 tdh_phymem_page_wbinvd_hkid() to invalidate huge pages

On Wed, Dec 31, 2025 at 11:37:26AM -0800, Vishal Annapurve wrote:
> On Tue, Dec 9, 2025 at 5:57 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> >
> > On Tue, Dec 09, 2025 at 05:30:54PM -0800, Vishal Annapurve wrote:
> > > On Tue, Dec 9, 2025 at 5:20 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > > >
> > > > On Tue, Dec 09, 2025 at 05:14:22PM -0800, Vishal Annapurve wrote:
> > > > > On Thu, Aug 7, 2025 at 2:42 AM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > > > > >
> > > > > > index 0a2b183899d8..8eaf8431c5f1 100644
> > > > > > --- a/arch/x86/kvm/vmx/tdx.c
> > > > > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > > > > @@ -1694,6 +1694,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > > > > >  {
> > > > > >         int tdx_level = pg_level_to_tdx_sept_level(level);
> > > > > >         struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > > > > +       struct folio *folio = page_folio(page);
> > > > > >         gpa_t gpa = gfn_to_gpa(gfn);
> > > > > >         u64 err, entry, level_state;
> > > > > >
> > > > > > @@ -1728,8 +1729,9 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > > > > >                 return -EIO;
> > > > > >         }
> > > > > >
> > > > > > -       err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
> > > > > > -
> > > > > > +       err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, folio,
> > > > > > +                                         folio_page_idx(folio, page),
> > > > > > +                                         KVM_PAGES_PER_HPAGE(level));
> > > > >
> > > > > This code seems to assume that folio_order() always matches the level
> > > > > at which it is mapped in the EPT entries.
> > > > I don't think so.
> > > > Please check the implementation of tdh_phymem_page_wbinvd_hkid() [1].
> > > > Only npages = KVM_PAGES_PER_HPAGE(level) pages will be invalidated, and
> > > > npages <= folio_nr_pages(folio).
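(To be concrete, a rough sketch of what the enhanced helper is expected to
do. This is illustrative only, not the exact code in [1];
tdx_wbinvd_one_page() below is a stand-in for the existing per-page
TDH.PHYMEM.PAGE.WBINVD SEAMCALL path.)

u64 tdh_phymem_page_wbinvd_hkid(u16 hkid, struct folio *folio,
				unsigned long start_idx, unsigned long npages)
{
	unsigned long i;
	u64 err = 0;

	/* The requested range must stay within the folio. */
	if (start_idx + npages > folio_nr_pages(folio))
		return TDX_OPERAND_INVALID;	/* error value for illustration */

	/* Invalidate only the npages starting at start_idx, not the whole folio. */
	for (i = 0; i < npages; i++) {
		err = tdx_wbinvd_one_page(hkid, folio_page(folio, start_idx + i));
		if (err)
			break;
	}

	return err;
}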
> > >
> > > Is the gfn passed to tdx_sept_drop_private_spte() always huge page
> > > aligned if mapping is at huge page granularity?
> > Yes.
> > The GFN passed to tdx_sept_set_private_spte() is huge page aligned in
> > kvm_tdp_mmu_map(). SEAMCALL TDH_MEM_PAGE_AUG will also fail otherwise.
> > The GFN passed to tdx_sept_remove_private_spte() comes from the same mapping
> > entry in the mirror EPT.
> >
> > > If the gfn/pfn is not aligned, then once the folio has been split to 4K,
> > > page_folio() will return that page's own order-0 folio, and folio_order()
> > > and folio_page_idx() will be zero. This can cause
> > > tdh_phymem_page_wbinvd_hkid() to return failure.
> > >
> > > If the expectation is that page_folio() will always point to a head
> > > page matching the hugepage mapping granularity, then that logic will not
> > > work correctly IMO.
> > The current logic is that:
> > 1. tdh_mem_page_aug() maps physical memory starting from the page at "start_idx"
> >    within a "folio" and spanning "npages" contiguous PFNs.
> >    (npages corresponds to the mapping level, i.e. KVM_PAGES_PER_HPAGE(level)).
> >    e.g. it can map at the 2MB level, starting at the 4MB offset into a 1GB folio.
> >
> > 2. If a split occurs, the huge 2MB mapping will be split into 4KB ones, while the
> >    underlying folio remains 1GB.
> 
> Private to shared conversion flow discussed so far [1][2][3]:
> 1) Preallocate maple tree entries needed for conversion
> 2) Split filemap range being converted to 4K pages
> 3) Mark KVM MMU invalidation begin for the huge page aligned range
> 4) Zap KVM MMU entries for the converted range
> 5) Update maple tree entries to carry final attributes
> 6) Mark KVM MMU invalidation end for huge page aligned range
> 
> A possible addition to the above flow for splitting cross-boundary leaves:
> 1) Preallocate maple tree entries needed for conversion
> 2) Split filemap range being converted to 4K pages
> 3) Mark KVM MMU invalidation begin for the huge page aligned range
> 4) Split KVM MMU private cross-boundary leaves for the converted range
> 5) Zap KVM MMU entries for the converted range
> 6) Update maple tree entries to carry final attributes
> 7) Mark KVM MMU invalidation end for huge page aligned range
> 
> Note that in both of the above flows, KVM MMU entries will get zapped
> after the folio has been split to 4K, i.e. by the time
> tdx_sept_remove_private_spte() runs, the folio will already be split while
> the EPT entry level is still 2M, so the assumption that EPT entries are
> always a subset of folios will not hold true.
> 
> I think things might be simplified if the KVM TDX stack always operates on
> pages without assuming that ranges are covered by "folios".
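(For clarity, a minimal illustration of the scenario described above, with
assumed values: a 2MB mirror-EPT leaf is zapped after guest_memfd has
already split the backing folio down to 4KB. This is not real code.)

	enum pg_level level  = PG_LEVEL_2M;                   /* EPT leaf is still 2MB     */
	struct folio *folio  = page_folio(page);              /* order-0 folio after split */
	unsigned long idx    = folio_page_idx(folio, page);   /* == 0                      */
	unsigned long npages = KVM_PAGES_PER_HPAGE(level);    /* == 512                    */

	/*
	 * npages (512) > folio_nr_pages(folio) (1): the zapped 2MB range no
	 * longer fits within the backing folio, so tdh_phymem_page_wbinvd_hkid()
	 * cannot cover it from this folio alone.
	 */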
Let's discuss that in the v3 series:
https://lore.kernel.org/all/20260106101646.24809-1-yan.y.zhao@intel.com/

> [1] https://lore.kernel.org/kvm/aN8P87AXlxlEDdpP@google.com/
> [2] https://lore.kernel.org/kvm/diqzzf8oazh4.fsf@google.com/
> [3] https://github.com/googleprodkernel/linux-cc/blob/9ee2bd65cc9b63c871f8f49d217a7a70576a942d/virt/kvm/guest_memfd.c#L894
> 
> >    e.g. now the 0th 4KB mapping after split points to the 4MB offset in the
> >    1GB folio, and the 1st 4KB mapping points to the 4MB+4KB offset...
> >    The mapping level after split is 4KB.
> >
> > 3. tdx_sept_remove_private_spte() invokes tdh_mem_page_remove() and
> >    tdh_phymem_page_wbinvd_hkid().
> >    - The GFN is 2MB aligned and the level is 2MB if no split has occurred, or
> >    - The GFN is 4KB aligned and the level is 4KB if a split has occurred.
> >    While the underlying folio remains 1GB, folio_page_idx(folio, page)
> >    specifies the offset within the folio, and the npages corresponding to
> >    the mapping level is <= folio_nr_pages(folio); see the example below.
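(A concrete example with assumed numbers, not taken from the actual code:
removing a 2MB mapping that sits at the 4MB offset of a 1GB folio.)

	struct folio *folio  = page_folio(page);                  /* 1GB folio, 262144 pages */
	unsigned long idx    = folio_page_idx(folio, page);       /* 1024 (4MB / 4KB)        */
	unsigned long npages = KVM_PAGES_PER_HPAGE(PG_LEVEL_2M);  /* 512                     */

	/*
	 * idx + npages (1536) <= folio_nr_pages(folio) (262144), so the
	 * invalidation stays within the folio.
	 */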
> >
> >
> > > > [1] https://lore.kernel.org/all/20250807094202.4481-1-yan.y.zhao@intel.com/
> > > >
> > > > > IIUC guest_memfd can decide to split the complete huge folio down to
> > > > > 4K before zapping the hugepage EPT mappings. I think it's better to
> > > > > just round the pfn to the hugepage address based on the level it was
> > > > > mapped at instead of relying on the folio order.
