Message-ID: <aTjS/c8c5wNZcSgO@yzhao56-desk.sh.intel.com>
Date: Wed, 10 Dec 2025 09:55:09 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Vishal Annapurve <vannapurve@...gle.com>
CC: <pbonzini@...hat.com>, <seanjc@...gle.com>,
	<linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>, <x86@...nel.org>,
	<rick.p.edgecombe@...el.com>, <dave.hansen@...el.com>, <kas@...nel.org>,
	<tabba@...gle.com>, <ackerleytng@...gle.com>, <quic_eberman@...cinc.com>,
	<michael.roth@....com>, <david@...hat.com>, <vbabka@...e.cz>,
	<thomas.lendacky@....com>, <pgonda@...gle.com>, <zhiquan1.li@...el.com>,
	<fan.du@...el.com>, <jun.miao@...el.com>, <ira.weiny@...el.com>,
	<isaku.yamahata@...el.com>, <xiaoyao.li@...el.com>,
	<binbin.wu@...ux.intel.com>, <chao.p.peng@...el.com>
Subject: Re: [RFC PATCH v2 03/23] x86/tdx: Enhance
 tdh_phymem_page_wbinvd_hkid() to invalidate huge pages

On Tue, Dec 09, 2025 at 05:30:54PM -0800, Vishal Annapurve wrote:
> On Tue, Dec 9, 2025 at 5:20 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> >
> > On Tue, Dec 09, 2025 at 05:14:22PM -0800, Vishal Annapurve wrote:
> > > On Thu, Aug 7, 2025 at 2:42 AM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > > >
> > > > index 0a2b183899d8..8eaf8431c5f1 100644
> > > > --- a/arch/x86/kvm/vmx/tdx.c
> > > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > > @@ -1694,6 +1694,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > > >  {
> > > >         int tdx_level = pg_level_to_tdx_sept_level(level);
> > > >         struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > > +       struct folio *folio = page_folio(page);
> > > >         gpa_t gpa = gfn_to_gpa(gfn);
> > > >         u64 err, entry, level_state;
> > > >
> > > > @@ -1728,8 +1729,9 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > > >                 return -EIO;
> > > >         }
> > > >
> > > > -       err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
> > > > -
> > > > +       err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, folio,
> > > > +                                         folio_page_idx(folio, page),
> > > > +                                         KVM_PAGES_PER_HPAGE(level));
> > >
> > > This code seems to assume that folio_order() always matches the level
> > > at which the folio is mapped in the EPT entries.
> > I don't think so.
> > Please check the implementation of tdh_phymem_page_wbinvd_hkid() [1].
> > Only npages = KVM_PAGES_PER_HPAGE(level) pages are invalidated, and
> > npages is always <= folio_nr_pages(folio).
> 
> Is the gfn passed to tdx_sept_drop_private_spte() always huge-page
> aligned if the mapping is at huge-page granularity?
Yes.
The GFN passed to tdx_sept_set_private_spte() is huge-page aligned in
kvm_tdp_mmu_map(); the SEAMCALL TDH_MEM_PAGE_AUG would fail otherwise.
The GFN passed to tdx_sept_remove_private_spte() comes from the same mapping
entry in the mirror EPT, so it is aligned as well.
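
To make that invariant concrete, a minimal sketch (gfn_is_level_aligned() is
a hypothetical helper for illustration only, not code from this series;
KVM_PAGES_PER_HPAGE() and IS_ALIGNED() are the existing kernel macros):

/*
 * Illustration of the alignment invariant relied on above: a GFN mapped
 * at "level" must be aligned to KVM_PAGES_PER_HPAGE(level), which is a
 * power of two. For huge mappings, TDH_MEM_PAGE_AUG rejects GPAs that
 * violate this.
 */
static inline bool gfn_is_level_aligned(gfn_t gfn, int level)
{
	return IS_ALIGNED(gfn, KVM_PAGES_PER_HPAGE(level));
}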

> If the gfn/pfn is not aligned, then when the folio is split to 4K,
> page_folio() will return the same page, and folio_order() and
> folio_page_idx() will be zero. This can cause
> tdh_phymem_page_wbinvd_hkid() to fail.
> 
> If the expectation is that page_folio() will always point to the head
> page for a given hugepage-granularity mapping, then that logic will not
> work correctly IMO.
The current logic is as follows (see the sketch after this list):
1. tdh_mem_page_aug() maps physical memory starting from the page at
   "start_idx" within a "folio" and spanning "npages" contiguous PFNs,
   where npages corresponds to the mapping level: npages =
   KVM_PAGES_PER_HPAGE(level).
   e.g. it can map at 2MB level, starting from the 4MB offset in a 1GB
   folio.

2. If a split occurs, the huge 2MB mapping is split into 4KB ones, while
   the underlying folio remains 1GB.
   e.g. after the split, the 0th 4KB mapping points to the 4MB offset in
   the 1GB folio, the 1st 4KB mapping points to the 4MB+4KB offset, and
   so on. The mapping level after the split is 4KB.

3. tdx_sept_remove_private_spte() invokes tdh_mem_page_remove() and
   tdh_phymem_page_wbinvd_hkid().
   - The GFN is 2MB-aligned and the level is 2MB if no split has
     occurred, or
   - the GFN is 4KB-aligned and the level is 4KB if a split has occurred.
   Even though the underlying folio remains 1GB, folio_page_idx(folio,
   page) gives the starting offset within the folio, and the npages
   corresponding to the mapping level is always <= folio_nr_pages(folio).
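
To make that bookkeeping concrete, here is a rough sketch of the enhanced
helper as described above and in [1]; the SEAMCALL plumbing is elided and
the WARN_ON_ONCE() bounds check is my own illustration, not necessarily
what the series does:

/*
 * Sketch only: write back and invalidate "npages" contiguous 4KB pages
 * starting at "start_idx" within "folio". Only npages pages are touched,
 * so the requirement is start_idx + npages <= folio_nr_pages(folio),
 * not folio_order() == mapping level.
 */
u64 tdh_phymem_page_wbinvd_hkid(u16 hkid, struct folio *folio,
				unsigned long start_idx, unsigned long npages)
{
	unsigned long i;
	u64 err = 0;

	if (WARN_ON_ONCE(start_idx + npages > folio_nr_pages(folio)))
		return TDX_SW_ERROR; /* illustrative error code */

	for (i = 0; i < npages && !err; i++) {
		/*
		 * Per-page SEAMCALL on folio_page(folio, start_idx + i),
		 * elided here:
		 * err = SEAMCALL(TDH_PHYMEM_PAGE_WBINVD, hkid-keyed PA);
		 */
	}
	return err;
}

Working through the example above: a 1GB folio holds 262144 4KB pages, so a
2MB mapping at the 4MB offset gives start_idx = 4MB / 4KB = 1024 and
npages = KVM_PAGES_PER_HPAGE(2MB level) = 512, and 1024 + 512 <= 262144
holds. After a split to 4KB, each removal passes start_idx = 1024, 1025,
... with npages = 1.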


> > [1] https://lore.kernel.org/all/20250807094202.4481-1-yan.y.zhao@intel.com/
> >
> > > IIUC guest_memfd can decide to split the complete huge folio to 4K
> > > before zapping the hugepage EPT mappings. I think it's better to just
> > > round the pfn to the hugepage address based on the level it was
> > > mapped at, instead of relying on the folio order.
