Message-ID: <CAGtprH9vdpAGDNtzje=7faHBQc9qTSF2fUEGcbCkfJehFuP-rw@mail.gmail.com>
Date: Wed, 31 Dec 2025 11:37:26 -0800
From: Vishal Annapurve <vannapurve@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: pbonzini@...hat.com, seanjc@...gle.com, linux-kernel@...r.kernel.org,
kvm@...r.kernel.org, x86@...nel.org, rick.p.edgecombe@...el.com,
dave.hansen@...el.com, kas@...nel.org, tabba@...gle.com,
ackerleytng@...gle.com, quic_eberman@...cinc.com, michael.roth@....com,
david@...hat.com, vbabka@...e.cz, thomas.lendacky@....com, pgonda@...gle.com,
zhiquan1.li@...el.com, fan.du@...el.com, jun.miao@...el.com,
ira.weiny@...el.com, isaku.yamahata@...el.com, xiaoyao.li@...el.com,
binbin.wu@...ux.intel.com, chao.p.peng@...el.com
Subject: Re: [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid()
to invalidate huge pages
On Tue, Dec 9, 2025 at 5:57 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
>
> On Tue, Dec 09, 2025 at 05:30:54PM -0800, Vishal Annapurve wrote:
> > On Tue, Dec 9, 2025 at 5:20 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > >
> > > On Tue, Dec 09, 2025 at 05:14:22PM -0800, Vishal Annapurve wrote:
> > > > On Thu, Aug 7, 2025 at 2:42 AM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > > > >
> > > > > index 0a2b183899d8..8eaf8431c5f1 100644
> > > > > --- a/arch/x86/kvm/vmx/tdx.c
> > > > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > > > @@ -1694,6 +1694,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > > > > {
> > > > > int tdx_level = pg_level_to_tdx_sept_level(level);
> > > > > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > > > + struct folio *folio = page_folio(page);
> > > > > gpa_t gpa = gfn_to_gpa(gfn);
> > > > > u64 err, entry, level_state;
> > > > >
> > > > > @@ -1728,8 +1729,9 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > > > > return -EIO;
> > > > > }
> > > > >
> > > > > - err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
> > > > > -
> > > > > + err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, folio,
> > > > > + folio_page_idx(folio, page),
> > > > > + KVM_PAGES_PER_HPAGE(level));
> > > >
> > > > This code seems to assume that folio_order() always matches the level
> > > > at which it is mapped in the EPT entries.
> > > I don't think so.
> > > Please check the implementation of tdh_phymem_page_wbinvd_hkid() [1].
> > > Only npages=KVM_PAGES_PER_HPAGE(level) will be invalidated, while npages
> > > <= folio_nr_pages(folio).
> >
> > Is the gfn passed to tdx_sept_drop_private_spte() always huge page
> > aligned if mapping is at huge page granularity?
> Yes.
> The GFN passed to tdx_sept_set_private_spte() is huge page aligned in
> kvm_tdp_mmu_map(). SEAMCALL TDH_MEM_PAGE_AUG will also fail otherwise.
> The GFN passed to tdx_sept_remove_private_spte() comes from the same mapping
> entry in the mirror EPT.
>
> > If the gfn/pfn is not aligned, then once the folio is split to 4K,
> > page_folio() will return the page itself as an order-0 folio, and
> > folio_order() and folio_page_idx() will both be zero. This can cause
> > tdh_phymem_page_wbinvd_hkid() to return failure.
> >
> > If the expectation is that page_folio() will always point to a head
> > page for a given hugepage-granularity mapping, then that logic will not
> > work correctly IMO.
> The current logic is that:
> 1. tdh_mem_page_aug() maps physical memory starting from the page at "start_idx"
> within a "folio" and spanning "npages" contiguous PFNs.
> (npages corresponds to the mapping level KVM_PAGES_PER_HPAGE(level)).
> e.g. it can map at the 2MB level, starting from the 4MB offset within a
> 1GB folio.
>
> 2. If a split occurs, the huge 2MB mapping will be split into 4KB ones,
> while the underlying folio remains 1GB.
The private-to-shared conversion flow discussed so far [1][2][3]:
1) Preallocate maple tree entries needed for conversion
2) Split filemap range being converted to 4K pages
3) Mark KVM MMU invalidation begin for the huge page aligned range
4) Zap KVM MMU entries for the converted range
5) Update maple tree entries to carry final attributes
6) Mark KVM MMU invalidation end for huge page aligned range
A possible addition to the above flow, also splitting cross-boundary leafs
(a rough sketch of this ordering follows the list):
1) Preallocate maple tree entries needed for conversion
2) Split filemap range being converted to 4K pages
3) Mark KVM MMU invalidation begin for the huge page aligned range
4) Split KVM MMU private boundary leafs for converted range
5) Zap KVM MMU entries for the converted range
6) Update maple tree entries to carry final attributes
7) Mark KVM MMU invalidation end for huge page aligned range
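To make the intended ordering concrete, below is a rough, hypothetical
skeleton of the second flow. Every type and helper in it (gmem_range,
gmem_prealloc_attr_entries(), ...) is an illustrative placeholder, not an
actual guest_memfd/KVM interface:

/* Placeholder type standing in for the range being converted. */
struct gmem_range { pgoff_t start, end; };

int gmem_prealloc_attr_entries(struct gmem_range *r);          /* step 1 */
int gmem_split_filemap_to_4k(struct gmem_range *r);            /* step 2 */
void gmem_mmu_invalidate_begin(struct gmem_range *r);          /* step 3 */
int gmem_split_private_boundary_leafs(struct gmem_range *r);   /* step 4 */
void gmem_zap_mmu_range(struct gmem_range *r);                 /* step 5 */
void gmem_commit_attr_entries(struct gmem_range *r);           /* step 6 */
void gmem_mmu_invalidate_end(struct gmem_range *r);            /* step 7 */

static int gmem_convert_private_to_shared(struct gmem_range *r)
{
        int ret;

        /* 1) Preallocate maple tree entries needed for the conversion. */
        ret = gmem_prealloc_attr_entries(r);
        if (ret)
                return ret;

        /* 2) Split the filemap range being converted down to 4K pages. */
        ret = gmem_split_filemap_to_4k(r);
        if (ret)
                return ret;

        /* 3) Begin KVM MMU invalidation for the huge page aligned range. */
        gmem_mmu_invalidate_begin(r);

        /* 4) Split private boundary leafs crossing the converted range. */
        ret = gmem_split_private_boundary_leafs(r);
        if (!ret) {
                /* 5) Zap the KVM MMU entries for the converted range. */
                gmem_zap_mmu_range(r);

                /* 6) Update maple tree entries to carry final attributes. */
                gmem_commit_attr_entries(r);
        }

        /* 7) End KVM MMU invalidation for the huge page aligned range. */
        gmem_mmu_invalidate_end(r);

        return ret;
}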
Note that in both of the above flows the KVM MMU entries get zapped after
the folio has been split to 4K, i.e. by the time
tdx_sept_remove_private_spte() runs the folio is already split but the EPT
entry level is still 2M, so the assumption that an EPT entry always maps a
subset of a single folio no longer holds.
I think things might be simplified if the KVM TDX stack always operates on
pages without assuming that the ranges it handles are covered by "folios".
[1] https://lore.kernel.org/kvm/aN8P87AXlxlEDdpP@google.com/
[2] https://lore.kernel.org/kvm/diqzzf8oazh4.fsf@google.com/
[3] https://github.com/googleprodkernel/linux-cc/blob/9ee2bd65cc9b63c871f8f49d217a7a70576a942d/virt/kvm/guest_memfd.c#L894
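As a strawman for operating purely on pages and the mapping level (i.e. the
earlier suggestion of rounding the pfn based on the level it was mapped at,
rather than consulting the folio), a minimal hypothetical sketch could look
like the below. tdx_wbinvd_4k_page() is a made-up stand-in for a
per-4K-page TDH.PHYMEM.PAGE.WBINVD wrapper, not a helper from this series:

/* Hypothetical per-4K-page cache flush wrapper. */
int tdx_wbinvd_4k_page(u16 hkid, kvm_pfn_t pfn);

static int tdx_wbinvd_mapped_range(u16 hkid, kvm_pfn_t pfn, int level)
{
        unsigned long npages = KVM_PAGES_PER_HPAGE(level);
        /* Round down to the boundary of the level this pfn was mapped at. */
        kvm_pfn_t base = pfn & ~((kvm_pfn_t)npages - 1);
        unsigned long i;
        int ret;

        for (i = 0; i < npages; i++) {
                ret = tdx_wbinvd_4k_page(hkid, base + i);
                if (ret)
                        return ret;
        }

        return 0;
}

This way the flush range depends only on the pfn and the level recorded in
the (mirror) EPT entry, and stays correct regardless of whether the backing
folio has already been split by guest_memfd.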
> e.g. now the 0th 4KB mapping after split points to the 4MB offset in the
> 1GB folio, and the 1st 4KB mapping points to the 4MB+4KB offset...
> The mapping level after split is 4KB.
>
> 3. tdx_sept_remove_private_spte() invokes tdh_mem_page_remove() and
> tdh_phymem_page_wbinvd_hkid().
> - The GFN is 2MB aligned and the level is 2MB if no split occurred, or
> - the GFN is 4KB aligned and the level is 4KB if a split has occurred.
> While the underlying folio remains 1GB, the folio_page_idx(folio, page)
> specifies the offset in the folio, and the npages corresponding to
> the mapping level is <= folio_nr_pages(folio).
>
>
> > > [1] https://lore.kernel.org/all/20250807094202.4481-1-yan.y.zhao@intel.com/
> > >
> > > > IIUC guest_memfd can decide to split the complete huge folio down to
> > > > 4K before zapping the hugepage EPT mappings. I think it's better to
> > > > just round the pfn down to the hugepage address based on the level it
> > > > was mapped at, instead of relying on the folio order.