Message-ID: <aWWQq6tHkK+97SOB@yzhao56-desk.sh.intel.com>
Date: Tue, 13 Jan 2026 14:10:47 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Ackerley Tng <ackerleytng@...gle.com>
CC: Vishal Annapurve <vannapurve@...gle.com>, Sean Christopherson
<seanjc@...gle.com>, <pbonzini@...hat.com>, <linux-kernel@...r.kernel.org>,
<kvm@...r.kernel.org>, <x86@...nel.org>, <rick.p.edgecombe@...el.com>,
<dave.hansen@...el.com>, <kas@...nel.org>, <tabba@...gle.com>,
<michael.roth@....com>, <david@...nel.org>, <sagis@...gle.com>,
<vbabka@...e.cz>, <thomas.lendacky@....com>, <nik.borisov@...e.com>,
<pgonda@...gle.com>, <fan.du@...el.com>, <jun.miao@...el.com>,
<francescolavra.fl@...il.com>, <jgross@...e.com>, <ira.weiny@...el.com>,
<isaku.yamahata@...el.com>, <xiaoyao.li@...el.com>, <kai.huang@...el.com>,
<binbin.wu@...ux.intel.com>, <chao.p.peng@...el.com>, <chao.gao@...el.com>
Subject: Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
On Mon, Jan 12, 2026 at 11:56:01AM -0800, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@...el.com> writes:
>
> >> > >> > I think the central question I have among all the above is what TDX
> >> > >> > needs to actually care about (putting aside what KVM's folio size/memory
> >> > >> > contiguity vs mapping level rule for a while).
> >> > >> >
> >> > >> > I think TDX code can check what it cares about (if required to aid
> >> > >> > debugging, as Dave suggested). Does TDX actually care about folio sizes,
> >> > >> > or does it actually care about memory contiguity and alignment?
> >> > >> TDX cares about memory contiguity. A single folio ensures memory contiguity.
> >> > >
> >> > > In this slightly unusual case, I think the guarantee needed here is
> >> > > that as long as a range is mapped into SEPT entries, guest_memfd
> >> > > ensures that the complete range stays private.
> >> > >
> >> > > i.e. I think it should be safe to rely on guest_memfd here,
> >> > > irrespective of the folio sizes:
> >> > > 1) KVM TDX stack should be able to reclaim the complete range when unmapping.
> >> > > 2) KVM TDX stack can assume that as long as memory is mapped in SEPT
> >> > > entries, guest_memfd will not let host userspace mappings access
> >> > > guest private memory.
> >> > >
> >> > >>
> >> > >> Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer
> >> > >> reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a
> >> > >> contiguous range larger than the page's folio range.
> >> > >
> >> > > What's the issue with passing the (struct page*, unsigned long nr_pages) pair?
> >> > >
>
> Please let us know what you think of this too: why not parametrize using
> page and nr_pages?
With a (struct page *, unsigned long nr_pages) pair, IMHO a warning is still
necessary when the entire range is not fully contained within a single folio.
I expressed the concern here:
https://lore.kernel.org/kvm/aWRfVOZpTUdYJ+7C@yzhao56-desk.sh.intel.com/
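To be concrete, a rough sketch of the kind of warning I have in mind (the
helper name below is made up for illustration, not actual TDX code):

	/*
	 * Warn if [page, page + nr_pages) is not fully contained in one
	 * folio, since a single folio is what guarantees contiguity today.
	 */
	static void tdx_check_range_in_folio(struct page *page,
					     unsigned long nr_pages)
	{
		struct folio *folio = page_folio(page);
		/* offset of @page within its folio, in base pages */
		unsigned long idx = folio_page_idx(folio, page);

		WARN_ON_ONCE(idx + nr_pages > folio_nr_pages(folio));
	}
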
> >> > >>
> >> > >> Additionally, we don't split private mappings in kvm_gmem_error_folio().
> >> > >> If smaller folios are allowed, splitting private mapping is required there.
> >> >
> >> > It was discussed before that for memory failure handling, we will want
> >> > to split huge pages, and we will get to it! The trouble is that
> >> > guest_memfd took the page from HugeTLB (unlike buddy or HugeTLB, which
> >> > manage memory from the ground up), so we'll still need to figure out
> >> > whether it's okay to let HugeTLB deal with it when freeing, and when I
> >> > last looked, HugeTLB doesn't actually deal with poisoned folios on
> >> > freeing, so there's more work to do on the HugeTLB side.
> >> >
> >> > This is a good point, although IIUC it is a separate issue. The need to
> >> > split private mappings on memory failure is not for confidentiality in
> >> > the TDX sense but to ensure that the guest doesn't use the failed
> >> > memory. In that case, contiguity is broken by the failed memory. The
> >> > folio is split, the private EPTs are split. The folio size should still
> >> > not be checked in TDX code. guest_memfd knows contiguity got broken, so
> >> > guest_memfd calls TDX code to split the EPTs.
> >>
> >> Hmm, maybe the key is that we need to split S-EPT first before allowing
> >> guest_memfd to split the backend folio. If splitting S-EPT fails, don't do the
> >> folio splitting.
> >>
> >> This is better than performing folio splitting while it's mapped as huge in
> >> S-EPT, since in the latter case, kvm_gmem_error_folio() needs to try to split
> >> S-EPT. If the S-EPT splitting fails, falling back to zapping the huge mapping in
> >> kvm_gmem_error_folio() would still trigger the over-zapping issue.
> >>
>
> Let's put memory failure handling aside for now, since it currently zaps
> the entire huge page, so there's no impact on ordering between S-EPT and
> folio split.
Relying on guest_memfd's specific implementation is not a good thing. e.g.,
suppose there's a version of guest_memfd that allocates folios from the buddy
allocator:
1. KVM maps a 2MB folio with a 2MB mapping.
2. guest_memfd splits the 2MB folio into 4KB folios, but fails and leaves the
2MB folio partially split.
3. Memory failure occurs on one of the split folios.
4. If splitting the S-EPT then fails, the over-zapping issue is still there.
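To make the ordering I'm suggesting concrete, a rough sketch (all helper
names here are hypothetical, for illustration only):

	static int gmem_convert_to_shared(struct kvm *kvm, gfn_t gfn,
					  unsigned long nr_pages,
					  struct folio *folio)
	{
		int ret;

		/* 1. Split S-EPT down to the conversion granularity first. */
		ret = gmem_split_sept(kvm, gfn, nr_pages);
		if (ret)
			return ret;	/* leave the folio untouched on failure */

		/*
		 * 2. Only then split the backing folio. A failure here just
		 *    leaves split S-EPT mappings behind, which matters little,
		 *    especially once S-EPT promotion is supported.
		 */
		ret = gmem_split_folio(folio);
		if (ret)
			return ret;

		/* 3. Finally zap the converted range's private mappings. */
		gmem_zap_private_range(kvm, gfn, nr_pages);
		return 0;
	}
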
> >> The primary MMU follows the rule of unmapping a folio before splitting,
> >> truncating, or migrating it. For S-EPT, considering the cost of zapping
> >> more ranges than necessary, maybe a trade-off is to always split S-EPT before
> >> allowing backend folio splitting.
> >>
>
> The mapping size <= folio size rule (for KVM and the primary MMU) is
> there because it is the safe way to map memory into the guest, since a
> folio implies contiguity. Folios are basically a core MM concept so it
> makes sense that the primary MMU relies on that.
So, why does the primary MMU need to unmap the folio and check its
refcount before splitting it?
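Conceptually (a simplified sketch, not the actual mm code), the primary MMU
requires the folio to be unmapped and its refcount frozen before a split, so
nothing can still rely on the old contiguity while it changes:

	static int sketch_primary_mmu_split(struct folio *folio)
	{
		/* must already be unmapped from all page tables */
		if (folio_mapped(folio))
			return -EBUSY;

		/*
		 * Fail if anyone else still holds a reference (the expected
		 * count is simplified here; the real code computes it more
		 * carefully).
		 */
		if (!folio_ref_freeze(folio, 1))
			return -EAGAIN;

		/* ... do the actual split while the refcount is frozen ... */

		folio_ref_unfreeze(folio, 1);
		return 0;
	}

That unmap + refcount check is the kind of guarantee I'm asking about.
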
> IIUC the core of the rule isn't folio sizes, it's memory
> contiguity. guest_memfd guarantees memory contiguity, and KVM should be
> able to rely on guest_memfd's guarantee, especially since guest_memfd is
> virtualization-first, and KVM-first.
>
> I think rules from the primary MMU are a good reference, but we
> shouldn't copy rules from the primary MMU, and KVM can rely on
> guest_memfd's guarantee of memory contiguity.
>
> >> Does this look good to you?
> > So, the flow of converting 0-4KB from private to shared in a 1GB folio in
> > guest_memfd is:
> >
> > a. If guest_memfd splits 1GB to 2MB first:
> > 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 2MB for the rest range.
> > 2. split folio
> > 3. zap the 0-4KB mapping.
> >
> > b. If guest_memfd splits 1GB to 4KB directly:
> > 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 4KB for the rest range.
> > 2. split folio
> > 3. zap the 0-4KB mapping.
> >
> > The flow of converting 0-2MB from private to shared in a 1GB folio in
> > guest_memfd is:
> >
> > a. If guest_memfd splits 1GB to 2MB first:
> > 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 2MB for the rest range.
> > 2. split folio
> > 3. zap the 0-2MB mapping.
> >
> > b. If guest_memfd splits 1GB to 4KB directly:
> > 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 4KB for the rest range.
> > 2. split folio
> > 3. zap the 0-2MB mapping.
> >
> >> So, to convert a 2MB range from private to shared, even though guest_memfd will
> >> eventually zap the entire 2MB range, do the S-EPT splitting first! If it fails,
> >> don't split the backend folio.
> >>
> >> Even if folio splitting may fail later, it just leaves split S-EPT mappings,
> >> which matters little, especially after we support S-EPT promotion later.
> >>
>
> I didn't consider leaving split S-EPT mappings since there is a
> performance impact. Let me think about this a little.
>
> Meanwhile, if the folios are split before the S-EPTs are split, as long
> as huge folios worth of memory are guaranteed contiguous by guest_memfd
> for KVM, what are the problems you see?
Hmm. As in the reply at
https://lore.kernel.org/kvm/aV4hAfPZXfKKB+7i@yzhao56-desk.sh.intel.com/,
there are pros and cons. I'll defer to the maintainers' decision.