Message-ID: <CAEvNRgGCpDniO2TFqY9cpCJ1Sf84tM_Q4pQCg0mNq25mEftTKw@mail.gmail.com>
Date: Mon, 12 Jan 2026 11:56:01 -0800
From: Ackerley Tng <ackerleytng@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>, Vishal Annapurve <vannapurve@...gle.com>,
Sean Christopherson <seanjc@...gle.com>, pbonzini@...hat.com, linux-kernel@...r.kernel.org,
kvm@...r.kernel.org, x86@...nel.org, rick.p.edgecombe@...el.com,
dave.hansen@...el.com, kas@...nel.org, tabba@...gle.com, michael.roth@....com,
david@...nel.org, sagis@...gle.com, vbabka@...e.cz, thomas.lendacky@....com,
nik.borisov@...e.com, pgonda@...gle.com, fan.du@...el.com, jun.miao@...el.com,
francescolavra.fl@...il.com, jgross@...e.com, ira.weiny@...el.com,
isaku.yamahata@...el.com, xiaoyao.li@...el.com, kai.huang@...el.com,
binbin.wu@...ux.intel.com, chao.p.peng@...el.com, chao.gao@...el.com
Subject: Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
Yan Zhao <yan.y.zhao@...el.com> writes:
>> > >> > I think the central question I have among all the above is what TDX
>> > >> > needs to actually care about (putting aside KVM's folio size/memory
>> > >> > contiguity vs mapping level rule for a while).
>> > >> >
>> > >> > I think TDX code can check what it cares about (if required to aid
>> > >> > debugging, as Dave suggested). Does TDX actually care about folio sizes,
>> > >> > or does it actually care about memory contiguity and alignment?
>> > >> TDX cares about memory contiguity. A single folio ensures memory contiguity.
>> > >
>> > > In this slightly unusual case, I think the guarantee needed here is
>> > > that as long as a range is mapped into SEPT entries, guest_memfd
>> > > ensures that the complete range stays private.
>> > >
>> > > i.e. I think it should be safe to rely on guest_memfd here,
>> > > irrespective of the folio sizes:
>> > > 1) KVM TDX stack should be able to reclaim the complete range when unmapping.
>> > > 2) KVM TDX stack can assume that as long as memory is mapped in SEPT
>> > > entries, guest_memfd will not let host userspace mappings access
>> > > guest private memory.
>> > >
>> > >>
>> > >> Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer
>> > >> reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a
>> > >> contiguous range larger than the page's folio range.
>> > >
>> > > What's the issue with passing the (struct page*, unsigned long nr_pages) pair?
>> > >
Please let us know what you think of this too: why not parametrize using a
(page, nr_pages) pair?
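
Something like the below is what we had in mind. This is a minimal sketch
only: the helper name is made up, and it assumes tdh_phymem_page_wbinvd_hkid()
keeps its current per-4K-page (hkid, struct page *) form:

static int tdx_wbinvd_range_hkid(u64 hkid, struct page *page,
				 unsigned long nr_pages)
{
	unsigned long pfn = page_to_pfn(page);
	unsigned long i;
	u64 err;

	for (i = 0; i < nr_pages; i++) {
		/*
		 * Walk by PFN so that the loop stays correct even when the
		 * contiguous range spans multiple folios.
		 */
		err = tdh_phymem_page_wbinvd_hkid(hkid, pfn_to_page(pfn + i));
		if (err)
			return -EIO;
	}

	return 0;
}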
>> > >>
>> > >> Additionally, we don't split private mappings in kvm_gmem_error_folio().
>> > >> If smaller folios are allowed, splitting private mapping is required there.
>> >
>> > It was discussed before that for memory failure handling, we will want
>> > to split huge pages; we will get to it! The trouble is that guest_memfd
>> > took the page from HugeTLB (unlike buddy or HugeTLB, which manage memory
>> > from the ground up), so we'll still need to figure out whether it's okay
>> > to let HugeTLB deal with it when freeing, and when I last looked, HugeTLB
>> > doesn't actually deal with poisoned folios on freeing, so there's more
>> > work to do on the HugeTLB side.
>> >
>> > This is a good point, although IIUC it is a separate issue. The need to
>> > split private mappings on memory failure is not for confidentiality in
>> > the TDX sense but to ensure that the guest doesn't use the failed
>> > memory. In that case, contiguity is broken by the failed memory. The
>> > folio is split, the private EPTs are split. The folio size should still
>> > not be checked in TDX code. guest_memfd knows contiguity got broken, so
>> > guest_memfd calls TDX code to split the EPTs.
>>
>> Hmm, maybe the key is that we need to split S-EPT first before allowing
>> guest_memfd to split the backend folio. If splitting S-EPT fails, don't do the
>> folio splitting.
>>
>> This is better than performing folio splitting while it's mapped as huge in
>> S-EPT, since in the latter case, kvm_gmem_error_folio() needs to try to split
>> S-EPT. If the S-EPT splitting fails, falling back to zapping the huge mapping in
>> kvm_gmem_error_folio() would still trigger the over-zapping issue.
>>
Let's put memory failure handling aside for now: since it currently zaps
the entire huge page, the ordering between S-EPT splitting and folio
splitting isn't affected there.
>> In the primary MMU, it follows the rule of unmapping a folio before splitting,
>> truncating, or migrating a folio. For S-EPT, considering the cost of zapping
>> more ranges than necessary, maybe a trade-off is to always split S-EPT before
>> allowing backend folio splitting.
>>
The mapping size <= folio size rule (for KVM and the primary MMU) is there
because a folio implies contiguity, so it is a safe way to map memory into
the guest. Folios are a core MM concept, so it makes sense that the primary
MMU relies on them.

IIUC the core of the rule isn't folio sizes, it's memory contiguity.
guest_memfd guarantees memory contiguity, and KVM should be able to rely on
that guarantee, especially since guest_memfd is virtualization-first and
KVM-first.

I think rules from the primary MMU are a good reference, but we shouldn't
copy them wholesale; KVM can rely on guest_memfd's guarantee of memory
contiguity.
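
If a debug-time check is still wanted (as Dave suggested), it could check
what TDX can verify locally, i.e. alignment, and lean on guest_memfd's
contiguity guarantee rather than folio sizes. A rough sketch, with a
made-up helper name:

static bool tdx_mapping_args_ok(kvm_pfn_t pfn, int level)
{
	unsigned long nr = KVM_PAGES_PER_HPAGE(level);

	/*
	 * Alignment is checkable here; contiguity of the backing memory is
	 * guest_memfd's guarantee and is not re-derived from folio sizes.
	 */
	return IS_ALIGNED(pfn, nr);
}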
>> Does this look good to you?
> So, the flow of converting 0-4KB from private to shared in a 1GB folio in
> guest_memfd is:
>
> a. If guest_memfd splits 1GB to 2MB first:
> 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 2MB for the rest range.
> 2. split folio
> 3. zap the 0-4KB mapping.
>
> b. If guest_memfd splits 1GB to 4KB directly:
> 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 4KB for the rest range.
> 2. split folio
> 3. zap the 0-4KB mapping.
>
> The flow of converting 0-2MB from private to shared in a 1GB folio in
> guest_memfd is:
>
> a. If guest_memfd splits 1GB to 2MB first:
> 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 2MB for the rest range.
> 2. split folio
> 3. zap the 0-2MB mapping.
>
> b. If guest_memfd splits 1GB to 4KB directly:
> 1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 4KB for the rest range.
> 2. split folio
> 3. zap the 0-2MB mapping.
>
>> So, to convert a 2MB range from private to shared, even though guest_memfd will
>> eventually zap the entire 2MB range, do the S-EPT splitting first! If it fails,
>> don't split the backend folio.
>>
>> Even if folio splitting may fail later, it just leaves split S-EPT mappings,
>> which matters little, especially after we support S-EPT promotion later.
>>
I hadn't considered leaving split S-EPT mappings around, since there is a
performance impact. Let me think about this a little.

Meanwhile, if the folios are split before the S-EPTs are split, as long as
a huge folio's worth of memory is guaranteed contiguous by guest_memfd for
KVM, what problems do you see?
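
For concreteness, here is how I'm reading the ordering you propose, as a
rough sketch only; every helper name below is a hypothetical placeholder
rather than an actual guest_memfd/KVM/TDX entry point:

/*
 * Private-to-shared conversion of [gfn, gfn + nr) inside a huge folio,
 * splitting S-EPT before the backend folio (orderings a/b above).
 */
static int gmem_convert_to_shared_sketch(struct kvm *kvm, struct folio *folio,
					 gfn_t gfn, unsigned long nr)
{
	int ret;

	/* 1. Split S-EPT down around the range being converted. */
	ret = kvm_split_private_mappings(kvm, gfn, nr);	/* hypothetical */
	if (ret)
		return ret;	/* backend folio left untouched */

	/*
	 * 2. Split the backend folio. On failure, split S-EPT mappings are
	 *    simply left behind (recoverable once promotion is supported).
	 */
	ret = gmem_split_folio(folio);			/* hypothetical */
	if (ret)
		return ret;

	/* 3. Zap only the converted range. */
	kvm_zap_private_mappings(kvm, gfn, nr);		/* hypothetical */

	return 0;
}

The ordering I'm asking about above would swap steps 1 and 2, relying on
guest_memfd to keep the whole huge-folio range contiguous and private while
it is still mapped in the S-EPT.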