Message-ID: <aV4hAfPZXfKKB+7i@yzhao56-desk.sh.intel.com>
Date: Wed, 7 Jan 2026 17:03:41 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Ackerley Tng <ackerleytng@...gle.com>, Vishal Annapurve
	<vannapurve@...gle.com>, <pbonzini@...hat.com>,
	<linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>, <x86@...nel.org>,
	<rick.p.edgecombe@...el.com>, <dave.hansen@...el.com>, <kas@...nel.org>,
	<tabba@...gle.com>, <michael.roth@....com>, <david@...nel.org>,
	<sagis@...gle.com>, <vbabka@...e.cz>, <thomas.lendacky@....com>,
	<nik.borisov@...e.com>, <pgonda@...gle.com>, <fan.du@...el.com>,
	<jun.miao@...el.com>, <francescolavra.fl@...il.com>, <jgross@...e.com>,
	<ira.weiny@...el.com>, <isaku.yamahata@...el.com>, <xiaoyao.li@...el.com>,
	<kai.huang@...el.com>, <binbin.wu@...ux.intel.com>, <chao.p.peng@...el.com>,
	<chao.gao@...el.com>
Subject: Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory

On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote:
> On Tue, Jan 06, 2026, Ackerley Tng wrote:
> > Sean Christopherson <seanjc@...gle.com> writes:
> > 
> > > On Tue, Jan 06, 2026, Ackerley Tng wrote:
> > >> Vishal Annapurve <vannapurve@...gle.com> writes:
> > >>
> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > >> >>
> > >> >> - EPT mapping size and folio size
> > >> >>
> > >> >>   This series is built upon the rule in KVM that the mapping size in the
> > >> >>   KVM-managed secondary MMU is no larger than the backend folio size.
> > >> >>
> > >>
> > >> I'm not familiar with this rule and would like to find out more. Why is
> > >> this rule imposed?
> > >
> > > Because it's the only sane way to safely map memory into the guest? :-D
> > >
> > >> Is this rule there just because traditionally folio sizes also define the
> > >> limit of contiguity, and so the mapping size must not be greater than folio
> > >> size in case the block of memory represented by the folio is not contiguous?
> > >
> > > Pre-guest_memfd, KVM didn't care about folios.  KVM's mapping size was (and still
> > > is) strictly bound by the host mapping size.  That handles contiguous addresses,
> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes.
> > >
> > >> In guest_memfd's case, even if the folio is split (just for refcount
> > >> tracking purposes on private to shared conversion), the memory is still
> > >> contiguous up to the original folio's size. Will the contiguity address
> > >> the concerns?
> > >
> > > Not really?  Why would the folio be split if the memory _and its attributes_ are
> > > fully contiguous?  If the attributes are mixed, KVM must not create a mapping
> > > spanning mixed ranges, i.e. with multiple folios.
> > 
> > The folio can be split if any (or all) of the pages in a huge page range
> > are shared (in the CoCo sense). So in a 1G block of memory, even if the
> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio
> > would be split, and the split folios are necessary for tracking users of
> > shared pages using struct page refcounts.
> 
> Ahh, that's what the refcounting was referring to.  Gotcha.
> 
> > However the split folios in that 1G range are still fully contiguous.
> > 
> > The process of conversion will split the EPT entries soon after the
> > folios are split so the rule remains upheld.
Overall, I don't think allowing folios smaller than the mappings while
conversion is in progress brings enough benefit.

Cons:
(1) TDX's zapping callback has no idea whether a zap is triggered by an
    in-progress private-to-shared conversion or by something else. Nor does it
    know whether the attributes of the underlying folios remain unchanged
    during an in-progress private-to-shared conversion. So even if the
    assertion Ackerley mentioned holds, it's not easy to drop the sanity
    checks in TDX's zapping callback for in-progress private-to-shared
    conversions alone (and doing so would increase TDX's dependency on
    guest_memfd's specific implementation even if it were feasible). See the
    sketch after this list for the kind of check in question.

    Removing the sanity checks from TDX's zapping callback entirely is
    confusing and would encode a false expectation in KVM -- what if, under
    other conditions, a huge folio is incorrectly split (by a buggy
    guest_memfd or something else) while it's still mapped in KVM? And do we
    then still need the check in TDX's mapping callback? If not, does that
    mean TDX huge pages can stop relying on guest_memfd's ability to allocate
    huge folios, since KVM could still create huge mappings as long as the
    small folios are physically contiguous and have homogeneous memory
    attributes?

(2) Allowing folios smaller than the mapping would require splitting the S-EPT
    in kvm_gmem_error_folio() before kvm_gmem_zap(). One may argue that the
    invalidate lock held in __kvm_gmem_set_attributes() guards against a
    concurrent kvm_gmem_error_folio(), but it still doesn't seem clean and
    looks error-prone. (The same may apply to kvm_gmem_migrate_folio().)
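
For reference, the kind of check being discussed is roughly the following.
This is a hypothetical sketch, not the actual series code; the name
tdx_sanity_check_zap() and its exact form are mine:

static void tdx_sanity_check_zap(struct page *page, enum pg_level level)
{
        struct folio *folio = page_folio(page);

        /*
         * A mapping at 'level' covers 2^KVM_HPAGE_GFN_SHIFT(level) base
         * pages.  Warn if the backing folio is smaller than the mapping
         * being removed.  A folio that guest_memfd split mid-conversion
         * is indistinguishable here from a genuinely undersized folio.
         */
        WARN_ON_ONCE(folio_order(folio) < KVM_HPAGE_GFN_SHIFT(level));
}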

Pro: Not zapping private memory until the conversion has succeeded is good.

However, could we achieve this benefit some other way? For example, is it
possible to make hugetlb_restructuring_split_folio() unable to fail, by making
split_entries() unable to fail (via pre-allocation?) and disabling the
hugetlb_vmemmap optimization? (hugetlb_vmemmap conversion is super slow in my
observation and I always disable it.) Or by pre-allocating for
vmemmap_remap_alloc()?
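
As an illustration of the pre-allocation idea only, it would follow the same
prepare/commit pattern KVM already uses with kvm_mmu_memory_cache. The
gmem_split_* names below are invented, not from the series:

static int gmem_split_prepare(struct kvm_mmu_memory_cache *cache, int min)
{
        /* May fail with -ENOMEM, but nothing has been modified yet. */
        return kvm_mmu_topup_memory_cache(cache, min);
}

static void gmem_split_commit(struct folio *folio,
                              struct kvm_mmu_memory_cache *cache)
{
        /*
         * The actual split would go here, consuming only objects from
         * 'cache' (e.g. pre-allocated pages), so this step would have
         * no failure path.
         */
}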

Dropping TDX's sanity check should only be a last resort. IMHO, zapping
private memory before the conversion succeeds is still better than introducing
a mismatch between folio size and mapping size.
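
Concretely, the ordering I have in mind is roughly the sketch below, which
keeps the "mapping size <= folio size" rule at every step. All helper names
here are placeholders, not actual functions from this series or guest_memfd:

static int gmem_convert_to_shared(struct inode *inode, pgoff_t start, pgoff_t end)
{
        int ret;

        /* Step 1: zap/split the S-EPT mappings covering [start, end). */
        ret = gmem_zap_range(inode, start, end);        /* placeholder */
        if (ret)
                return ret;

        /*
         * Step 2: only now split the backing folios for shared-refcount
         * tracking.  A failure here means private memory was zapped
         * without the conversion completing -- the downside accepted
         * above as preferable to folios smaller than live mappings.
         */
        ret = gmem_split_folios(inode, start, end);     /* placeholder */
        if (ret)
                return ret;

        /* Step 3: flip the shareability attribute for the range. */
        return gmem_set_shared(inode, start, end);      /* placeholder */
}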

> > I guess perhaps the question is, is it okay if the folios are smaller
> > than the mapping while conversion is in progress? Does the order matter
> > (split page table entries first vs split folios first)?
> 
> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
> conceptually totally fine, i.e. I'm not totally opposed to adding support for
> mapping multiple guest_memfd folios with a single hugepage.   As to whether we
> do (a) nothing, (b) change the refcounting, or (c) add support for mapping
> multiple folios in one page, probably comes down to which option provides "good
> enough" performance without incurring too much complexity.
