Message-ID: <aWhaK+ikw8QkH4hU@yzhao56-desk.sh.intel.com>
Date: Thu, 15 Jan 2026 11:08:27 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Ackerley Tng <ackerleytng@...gle.com>
CC: Sean Christopherson <seanjc@...gle.com>, Vishal Annapurve
<vannapurve@...gle.com>, <pbonzini@...hat.com>,
<linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>, <x86@...nel.org>,
<rick.p.edgecombe@...el.com>, <dave.hansen@...el.com>, <kas@...nel.org>,
<tabba@...gle.com>, <michael.roth@....com>, <david@...nel.org>,
<sagis@...gle.com>, <vbabka@...e.cz>, <thomas.lendacky@....com>,
<nik.borisov@...e.com>, <pgonda@...gle.com>, <fan.du@...el.com>,
<jun.miao@...el.com>, <francescolavra.fl@...il.com>, <jgross@...e.com>,
<ira.weiny@...el.com>, <isaku.yamahata@...el.com>, <xiaoyao.li@...el.com>,
<kai.huang@...el.com>, <binbin.wu@...ux.intel.com>, <chao.p.peng@...el.com>,
<chao.gao@...el.com>
Subject: Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
On Wed, Jan 14, 2026 at 10:45:32AM -0800, Ackerley Tng wrote:
> Sean Christopherson <seanjc@...gle.com> writes:
> >> So, out of curiosity, do you know why linux kernel needs to unmap mappings from
> >> both primary and secondary MMUs, and check folio refcount before performing
> >> folio splitting?
> >
> > Because it's a straightforward rule for the primary MMU. Similar to guest_memfd,
> > if something is going through the effort of splitting a folio, then odds are very,
> > very good that the new folios can't be safely mapped as a contiguous hugepage.
> > Limiting mapping sizes to folios makes the rules/behavior straightforward for core
> > MM to implement, and for drivers/users to understand.
> >
> > Again like guest_memfd, there needs to be _some_ way for a driver/filesystem to
> > communicate the maximum mapping size; folios are the "currency" for doing so.
> >
> > And then for edge cases that want to map a split folio as a hugepage (if any such
> > edge cases exist), and thus take on the responsibility of managing the lifecycle
> > of the mappings, VM_PFNMAP and vmf_insert_pfn() provide the necessary functionality.
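> >
> > As a rough sketch of such an edge case (a hypothetical driver, not
> > anything in this series): the driver marks its VMA VM_PFNMAP and
> > inserts raw PFNs from its fault handler, so no folio refcounts are
> > taken and the driver alone manages the mapping lifecycle:
> >
> > static vm_fault_t my_driver_fault(struct vm_fault *vmf)
> > {
> > 	/* my_driver_lookup_pfn() is a made-up helper for illustration. */
> > 	unsigned long pfn = my_driver_lookup_pfn(vmf->vma->vm_private_data,
> > 						 vmf->pgoff);
> >
> > 	/* Inserts a special PTE; no struct page/folio refcount involved. */
> > 	return vmf_insert_pfn(vmf->vma, vmf->address, pfn);
> > }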
> >
>
> Here's my understanding, hope it helps: there might also be a
> practical/simpler reason for first unmapping, then checking refcounts,
> and only then splitting folios, and guest_memfd kind of does the same thing.
>
> Folio splitting races with lots of other things in the kernel, and the
> folio lock isn't super useful because the lock itself is going to be
> split up.
>
> Folio splitting wants all users to stop using this folio, so one big
> source of users is mappings. Hence, get those mappers (both primary and
> secondary MMUs) to unmap.
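>
> (Sketching the "get the mappers to unmap" step with the generic helper;
> the actual call sites vary. Zapping the PTEs also fires mmu notifiers,
> which is what reaches the secondary MMUs:)
>
> 	/* Drop all userspace PTEs covering this range of the file;
> 	 * mmu notifier invalidation runs as part of the zap, so
> 	 * secondary MMUs like KVM unmap as well.
> 	 */
> 	unmap_mapping_range(inode->i_mapping, start, len, 0);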
>
> Core-mm-managed mappings take a refcount, so those refcounts go away. Of
> the secondary MMUs that subscribe via mmu notifiers, KVM doesn't take a
> refcount, but KVM does unmap as requested, so that still falls in line
> with "stop using this folio".
>
> I think the refcounting check isn't actually necessary if all users of
> folios STOP using the folio on request (via mmu notifiers or
> otherwise). Unfortunately, there are users other than mappers. The
> best way to find these users is to check the refcount. The refcount
> check is asking "how many other users are left?", and if the number of
> users is as expected (just the filemap, or whatever else is expected),
> then splitting can go ahead, since the splitting code is now confident
> that the remaining users won't try to use the folio metadata while
> splitting is happening.
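>
> (Illustratively, and hedged: the real split paths derive the expected
> count from mapcounts, private flags, extra pins, etc., but the check
> amounts to something like:)
>
> 	/* After unmapping, only the expected references (e.g. the
> 	 * filemap's, plus our own temporary one) should remain; any
> 	 * extra ref is an unknown user we must not split under.
> 	 */
> 	if (folio_ref_count(folio) != expected_refs)
> 		return -EAGAIN;	/* back off, someone still holds the folio */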
>
>
> guest_memfd does a modified version of that on shared to private
> conversions. guest_memfd will unmap from host userspace page tables for
> the same reason: to tell all the host users to stop using the folio. The
> unmapping also triggers mmu notifiers, so the stage 2 mappings go away
> too (TBD if this should be skipped), and that's okay because these are
> shared pages: on any failure, guest accesses will just map them back in,
> so it doesn't break guests.
>
> At this point all the mappers are gone, and guest_memfd checks
> refcounts to make sure that guest_memfd itself is the only user of the
> folio. If the refcount is as expected, guest_memfd can confidently
> continue with splitting folios, since other folio accesses will be
> locked out by the filemap invalidate lock.
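>
> (A hedged pseudo-sequence of that conversion path; the gmem_*() names
> are illustrative, not the actual guest_memfd functions:)
>
> 	filemap_invalidate_lock(inode->i_mapping);
> 	/* Unmap from host userspace; mmu notifiers drop stage 2 too. */
> 	unmap_mapping_range(inode->i_mapping, start, len, 0);
> 	if (folio_ref_count(folio) != gmem_expected_refs(folio)) {
> 		/* e.g. pinned for IOMMU: fail and let userspace sort it out */
> 		ret = -EBUSY;
> 		goto out_unlock;
> 	}
> 	ret = gmem_split_folio(folio);
> out_unlock:
> 	filemap_invalidate_unlock(inode->i_mapping);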
>
> The one main guest_memfd folio user that won't go away on an unmap call
> is a pin taken for IOMMU access. In this case, guest_memfd fails the
> conversion and returns an error to userspace so userspace can sort out
> the IOMMU unpinning.
>
>
> As for private to shared conversions, folio merging would require the
> same guarantee: that nobody else is using the folios (the folio
> metadata). guest_memfd skips that check because for private memory, KVM
> is the only other user, and guest_memfd knows KVM doesn't use folio
> metadata once the memory is mapped for the guest.
Ok. That makes sense. Thanks for the explanation.
It looks like guest_memfd also rules out concurrent folio metadata access by
holding the filemap_invalidate_lock.
BTW: Could that potentially cause a guest soft lockup due to holding the
filemap_invalidate_lock for too long?