Message-ID: <CAEvNRgGrPr3f9qpfW3KHx-fFLqYOL4u2pQkMUDqfC2-Lh63ePQ@mail.gmail.com>
Date: Thu, 15 Jan 2026 10:13:58 -0800
From: Ackerley Tng <ackerleytng@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: Sean Christopherson <seanjc@...gle.com>, Vishal Annapurve <vannapurve@...gle.com>, pbonzini@...hat.com, 
	linux-kernel@...r.kernel.org, kvm@...r.kernel.org, x86@...nel.org, 
	rick.p.edgecombe@...el.com, dave.hansen@...el.com, kas@...nel.org, 
	tabba@...gle.com, michael.roth@....com, david@...nel.org, sagis@...gle.com, 
	vbabka@...e.cz, thomas.lendacky@....com, nik.borisov@...e.com, 
	pgonda@...gle.com, fan.du@...el.com, jun.miao@...el.com, 
	francescolavra.fl@...il.com, jgross@...e.com, ira.weiny@...el.com, 
	isaku.yamahata@...el.com, xiaoyao.li@...el.com, kai.huang@...el.com, 
	binbin.wu@...ux.intel.com, chao.p.peng@...el.com, chao.gao@...el.com
Subject: Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory

Yan Zhao <yan.y.zhao@...el.com> writes:

> On Wed, Jan 14, 2026 at 10:45:32AM -0800, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@...gle.com> writes:
>> >> So, out of curiosity, do you know why the Linux kernel needs to unmap
>> >> mappings from both primary and secondary MMUs, and check the folio
>> >> refcount, before performing folio splitting?
>> >
>> > Because it's a straightforward rule for the primary MMU.  Similar to guest_memfd,
>> > if something is going through the effort of splitting a folio, then odds are very,
>> > very good that the new folios can't be safely mapped as a contiguous hugepage.
>> > Limiting mapping sizes to folios makes the rules/behavior straightforward for core
>> > MM to implement, and for drivers/users to understand.
>> >
>> > Again like guest_memfd, there needs to be _some_ way for a driver/filesystem to
>> > communicate the maximum mapping size; folios are the "currency" for doing so.
>> >
>> > And then for edge cases that want to map a split folio as a hugepage (if any such
>> > edge cases exist), and thus take on the responsibility of managing the lifecycle
>> > of the mappings, VM_PFNMAP and vmf_insert_pfn() provide the necessary functionality.
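>> >
>> > (A minimal sketch of that escape hatch, assuming a hypothetical driver;
>> > the my_drv_* names are made up.  With VM_PFNMAP the fault handler inserts
>> > raw PFNs, so no folio/rmap accounting constrains the mapping, and the
>> > driver owns teardown.  For an actual huge mapping, the ->huge_fault()
>> > path with vmf_insert_pfn_pmd() is the analogue.)
>> >
>> > static vm_fault_t my_drv_fault(struct vm_fault *vmf)
>> > {
>> > 	/* Hypothetical lookup: the driver alone decides which PFN
>> > 	 * backs this offset, independent of any folio state.
>> > 	 */
>> > 	unsigned long pfn = my_drv_pgoff_to_pfn(vmf->vma, vmf->pgoff);
>> >
>> > 	/* No struct page refcounting here; the driver must call
>> > 	 * unmap_mapping_range() itself when the backing changes.
>> > 	 */
>> > 	return vmf_insert_pfn(vmf->vma, vmf->address, pfn);
>> > }
>> >
>> > static const struct vm_operations_struct my_drv_vm_ops = {
>> > 	.fault = my_drv_fault,
>> > };
>> >
>> > /* In ->mmap(): vm_flags_set(vma, VM_PFNMAP); vma->vm_ops = &my_drv_vm_ops; */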
>> >
>>
>> Here's my understanding, hope it helps: there might also be a
>> practical/simpler reason for first unmapping, then checking refcounts, and
>> only then splitting folios; guest_memfd does much the same thing.
>>
>> Folio splitting races with lots of other things in the kernel, and the
>> folio lock isn't super useful because the lock itself is going to be
>> split up.
>>
>> Folio splitting wants all users to stop using the folio, and one big
>> source of users is mappings. Hence, get those mappers (in both primary and
>> secondary MMUs) to unmap.
>>
>> Core-mm-managed mappings take a refcount, so unmapping drops those
>> refcounts. Among the secondary MMUs using mmu notifiers, KVM doesn't take
>> a refcount, but KVM does unmap as requested, so that still falls in line
>> with "stop using this folio".
>>
>> I think the refcounting check wouldn't actually be necessary if all users
>> of folios stopped using the folio on request (via mmu notifiers or
>> otherwise). Unfortunately, there are users other than mappers, and the
>> best way to find those users is to check the refcount. The refcount check
>> is asking "how many other users are left?", and if the number of users is
>> as expected (just the filemap, or whatever else is expected), then
>> splitting can go ahead, since the splitting code can be confident the
>> remaining users won't try to use the folio metadata while splitting is
>> happening.
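>>
>> Roughly, as a sketch (not the real mm/huge_memory.c code, which atomically
>> freezes the refcount rather than doing a racy read like this, and which
>> computes the expected count from mapcount/swapcache/private state;
>> expected_refs here is illustrative, and the caller holds the folio lock):
>>
>> static int try_split(struct folio *folio, int expected_refs)
>> {
>> 	/* Kick out the mappers.  This also fires mmu notifiers, so
>> 	 * secondary MMUs like KVM unmap too.
>> 	 */
>> 	unmap_mapping_pages(folio->mapping, folio->index,
>> 			    folio_nr_pages(folio), false);
>>
>> 	/* Users we can't see via mappings still show up as
>> 	 * references; if anyone unexpected remains, don't rip the
>> 	 * folio metadata out from under them.
>> 	 */
>> 	if (folio_ref_count(folio) != expected_refs)
>> 		return -EBUSY;
>>
>> 	return split_folio(folio);	/* safe to carve it up now */
>> }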
>>
>>
>> guest_memfd does a modified version of that on shared-to-private
>> conversions. guest_memfd will unmap from host userspace page tables for
>> the same reason, mainly to tell all the host users to stop using the
>> folio. The unmapping also triggers mmu notifiers, so the stage 2 mappings
>> go away too (TBD if this should be skipped); that's okay because they're
>> shared pages: the guest will just fault them back in, and nothing breaks
>> for the guest.
>>
>> At this point all the mappers are gone, so guest_memfd checks refcounts to
>> make sure that guest_memfd itself is the only remaining user of the folio.
>> If the refcount is as expected, guest_memfd can confidently continue with
>> splitting folios, since other folio accesses are locked out by the filemap
>> invalidate lock.
>>
>> The one major guest_memfd folio user that won't go away on an unmap call
>> is a pin on the folios for IOMMU access. In this case, guest_memfd fails
>> the conversion and returns an error to userspace so userspace can sort out
>> the IOMMU unpinning.
>>
>>
>> As for private-to-shared conversions, folio merging would require the same
>> guarantee: that nobody else is using the folios (the folio metadata).
>> guest_memfd skips that check because for private memory, KVM is the only
>> other user, and guest_memfd knows KVM doesn't use folio metadata once the
>> memory is mapped for the guest.
> Ok. That makes sense. Thanks for the explanation.
> It looks like guest_memfd also rules out concurrent folio metadata access by
> holding the filemap_invalidate_lock.
>
> BTW: Could that potentially cause a guest soft lockup due to holding the
> filemap_invalidate_lock for too long?

Yes, potentially. You mean because the vCPUs are all blocked on page
faults, right? We can definitely optimize later, perhaps by locking
guest_memfd index ranges instead of the whole file.
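
A very rough, purely illustrative sketch of that idea (names and constants
are made up): shard the invalidation lock by index, so a conversion over
one range doesn't block faults on unrelated ranges.

#define GMEM_LOCK_SHARDS	64

struct gmem_range_locks {
	struct rw_semaphore shard[GMEM_LOCK_SHARDS];
};

static struct rw_semaphore *gmem_shard(struct gmem_range_locks *locks,
				       pgoff_t index)
{
	/* One 2M unit (512 4K pages) per shard step, hypothetical. */
	return &locks->shard[(index >> 9) % GMEM_LOCK_SHARDS];
}

/*
 * Fault path: down_read(gmem_shard(locks, index)) for just the faulting
 * index.  Conversion path: down_write() every shard covering
 * [start, start + nr), in ascending order to avoid deadlock, instead of
 * taking one big filemap_invalidate_lock().
 */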
