Message-ID: <aE/1TgUvr0dcaJUg@yzhao56-desk.sh.intel.com>
Date: Mon, 16 Jun 2025 18:43:26 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Ackerley Tng <ackerleytng@...gle.com>
CC: <vannapurve@...gle.com>, <pbonzini@...hat.com>, <seanjc@...gle.com>,
<linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>, <x86@...nel.org>,
<rick.p.edgecombe@...el.com>, <dave.hansen@...el.com>,
<kirill.shutemov@...el.com>, <tabba@...gle.com>, <quic_eberman@...cinc.com>,
<michael.roth@....com>, <david@...hat.com>, <vbabka@...e.cz>,
<jroedel@...e.de>, <thomas.lendacky@....com>, <pgonda@...gle.com>,
<zhiquan1.li@...el.com>, <fan.du@...el.com>, <jun.miao@...el.com>,
<ira.weiny@...el.com>, <isaku.yamahata@...el.com>, <xiaoyao.li@...el.com>,
<binbin.wu@...ux.intel.com>, <chao.p.peng@...el.com>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge
pages
On Thu, Jun 05, 2025 at 02:12:58PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@...el.com> writes:
>
> > On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
> >> Yan Zhao <yan.y.zhao@...el.com> writes:
> >>
> >> > On Mon, May 12, 2025 at 09:53:43AM -0700, Vishal Annapurve wrote:
> >> >> On Sun, May 11, 2025 at 7:18 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> >> >> > ...
> >> >> > >
> >> >> > > I might be wrongly throwing out some terminologies here then.
> >> >> > > VM_PFNMAP flag can be set for memory backed by folios/page structs.
> >> >> > > udmabuf seems to be working with pinned "folios" in the backend.
> >> >> > >
> >> >> > > The goal is to get to a stage where guest_memfd is backed by pfn
> >> >> > > ranges unmanaged by kernel that guest_memfd owns and distributes to
> >> >> > > userspace, KVM, IOMMU subject to shareability attributes. if the
> >> >> > OK. So from the point of view of the rest of the kernel, those pfns are
> >> >> > not regarded as memory.
> >> >> >
> >> >> > > shareability changes, the users will get notified and will have to
> >> >> > > invalidate their mappings. guest_memfd will allow mmaping such ranges
> >> >> > > with VM_PFNMAP flag set by default in the VMAs to indicate the need of
> >> >> > > special handling/lack of page structs.
> >> >> > My concern is that a failable invalidation notifier may not be ideal.
> >> >> > Instead of relying on ref counts (or other mechanisms) to determine whether to
> >> >> > start shareability changes, with a failable invalidation notifier, some users
> >> >> > may fail the invalidation and the shareability change, even after other users
> >> >> > have successfully unmapped a range.
> >> >>
> >> >> Even if one user fails to invalidate its mappings, I don't see a
> >> >> reason to go ahead with shareability change. Shareability should not
> >> >> change unless all existing users let go of their soon-to-be-invalid
> >> >> view of memory.
> >>
> >> Hi Yan,
> >>
> >> While working on the 1G (aka HugeTLB) page support for guest_memfd
> >> series [1], we took into account conversion failures too. The steps are
> >> in kvm_gmem_convert_range(). (It might be easier to pull the entire
> >> series from GitHub [2] because the steps for conversion changed in two
> >> separate patches.)
> >>
> >> We do need to handle errors across ranges to be converted, possibly from
> >> different memslots. The goal is to either have the entire conversion
> >> happen (including page split/merge) or nothing at all when the ioctl
> >> returns.
> >>
> >> We try to undo the restructuring (whether split or merge) and undo any
> >> shareability changes on error (barring ENOMEM, in which case we leave a
> >> WARNing).
> > As the undo can fail (as in the case where you leave a WARNing, in patch 38 in
> > [1]), it can lead to WARNings in the kernel, with folios not being properly
> > added to the filemap.
> >
>
> I'm not sure how else to handle errors on rollback path. I've hopefully
> addressed this on the other thread at [1].
I'll reply to [1].
Please also check my reply and proposal at [2].
> >> The part we don't restore is the presence of the pages in the host or
> >> guest page tables. For that, our idea is that if unmapped, the next
> >> access will just map it in, so there's no issue there.
> >
> > I don't think so.
> >
> > As in patch 38 in [1], on failure, it may fail to
> > - restore the shareability
> > - restore the folio's filemap status
> > - restore the folio's hugetlb stash metadata
> > - restore the folio's merged/split status
> >
>
> The plan is that we try our best to restore shareability, filemap,
> restructuring (aka split/merge, including stash metadata) other than
> failures on rollback.
>
> > Also, the host page table is not restored.
> >
> >
>
> This is by design, the host page tables can be re-populated on the next
> fault. I've hopefully addressed this on the other thread at [1].
I don't think so. Please check my reply to [1].
> >> > My thinking is that:
> >> >
> >> > 1. guest_memfd starts shared-to-private conversion
> >> > 2. guest_memfd sends invalidation notifications
> >> > 2.1 invalidate notification --> A --> Unmap and return success
> >> > 2.2 invalidate notification --> B --> Unmap and return success
> >> > 2.3 invalidate notification --> C --> return failure
> >> > 3. guest_memfd finds 2.3 fails, fails shared-to-private conversion and keeps
> >> > shareability as shared
> >> >
> >> > Though the GFN remains shared after 3, it's unmapped in user A and B in 2.1 and
> >> > 2.2. Even if additional notifications could be sent to A and B to ask for
> >> > mapping the GFN back, the map operation might fail. Consequently, A and B might
> >> > not be able to restore the mapped status of the GFN.
> >>
> >> For conversion we don't attempt to restore mappings anywhere (whether in
> >> guest or host page tables). What do you think of not restoring the
> >> mappings?
> > It could cause problems if the mappings in the S-EPT can't be restored.
> >
> > For TDX private-to-shared conversion, if kvm_gmem_convert_should_proceed() -->
> > kvm_gmem_unmap_private() --> kvm_mmu_unmap_gfn_range() fails in the end, then
> > the GFN shareability is restored to private. The next guest access to
> > the partially unmapped private memory can meet a fatal error: "access before
> > acceptance".
> >
> > It could occur in such a scenario:
> > 1. TD issues a TDVMCALL_MAP_GPA to convert a private GFN to shared
> > 2. Conversion fails in KVM.
> > 3. set_memory_decrypted() fails in TD.
> > 4. TD thinks the GFN is still accepted as private and accesses it.
> >
> >
>
> This is true, I was thinking that this isn't handled solely in
> conversion but by being part of the contract between userspace VMM and
> the guest, that guest must handle conversion failures. I've hopefully
> addressed this on the other thread at [1].
>
> >> > For IOMMU mappings, this
> >> > could result in DMAR failure following a failed attempt to do shared-to-private
> >> > conversion.
> >>
> >> I believe the current conversion setup guards against this because after
> >> unmapping from the host, we check for any unexpected refcounts.
> > Right, it's fine if we check for any unexpected refcounts.
> >
> >
> >> (This unmapping is not the unmapping we're concerned about, since this is
> >> shared memory, and unmapping doesn't go through TDX.)
> >>
> >> Coming back to the refcounts, if the IOMMU had mappings, these refcounts
> >> are "unexpected". The conversion ioctl will return to userspace with an
> >> error.
> >>
> >> IO can continue to happen, since the memory is still mapped in the
> >> IOMMU. The memory state is still shared. No issue there.
> >>
> >> In RFCv2 [1], we expect userspace to see the error, then try and remove
> >> the memory from the IOMMU, and then try conversion again.
> > I don't think it's right to depend on userspace always behaving in the way the
> > kernel expects, i.e. retrying the conversion until it succeeds.
> >
>
> Let me think more deeply about this. Please let me know if there's
> anything I missed.
>
> It is true that a buggy or malicious userspace VMM can ignore conversion
> failures and report success to the guest, but if both the userspace VMM
> and guest are malicious, it's quite hard for the kernel to defend
> against that.
Hmm, doesn't expecting userspace to retry the conversion endlessly exceed what is
reasonable to ask of a cooperative userspace?
> I think as long as there's no point where the guest can crash the host
> in a fixed way, I think it is okay to rely on a userspace VMM and guest
> protocol.
>
> IIUC the guest can crash the host (original point of having guest_memfd)
> if the guest can convince the host to write to private memory. For that
How?
Unless the host kernel wants to crash itself, I don't think allowing the guest to
crash the host is acceptable.
If you happen to know of such a path, please let us know and we'll fix it.
> to happen, the memory must be faulted into the Secure EPTs, and the
> shareability state must be ALL for the host to fault it in.
>
> So to have this issue, the conversion failure must be such that the
> memory remains faulted into the Secure EPTs while shareability is
> shared. Since unmapping from secure EPTs happens pretty early before any
> shareability is changed or any rollback (and rollback failures) can
> happen, I think we should be quite safe?
It's not safe if unmapping from the secure EPT fails but the shareability is
changed to shared anyway.
> If unmapping of private memory fails, this is where I think guest_memfd
> should get an error from the unmap and it should not proceed to change
> shareability.
Please check if my proposal at [2] is agreeable.
>
> > We need to restore to the previous status (which includes the host page table)
> > if conversion can't be done.
>
> Most of the previous status (shareability, filemap,
> restructuring (aka split/merge, including stash metadata)) are restored
> other than during rollback failures.
However, errors during the rollback are unacceptable.
> As for presence in host page tables, is it okay to defer that till the
> next fault, and if not okay, why not?
If the host page tables involve only shared mappings in the primary MMU
and shared EPT, it's ok.
> For presence in guest page tables, is it okay to fall back on the
> protocol where the guest must handle conversion failures, and if not
> okay, why not?
Hmm, whether or not to roll back the guest page tables after a conversion failure
is the guest OS's business.
However, KVM can't rely on the guest assuming that the page state is shared even
after a private-to-shared conversion failure.
> > That said, in my view, a better flow would be:
> >
> > 1. guest_memfd sends a pre-invalidation request to users (users here means the
> > consumers in kernel of memory allocated from guest_memfd).
> >
> > 2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
> > proceed. For example, in the case of TDX, this might involve memory
> > allocation and page splitting.
> >
> > 3. Based on the pre-check results, guest_memfd either aborts the invalidation or
> > proceeds by sending the actual invalidation request.
> >
> > 4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
> > TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
> >    In such cases, TDX can call back into guest_memfd to report the poison
> >    status of the page, or elevate the page reference count.
> >
> > 5. guest_memfd completes the invalidation process. If the memory is marked as
> > "poison," guest_memfd can handle it accordingly. If the page has an elevated
> > reference count, guest_memfd may not need to take special action, as the
> > elevated count prevents the OS from reallocating the page.
> > (but from your reply below, seems a callback to guest_memfd is a better
> > approach).
> >
> >
>
> Thanks for this, I've tried to combine this into my response at
> [1]. I think this works, but it's hard because
>
> a. Pre-checks are hard to check (explained at [1])
Please check if the pre-checks in my POC [2] look good.
I tested it for the case of a TDX unmapping failure: it does not change the
shareability if splitting or zapping fails.
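Roughly, the intended ordering is as below. This is heavily simplified pseudo-C
just to show the control flow; the helper names (gmem_precheck_invalidate() and
friends) are illustrative, not the actual functions in [2]:

static int gmem_convert_private_to_shared(struct gmem_range *range)
{
        int ret;

        /*
         * Pre-check phase: do everything that can fail (memory allocation,
         * S-EPT page splitting) before touching any shareability state.
         */
        ret = gmem_precheck_invalidate(range);
        if (ret)
                return ret;             /* shareability unchanged */

        /*
         * Actual unmap: expected not to fail.  A failure here indicates a
         * bug in KVM or the TDX module, in which case the pages are leaked
         * rather than handed back to the allocator.
         */
        ret = gmem_unmap_private(range);
        if (ret)
                return ret;             /* shareability still unchanged */

        /* Only now flip the range to shared. */
        gmem_set_shareability_shared(range);
        return 0;
}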
> b. Even after all the checks, unmapping can still fail, and those still
> have to be handled, and to handle those, we have to buy into the
> userspace VMM/guest protocol, so why not just buy into the protocol
> to start with?
In my POC [2], the outcome of an unmapping failure is to leak the pages.
Please check if it looks good to you.
> [1] https://lore.kernel.org/all/diqztt4uhunj.fsf@ackerleytng-ctop.c.googlers.com/
[2] https://lore.kernel.org/all/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com/
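To be explicit, "leak" here only means taking an extra reference so the page can
never be reallocated while the TDX module might still be tracking it. A minimal
sketch (folio_get() is the real API; the wrapper name is made up):

static void gmem_leak_folio_on_unmap_failure(struct folio *folio)
{
        /*
         * The TDX module may still reference this page in the S-EPT or in
         * its internal tracking structures, so pin it for the lifetime of
         * the host rather than risk the allocator handing it out again.
         */
        folio_get(folio);
}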
> >> The part in concern here is unmapping failures of private pages, for
> >> private-to-shared conversions, since that part goes through TDX and
> >> might fail.
> > IMO, even for TDX, the real unmap must not fail unless there are bugs in the KVM
> > or TDX module.
> > So, for page splitting in S-EPT, I prefer to try splitting in the
> > pre-invalidation phase before conducting any real unmap.
> >
> >
>
> Thanks for your detailed suggestion.
>
> >> One other thing about taking refcounts is that in RFCv2,
> >> private-to-shared conversions assume that there are no refcounts on the
> >> private pages at all. (See filemap_remove_folio_for_restructuring() in
> >> [3])
> >>
> >> Haven't had a chance to think about all the edge cases, but for now I
> >> think on unmapping failure, in addition to taking a refcount, we should
> >> return an error at least up to guest_memfd, so that guest_memfd could
> >> perhaps keep the refcount on that page, but drop the page from the
> >> filemap. Another option could be to track messed up addresses and always
> >> check that on conversion or something - not sure yet.
> >
> > It looks good to me. See the bullet 4 in my proposed flow above.
> >
>
> Thanks again for your detailed suggestion.
>
> >> Either way, guest_memfd must know. If guest_memfd is not informed, on a
> >> next conversion request, the conversion will just spin in
> >> filemap_remove_folio_for_restructuring().
> > It makes sense.
> >
> >
> >> What do you think of this part about informing guest_memfd of the
> >> failure to unmap?
> > So, do you want to add a guest_memfd callback to achieve this purpose?
> >
>
> I will need to think the entire thing through, but I meant informing as
> in returning an error to guest_memfd so that guest_memfd knows. I think
> returning an error should be the first course of action.
>
> As for whether guest_memfd should know how to handle the error or
> whether the userspace VMM should participate in deciding what to do with
> the error, I'm not sure. If you have suggestions on this, I hope we can
> combine the suggestions about the conversion protocol on the other thread.
>
> Regarding a callback, are you thinking something like not having the
> unmap return an error, but instead TDX will call a function like
> kvm_gmem_error_at_offset(loff_t offset), and guest_memfd will then
> record that somewhere, and then immediately after calling unmap
> guest_memfd will check kvm_gmem_was_there_an_error_in_range() and then
> determining whether there's an error? Something like that?
>
> I guess it could work but feels a little odd.
>
> >
> > BTW, here's an analysis of why we can't let kvm_mmu_unmap_gfn_range()
> > and mmu_notifier_invalidate_range_start() fail, based on the repo
> > https://github.com/torvalds/linux.git, commit cd2e103d57e5 ("Merge tag
> > 'hardening-v6.16-rc1-fix1-take2' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux")
>
> Thank you, I appreciate the effort you took to enumerate these. The
> following suggestions are based on my current understanding. I don't
> have time in the near future to do the plumbing to test out the
> suggestion, but for now I want to see if this suggestion makes sense,
> maybe you can correct any misunderstandings first.
Sorry, I realize that the enumeration below caused some confusion.
I listed these cases to show that the kernel does not expect unmapping to fail.
Please kindly let me know if any existing kernel code allows an unmap to fail.
> > 1. Status of mmu notifier
> > -------------------------------
> > (1) There're 34 direct callers of mmu_notifier_invalidate_range_start().
> > 1. clear_refs_write
> > 2. do_pagemap_scan
> > 3. uprobe_write_opcode
> > 4. do_huge_zero_wp_pmd
> > 5. __split_huge_pmd (N)
> > 6. __split_huge_pud (N)
> > 7. move_pages_huge_pmd
> > 8. copy_hugetlb_page_range
> > 9. hugetlb_unshare_pmds (N)
> > 10. hugetlb_change_protection
> > 11. hugetlb_wp
> > 12. unmap_hugepage_range (N)
> > 13. move_hugetlb_page_tables
> > 14. collapse_huge_page
> > 15. retract_page_tables
> > 16. collapse_pte_mapped_thp
> > 17. write_protect_page
> > 18. replace_page
> > 19. madvise_free_single_vma
> > 20. wp_clean_pre_vma
> > 21. wp_page_copy
> > 22. zap_page_range_single_batched (N)
> > 23. unmap_vmas (N)
> > 24. copy_page_range
> > 25. remove_device_exclusive_entry
> > 26. migrate_vma_collect
> > 27. __migrate_device_pages
> > 28. change_pud_range
> > 29. move_page_tables
> > 30. page_vma_mkclean_one
> > 31. try_to_unmap_one
> > 32. try_to_migrate_one
> > 33. make_device_exclusive
> > 34. move_pages_pte
> >
> > Of these 34 direct callers, those marked with (N) cannot tolerate
> > mmu_notifier_invalidate_range_start() failing. I have not yet investigated all
> > 34 direct callers one by one, so the list of (N) is incomplete.
> >
> > For 5. __split_huge_pmd(), Documentation/mm/transhuge.rst says:
> > "Note that split_huge_pmd() doesn't have any limitations on refcounting:
> > pmd can be split at any point and never fails." This is because split_huge_pmd()
> > serves as a graceful fallback design for code walking pagetables but unaware
> > about huge pmds.
> >
> >
> Do these callers, especially those with (N), ever try to unmap any TDX
> private pages? guest_memfd only gives shared pages to core-mm, so for
> shared pages, there will continue to be no chance of errors.
>
> If we change mmu_notifier_invalidate_range_start() to return an int, all
> of the callers that never invalidate shared pages can continue to safely
> rely on the fact that mmu_notifier_invalidate_range_start() will return
> 0.
mmu_notifier_invalidate_range_start() is only used to zap shared pages.
> For the callers of mmu_notifier_invalidate_range_start() that may touch
> private pages, I believe that's only guest_memfd and KVM. That's where
> we want the error, and will handle the error.
>
> Another point here is that I was thinking to put EPT splitting together
> with actual unmapping instead of with invalidation because we will
> probably invalidate more than we unmap (see explanation at [1] about the
> race). Maybe moving EPT splitting to unmap could help?
>
> > (2) There's 1 direct caller of mmu_notifier_invalidate_range_start_nonblock(),
> > __oom_reap_task_mm(), which only expects the error -EAGAIN.
> >
> > In mn_hlist_invalidate_range_start():
> > "WARN_ON(mmu_notifier_range_blockable(range) || _ret != -EAGAIN);"
> >
> >
> > (3) For DMAs, drivers need to invoke pin_user_pages() to pin memory. In that
> > case, they don't need to register mmu notifier.
> >
> > Or, device drivers can pin pages via get_user_pages*(), and register for mmu
> > notifier callbacks for the memory range. Then, upon receiving a notifier
> > "invalidate range" callback , stop the device from using the range, and unpin
> > the pages.
> >
> > See Documentation/core-api/pin_user_pages.rst.
> >
> >
>
> Do you mean that we should teach device drivers to get callbacks for
> private pages? Are you looking ahead to handle TDX IO on private pages?
> So far we haven't handled that yet.
I was trying to show that a device driver increases the page refcount (by pinning)
when it maps a page into the IOMMU page table, and does not decrease the refcount
(by unpinning) until after unmapping.
If a page held by the device driver was allocated from hugetlb, and that page has
since been truncated from the hugetlb file, it is still held by the device driver
until it is unmapped from the IOMMU page table.
This is similar to TDX: as long as a page is still mapped in the S-EPT or tracked
by the TDX module, it's better to hold a page refcount even after the page is
truncated from the file mapping.
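To make the analogy concrete, below is a minimal sketch of that driver-side
pattern. It is illustrative only and not code from this series;
pin_user_pages_fast()/unpin_user_pages() are the real GUP APIs, while
drv_map_for_dma()/drv_unmap_for_dma() are made-up names.

#include <linux/mm.h>

static int drv_map_for_dma(unsigned long uaddr, int nr_pages,
                           struct page **pages)
{
        int pinned;

        /* Take a refcount on each page so it can't be freed or migrated. */
        pinned = pin_user_pages_fast(uaddr, nr_pages,
                                     FOLL_WRITE | FOLL_LONGTERM, pages);
        if (pinned != nr_pages) {
                if (pinned > 0)
                        unpin_user_pages(pages, pinned);
                return -EFAULT;
        }

        /* ... program the pinned pages into the IOMMU page table ... */
        return 0;
}

static void drv_unmap_for_dma(struct page **pages, int nr_pages)
{
        /* ... tear down the IOMMU mappings first ... */

        /* Only then drop the refcounts taken at pin time. */
        unpin_user_pages(pages, nr_pages);
}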
> > 2. Cases that cannot tolerate failure of mmu_notifier_invalidate_range_start()
> > -------------------------------
> > (1) Error fallback cases.
> >
> > 1. split_huge_pmd() as mentioned in Documentation/mm/transhuge.rst.
> > split_huge_pmd() is designed as a graceful fallback without failure.
> >
> > split_huge_pmd
> > |->__split_huge_pmd
> > |->mmu_notifier_range_init
> > | mmu_notifier_invalidate_range_start
> > | split_huge_pmd_locked
> > | mmu_notifier_invalidate_range_end
> >
> >
> > 2. in fs/iomap/buffered-io.c, iomap_write_failed() itself is error handling.
> > iomap_write_failed
> > |->truncate_pagecache_range
> > |->unmap_mapping_range
> > | |->unmap_mapping_pages
> > | |->unmap_mapping_range_tree
> > | |->unmap_mapping_range_vma
> > | |->zap_page_range_single
> > | |->zap_page_range_single_batched
> > | |->mmu_notifier_range_init
> > | | mmu_notifier_invalidate_range_start
> > | | unmap_single_vma
> > | | mmu_notifier_invalidate_range_end
> > |->truncate_inode_pages_range
> > |->truncate_cleanup_folio
> > |->if (folio_mapped(folio))
> > | unmap_mapping_folio(folio);
> > |->unmap_mapping_range_tree
> > |->unmap_mapping_range_vma
> > |->zap_page_range_single
> > |->zap_page_range_single_batched
> > |->mmu_notifier_range_init
> > | mmu_notifier_invalidate_range_start
> > | unmap_single_vma
> > | mmu_notifier_invalidate_range_end
> >
> > 3. in mm/memory.c, zap_page_range_single() is invoked to handle error.
> > remap_pfn_range_notrack
> > |->int error = remap_pfn_range_internal(vma, addr, pfn, size, prot);
> > | if (!error)
> > | return 0;
> > | zap_page_range_single
> > |->zap_page_range_single_batched
> > |->mmu_notifier_range_init
> > | mmu_notifier_invalidate_range_start
> > | unmap_single_vma
> > | mmu_notifier_invalidate_range_end
> >
> > 4. in kernel/events/core.c, zap_page_range_single() is invoked to clear any
> > partial mappings on error.
> >
> > perf_mmap
> > |->ret = map_range(rb, vma);
> > | err = remap_pfn_range
> > |->if (err)
> > | zap_page_range_single
> > |->zap_page_range_single_batched
> > |->mmu_notifier_range_init
> > | mmu_notifier_invalidate_range_start
> > | unmap_single_vma
> > | mmu_notifier_invalidate_range_end
> >
> >
> > 5. in mm/memory.c, unmap_mapping_folio() is invoked to unmap a poisoned page.
> >
> > __do_fault
> > |->if (unlikely(PageHWPoison(vmf->page))) {
> > | vm_fault_t poisonret = VM_FAULT_HWPOISON;
> > | if (ret & VM_FAULT_LOCKED) {
> > | if (page_mapped(vmf->page))
> > | unmap_mapping_folio(folio);
> > | |->unmap_mapping_range_tree
> > | |->unmap_mapping_range_vma
> > | |->zap_page_range_single
> > | |->zap_page_range_single_batched
> > | |->mmu_notifier_range_init
> > | | mmu_notifier_invalidate_range_start
> > | | unmap_single_vma
> > | | mmu_notifier_invalidate_range_end
> > | if (mapping_evict_folio(folio->mapping, folio))
> > | poisonret = VM_FAULT_NOPAGE;
> > | folio_unlock(folio);
> > | }
> > | folio_put(folio);
> > | vmf->page = NULL;
> > | return poisonret;
> > | }
> >
> >
> > 6. in mm/vma.c, in __mmap_region(), unmap_region() is invoked to undo any
> > partial mapping done by a device driver.
> >
> > __mmap_new_vma
> > |->__mmap_new_file_vma(map, vma);
> > |->error = mmap_file(vma->vm_file, vma);
> > | if (error)
> > | unmap_region
> > |->unmap_vmas
> > |->mmu_notifier_range_init
> > | mmu_notifier_invalidate_range_start
> > | unmap_single_vma
> > | mmu_notifier_invalidate_range_end
> >
> >
>
> These should probably not ever be invalidating or unmapping private pages.
>
> > (2) No-fail cases
> > -------------------------------
> > 1. iput() cannot fail.
> >
> > iput
> > |->iput_final
> > |->WRITE_ONCE(inode->i_state, state | I_FREEING);
> > | inode_lru_list_del(inode);
> > | evict(inode);
> > |->op->evict_inode(inode);
> > |->shmem_evict_inode
> > |->shmem_truncate_range
> > |->truncate_inode_pages_range
> > |->truncate_cleanup_folio
> > |->if (folio_mapped(folio))
> > | unmap_mapping_folio(folio);
> > |->unmap_mapping_range_tree
> > |->unmap_mapping_range_vma
> > |->zap_page_range_single
> > |->zap_page_range_single_batched
> > |->mmu_notifier_range_init
> > | mmu_notifier_invalidate_range_start
> > | unmap_single_vma
> > | mmu_notifier_invalidate_range_end
> >
> >
> > 2. exit_mmap() cannot fail
> >
> > exit_mmap
> > |->mmu_notifier_release(mm);
> > |->unmap_vmas(&tlb, &vmi.mas, vma, 0, ULONG_MAX, ULONG_MAX, false);
> > |->mmu_notifier_range_init
> > | mmu_notifier_invalidate_range_start
> > | unmap_single_vma
> > | mmu_notifier_invalidate_range_end
> >
> >
>
> These should probably not ever be invalidating or unmapping private pages.
>
> > 3. KVM Cases That Cannot Tolerate Unmap Failure
> > -------------------------------
> > Allowing unmap operations to fail in the following scenarios would make it very
> > difficult or even impossible to handle the failure:
> >
> > (1) __kvm_mmu_get_shadow_page() is designed to reliably obtain a shadow page
> > without expecting any failure.
> >
> > mmu_alloc_direct_roots
> > |->mmu_alloc_root
> > |->kvm_mmu_get_shadow_page
> > |->__kvm_mmu_get_shadow_page
> > |->kvm_mmu_alloc_shadow_page
> > |->account_shadowed
> > |->kvm_mmu_slot_gfn_write_protect
> > |->kvm_tdp_mmu_write_protect_gfn
> > |->write_protect_gfn
> > |->tdp_mmu_iter_set_spte
> >
> >
>
> I need to learn more about shadow pages but IIUC TDX doesn't use shadow
> pages so this path won't interact with unmapping private pages.
>
> > (2) kvm_vfio_release() and kvm_vfio_file_del() cannot fail
> >
> > kvm_vfio_release/kvm_vfio_file_del
> > |->kvm_vfio_update_coherency
> > |->kvm_arch_unregister_noncoherent_dma
> > |->kvm_noncoherent_dma_assignment_start_or_stop
> > |->kvm_zap_gfn_range
> > |->kvm_tdp_mmu_zap_leafs
> > |->tdp_mmu_zap_leafs
> > |->tdp_mmu_iter_set_spte
> >
> >
>
> I need to learn more about VFIO but for now IIUC IO uses shared pages,
> so this path won't interact with unmapping private pages.
>
> > (3) There are lots of callers of __kvm_set_or_clear_apicv_inhibit() that
> > currently never expect unmap to fail.
> >
> > __kvm_set_or_clear_apicv_inhibit
> > |->kvm_zap_gfn_range
> > |->kvm_tdp_mmu_zap_leafs
> > |->tdp_mmu_zap_leafs
> > |->tdp_mmu_iter_set_spte
> >
> >
> >
>
> There could be some TDX specific things such that TDX doesn't use this
> path.
tdp_mmu_iter_set_spte() is used by KVM generally to update an SPTE while
kvm->mmu_lock is held for write.
TDX uses tdp_mmu_iter_set_spte() to further unmap the S-EPT.
Converting tdp_mmu_iter_set_spte() to return an error would be a huge amount of
work, and I don't think it's right or worthwhile.
>
> > 4. Cases in KVM where it's hard to make tdp_mmu_set_spte() (update SPTE with
> > write mmu_lock) failable.
> >
> > (1) kvm_vcpu_flush_tlb_guest()
> >
> > kvm_vcpu_flush_tlb_guest
> > |->kvm_mmu_sync_roots
> > |->mmu_sync_children
> > |->kvm_vcpu_write_protect_gfn
> > |->kvm_mmu_slot_gfn_write_protect
> > |->kvm_tdp_mmu_write_protect_gfn
> > |->write_protect_gfn
> > |->tdp_mmu_iter_set_spte
> > |->tdp_mmu_set_spte
> >
> >
> > (2) handle_removed_pt() and handle_changed_spte().
> >
>
> Thank you so much for looking into these, I'm hoping that the number of
> cases where TDX and private pages are unmapped are really limited to a
> few paths that we have to rework.
>
> If we agree that the error has to be handled, then regardless of how we
> let the caller know that an error happened, all paths touching TDX
> private pages have to be reworked.
>
> Between (1) returning an error vs (2) marking error and having the
> caller check for errors, then it's probably better to use the standard
> approach of returning an error since it is better understood, and
> there's no need to have extra data structures?
However, I don't think returning an error from the unmap path is a standard
approach...
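For comparison, if we went with the "record and check" idea you described above,
the interface could look roughly like the below. Everything here is hypothetical
(kvm_gmem_error_at_offset(), the tracker structure and its placement in
inode->i_private); the point is only that no error has to propagate through zap
helpers such as tdp_mmu_iter_set_spte():

#include <linux/fs.h>
#include <linux/xarray.h>

/* Hypothetical per-file tracking structure, stored in inode->i_private. */
struct gmem_error_tracker {
        struct xarray error_offsets;
};

/* Called by TDX when it fails to remove the private page at @offset. */
void kvm_gmem_error_at_offset(struct inode *inode, loff_t offset)
{
        struct gmem_error_tracker *t = inode->i_private;

        /*
         * Record that this index is in an unknown state.  A real
         * implementation would have to mind the allocation context (the
         * mmu_lock may be held), e.g. by preallocating entries.
         */
        xa_store(&t->error_offsets, offset >> PAGE_SHIFT,
                 xa_mk_value(1), GFP_ATOMIC);
}

/* Checked by guest_memfd right after the unmap, before changing shareability. */
bool kvm_gmem_range_has_error(struct inode *inode, pgoff_t start, pgoff_t end)
{
        struct gmem_error_tracker *t = inode->i_private;
        unsigned long index = start;

        return xa_find(&t->error_offsets, &index, end - 1, XA_PRESENT) != NULL;
}

With something like this, the zap paths themselves would not need to grow new
return values.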