Message-ID: <49c337d247940e8bd3920e5723c2fa710cd0dd83.camel@intel.com>
Date: Fri, 29 Aug 2025 19:53:24 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "pbonzini@...hat.com" <pbonzini@...hat.com>, "seanjc@...gle.com"
<seanjc@...gle.com>
CC: "Huang, Kai" <kai.huang@...el.com>, "ackerleytng@...gle.com"
<ackerleytng@...gle.com>, "Annapurve, Vishal" <vannapurve@...gle.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "Zhao, Yan Y"
<yan.y.zhao@...el.com>, "Weiny, Ira" <ira.weiny@...el.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "michael.roth@....com"
<michael.roth@....com>
Subject: Re: [RFC PATCH v2 05/18] KVM: TDX: Drop superfluous page pinning in
S-EPT management

On Thu, 2025-08-28 at 17:06 -0700, Sean Christopherson wrote:
> From: Yan Zhao <yan.y.zhao@...el.com>
>
> Don't explicitly pin pages when mapping pages into the S-EPT, guest_memfd
> doesn't support page migration in any capacity, i.e. there are no migrate
> callbacks because guest_memfd pages *can't* be migrated. See the WARN in
> kvm_gmem_migrate_folio().
>
> Eliminating TDX's explicit pinning will also enable guest_memfd to support
> in-place conversion between shared and private memory[1][2]. Because KVM
> cannot distinguish between speculative/transient refcounts and the
> intentional refcount for TDX on private pages[3], failing to release
> private page refcount in TDX could cause guest_memfd to indefinitely wait
> on decreasing the refcount for the splitting.
>
> Under normal conditions, not holding an extra page refcount in TDX is safe
> because guest_memfd ensures pages are retained until its invalidation
> notification to KVM MMU is completed. However, if there are bugs in KVM or the
> TDX module, not holding an extra refcount when a page is mapped in S-EPT could
> result in a page being released from guest_memfd while still mapped in the
> S-EPT. But, doing work to make a fatal error slightly less fatal is a net
> negative when that extra work adds complexity and confusion.
>
> Several approaches were considered to address the refcount issue, including
> - Attempting to modify the KVM unmap operation to return a failure,
> which was deemed too complex and potentially incorrect[4].
> - Increasing the folio reference count only upon S-EPT zapping failure[5].
> - Using page flags or page_ext to indicate a page is still used by TDX[6],
> which does not work for HVO (HugeTLB Vmemmap Optimization).
> - Setting HWPOISON bit or leveraging folio_set_hugetlb_hwpoison()[7].
>
> Due to the complexity or inappropriateness of these approaches, and the
> fact that S-EPT zapping failure is currently only possible when there are
> bugs in the KVM or TDX module, which is very rare in a production kernel,
> a straightforward approach of simply not holding the page reference count
> in TDX was chosen[8].
>
> When S-EPT zapping errors occur, KVM_BUG_ON() is invoked to kick off all
> vCPUs and mark the VM as dead. Although there is a potential window that a
> private page mapped in the S-EPT could be reallocated and used outside the
> VM, the loud warning from KVM_BUG_ON() should provide sufficient debug
> information.
>

Yea, in the case of a bug, there could be a use-after-free. This logic applies
to all code that has allocations including the entire KVM MMU. But in this case,
we can actually catch the use-after-free scenario under scrutiny and not have it
happen silently, which does not apply to all code. But the special case here is
that the use-after-free depends on TDX module logic which is not part of the
kernel.

Yan, can you clarify what you mean by "there could be a small window"? I'm
thinking this is a hypothetical window around vm_dead races? Or more concrete? I
*don't* want to re-open the debate on whether to go with this approach, but I
think this is a good teaching edge case to settle on how we want to treat
similar issues. So I just want to make sure we have the justification right.

> To be robust against bugs, the user can enable panic_on_warn
> as normal.
>
> Link: https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com [1]
> Link: https://youtu.be/UnBKahkAon4 [2]
> Link: https://lore.kernel.org/all/CAGtprH_ypohFy9TOJ8Emm_roT4XbQUtLKZNFcM6Fr+fhTFkE0Q@mail.gmail.com [3]
> Link: https://lore.kernel.org/all/aEEEJbTzlncbRaRA@yzhao56-desk.sh.intel.com [4]
> Link: https://lore.kernel.org/all/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com [5]
> Link: https://lore.kernel.org/all/aFkeBtuNBN1RrDAJ@yzhao56-desk.sh.intel.com [6]
> Link: https://lore.kernel.org/all/diqzy0tikran.fsf@ackerleytng-ctop.c.googlers.com [7]
> Link: https://lore.kernel.org/all/53ea5239f8ef9d8df9af593647243c10435fd219.camel@intel.com [8]
> Suggested-by: Vishal Annapurve <vannapurve@...gle.com>
> Suggested-by: Ackerley Tng <ackerleytng@...gle.com>
> Suggested-by: Rick Edgecombe <rick.p.edgecombe@...el.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
> Reviewed-by: Ira Weiny <ira.weiny@...el.com>
> Reviewed-by: Kai Huang <kai.huang@...el.com>
> [sean: extract out of hugepage series, massage changelog accordingly]
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> ---

Discussion aside, Reviewed-by: Rick Edgecombe <rick.p.edgecombe@...el.com>