Message-ID: <aRbYxOIWosU7RF1K@yzhao56-desk.sh.intel.com>
Date: Fri, 14 Nov 2025 15:22:44 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Huang, Kai" <kai.huang@...el.com>
CC: "pbonzini@...hat.com" <pbonzini@...hat.com>, "seanjc@...gle.com"
<seanjc@...gle.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Li,
Xiaoyao" <xiaoyao.li@...el.com>, "Du, Fan" <fan.du@...el.com>, "Hansen, Dave"
<dave.hansen@...el.com>, "david@...hat.com" <david@...hat.com>,
"thomas.lendacky@....com" <thomas.lendacky@....com>, "vbabka@...e.cz"
<vbabka@...e.cz>, "tabba@...gle.com" <tabba@...gle.com>, "kas@...nel.org"
<kas@...nel.org>, "michael.roth@....com" <michael.roth@....com>, "Weiny, Ira"
<ira.weiny@...el.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "binbin.wu@...ux.intel.com"
<binbin.wu@...ux.intel.com>, "ackerleytng@...gle.com"
<ackerleytng@...gle.com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>,
"Peng, Chao P" <chao.p.peng@...el.com>, "Annapurve, Vishal"
<vannapurve@...gle.com>, "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
"Miao, Jun" <jun.miao@...el.com>, "x86@...nel.org" <x86@...nel.org>,
"pgonda@...gle.com" <pgonda@...gle.com>
Subject: Re: [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings
if a VMExit carries level info
On Tue, Nov 11, 2025 at 07:05:28PM +0800, Huang, Kai wrote:
> On Thu, 2025-08-07 at 17:44 +0800, Yan Zhao wrote:
> > TDX requires guests to accept S-EPT mappings created by the host KVM. Due
> > to the current implementation of the TDX module, if a guest accepts a GFN
> > at a lower level after KVM maps it at a higher level, the TDX module will
> > emulate an EPT violation VMExit to KVM instead of returning a size mismatch
> > error to the guest. If KVM fails to perform page splitting in the VMExit
> > handler, the guest's accept operation will be triggered again upon
> > re-entering the guest, causing a repeated EPT violation VMExit.
> >
> > The TDX module thus enables the EPT violation VMExit to carry the guest's
> > accept level when the VMExit is caused by the guest's accept operation.
> >
> > Therefore, in TDX's EPT violation handler
> > (1) Set the guest inhibit bit in the lpage info to prevent KVM MMU core
> > from mapping at a higher level than the guest's accept level.
> >
> > (2) Split any existing huge mapping at the fault GFN to avoid unsupported
> > splitting under the shared mmu_lock by TDX.
> >
> > Use write mmu_lock to protect (1) and (2) for now. If future KVM TDX can
> > perform the actual splitting under shared mmu_lock with enhanced TDX
> > modules, (1) could be called under shared mmu_lock, and (2) would
> > become unnecessary.
> >
> > As an optimization, this patch calls hugepage_test_guest_inhibit() without
> > holding the mmu_lock to reduce the frequency of acquiring the write
> > mmu_lock. The write mmu_lock is thus only acquired if the guest inhibit bit
> > is not already set. This is safe because the guest inhibit bit is set in a
> > one-way manner while the splitting under the write mmu_lock is performed
> > before setting the guest inhibit bit.
> >
> > Link: https://lore.kernel.org/all/a6ffe23fb97e64109f512fa43e9f6405236ed40a.camel@intel.com
> > Suggested-by: Rick Edgecombe <rick.p.edgecombe@...el.com>
> > Suggested-by: Sean Christopherson <seanjc@...gle.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
> > ---
> > RFC v2
> > - Change tdx_get_accept_level() to tdx_check_accept_level().
> > - Invoke kvm_split_cross_boundary_leafs() and hugepage_set_guest_inhibit()
> > to change KVM mapping level in a global way according to guest accept
> > level. (Rick, Sean).
> >
> > RFC v1:
> > - Introduce tdx_get_accept_level() to get guest accept level.
> > - Use tdx->violation_request_level and tdx->violation_gfn* to pass guest
> > accept level to tdx_gmem_private_max_mapping_level() to determine KVM
> > mapping level.
> > ---
> > arch/x86/kvm/vmx/tdx.c | 50 +++++++++++++++++++++++++++++++++++++
> > arch/x86/kvm/vmx/tdx_arch.h | 3 +++
> > 2 files changed, 53 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 035d81275be4..71115058e5e6 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -2019,6 +2019,53 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
> > return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
> > }
> >
> > +static inline int tdx_check_accept_level(struct kvm_vcpu *vcpu, gfn_t gfn)
> > +{
> > + struct kvm_memory_slot *slot = gfn_to_memslot(vcpu->kvm, gfn);
> > + struct vcpu_tdx *tdx = to_tdx(vcpu);
> > + struct kvm *kvm = vcpu->kvm;
> > + u64 eeq_type, eeq_info;
> > + int level = -1;
> > +
> > + if (!slot)
> > + return 0;
> > +
> > + eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
> > + if (eeq_type != TDX_EXT_EXIT_QUAL_TYPE_ACCEPT)
> > + return 0;
> > +
> > + eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> > + TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> > +
> > + level = (eeq_info & GENMASK(2, 0)) + 1;
> > +
> > + if (level == PG_LEVEL_4K || level == PG_LEVEL_2M) {
> > + if (!hugepage_test_guest_inhibit(slot, gfn, level + 1)) {
> > + gfn_t base_gfn = gfn_round_for_level(gfn, level);
> > + struct kvm_gfn_range gfn_range = {
> > + .start = base_gfn,
> > + .end = base_gfn + KVM_PAGES_PER_HPAGE(level),
> > + .slot = slot,
> > + .may_block = true,
> > + .attr_filter = KVM_FILTER_PRIVATE,
> > + };
> > +
> > + scoped_guard(write_lock, &kvm->mmu_lock) {
> > + int ret;
> > +
> > + ret = kvm_split_cross_boundary_leafs(kvm, &gfn_range, false);
> > + if (ret)
> > + return ret;
> > +
> > + hugepage_set_guest_inhibit(slot, gfn, level + 1);
> > + if (level == PG_LEVEL_4K)
> > + hugepage_set_guest_inhibit(slot, gfn, level + 2);
> > + }
> > + }
> > + }
>
> Could you also clarify what the current behaviour is when the exit
> doesn't have any level information?
An EPT violation exit seen by KVM for TDs is emulated by the TDX module, which
gives the VMM more detailed info through the exit's extended exit qualification.

If the EPT violation exit is emulated due to the guest's ACCEPT operation, the
extended exit qualification is of type TDX_EXT_EXIT_QUAL_TYPE_ACCEPT. Since an
ACCEPT operation must provide a valid level (otherwise, the TDX module just
fails the guest's ACCEPT without exiting to the VMM), the extended exit
qualification info must carry a valid level too: either PG_LEVEL_4K or
PG_LEVEL_2M.

So, if KVM sees an exit with no level info, the extended exit qualification is
not of type TDX_EXT_EXIT_QUAL_TYPE_ACCEPT in the first place. It could be of
type NONE or type PENDING_EPT_VIOLATION, depending on whether the guest is
configured with pending_ve_disable or whether the GPA is private. This kind of
exit is caused by the guest accessing memory without first accepting it.
> Will 'level == PG_LEVEL_4K' hold in this case? Or will this function return
> early right after checking the eeq_type?
The function will return early right after checking the eeq_type.
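Roughly, the decode described above can be sketched as standalone user-space C.
This is not the patch code: the EEQ_* names, the type values and the bit
positions below are assumptions made up for illustration (the real definitions
are the TDX_EXT_EXIT_QUAL_* macros in tdx_arch.h and may differ). Only the
"low 3 bits of the info field, plus 1" step and the PG_LEVEL_* numbering are
taken from the quoted hunk and the kernel's enum pg_level.

```c
#include <stdint.h>

/*
 * Hypothetical layout, for illustration only. The real masks and type
 * values are defined in arch/x86/kvm/vmx/tdx_arch.h.
 */
#define EEQ_TYPE_MASK   0xfULL                        /* assumed: bits 3:0 */
#define EEQ_INFO_SHIFT  32                            /* assumed: bits 63:32 */
#define EEQ_INFO_MASK   (0xffffffffULL << EEQ_INFO_SHIFT)

#define EEQ_TYPE_NONE                   0ULL          /* assumed value */
#define EEQ_TYPE_ACCEPT                 1ULL          /* assumed value */
#define EEQ_TYPE_PENDING_EPT_VIOLATION  6ULL          /* assumed value */

/* Same numbering as the kernel's enum pg_level. */
enum { PG_LEVEL_NONE = 0, PG_LEVEL_4K = 1, PG_LEVEL_2M = 2, PG_LEVEL_1G = 3 };

/*
 * Return the guest's accept level, or PG_LEVEL_NONE when the exit carries
 * no level info, i.e. when the type is not ACCEPT. The second case is the
 * early return discussed above: the info field is never looked at.
 */
static int eeq_accept_level(uint64_t eeq)
{
	uint64_t info;

	if ((eeq & EEQ_TYPE_MASK) != EEQ_TYPE_ACCEPT)
		return PG_LEVEL_NONE;

	info = (eeq & EEQ_INFO_MASK) >> EEQ_INFO_SHIFT;

	/* Low 3 bits of info hold the accept level, zero-based. */
	return (int)(info & 0x7) + 1;
}
```

With this layout, an exit of type NONE or PENDING_EPT_VIOLATION yields
PG_LEVEL_NONE, so nothing is split and no inhibit bit is set; only an ACCEPT
exit produces PG_LEVEL_4K or PG_LEVEL_2M.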
> It's not mentioned anywhere in the changelog. The cover letter vaguely
> says:
>
> This mechanism allows support of huge pages for non-Linux TDs and
> also removes the 4KB restriction on pre-fault mappings for Linux
> TDs in RFC v1.
>
> But it's not clear to me how this is solved.
I'll add a comment to tdx_check_accept_level() and update the patch log to make
the picture clearer.
Thanks for pointing it out.