Message-ID: <aCbbkLM/AITIgzXs@yzhao56-desk.sh.intel.com>
Date: Fri, 16 May 2025 14:30:40 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
CC: "pbonzini@...hat.com" <pbonzini@...hat.com>, "seanjc@...gle.com"
	<seanjc@...gle.com>, "Shutemov, Kirill" <kirill.shutemov@...el.com>,
	"quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Li, Xiaoyao"
	<xiaoyao.li@...el.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Hansen,
 Dave" <dave.hansen@...el.com>, "david@...hat.com" <david@...hat.com>,
	"thomas.lendacky@....com" <thomas.lendacky@....com>, "tabba@...gle.com"
	<tabba@...gle.com>, "Li, Zhiquan1" <zhiquan1.li@...el.com>, "Du, Fan"
	<fan.du@...el.com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "michael.roth@....com"
	<michael.roth@....com>, "Weiny, Ira" <ira.weiny@...el.com>, "vbabka@...e.cz"
	<vbabka@...e.cz>, "binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>,
	"ackerleytng@...gle.com" <ackerleytng@...gle.com>, "Yamahata, Isaku"
	<isaku.yamahata@...el.com>, "Peng, Chao P" <chao.p.peng@...el.com>,
	"Annapurve, Vishal" <vannapurve@...gle.com>, "jroedel@...e.de"
	<jroedel@...e.de>, "Miao, Jun" <jun.miao@...el.com>, "pgonda@...gle.com"
	<pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 12/21] KVM: TDX: Determine max mapping level
 according to vCPU's ACCEPT level

On Wed, May 14, 2025 at 05:20:01AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-04-24 at 11:07 +0800, Yan Zhao wrote:
> > Determine the max mapping level of a private GFN according to the vCPU's
> > ACCEPT level specified in the TDCALL TDG.MEM.PAGE.ACCEPT.
> > 
> > When an EPT violation occurs due to a vCPU invoking TDG.MEM.PAGE.ACCEPT
> > before any actual memory access, the vCPU's ACCEPT level is available in
> > the extended exit qualification. Set the vCPU's ACCEPT level as the max
> > mapping level for the faulting GFN. This is necessary because if KVM
> > specifies a mapping level greater than the vCPU's ACCEPT level, and no
> > other vCPUs are accepting at KVM's mapping level, TDG.MEM.PAGE.ACCEPT will
> > produce another EPT violation on the vCPU after re-entering the TD, with
> > the vCPU's ACCEPT level indicated in the extended exit qualification.
> 
> Maybe a little more info would help. It's because the TDX module wants to
> "accept" the smaller size in the real S-EPT, but KVM created a huge page. It
> can't demote to do this without help from KVM.
Ok. Right, the TDX module cannot set the entire 2MB mapping to the accepted
state when the guest only requests acceptance at 4KB, and it cannot perform
the demotion itself without a request from KVM. Therefore, the requested level
must be passed to KVM so that the mirror page table faults at the expected
level.

> > Introduce "violation_gfn_start", "violation_gfn_end", and
> > "violation_request_level" in "struct vcpu_tdx" to pass the vCPU's ACCEPT
> > level to TDX's private_max_mapping_level hook for determining the max
> > mapping level.
> > 
> > Instead of taking some bits of the error_code passed to
> > kvm_mmu_page_fault() and requiring KVM MMU core to check the error_code for
> > a fault's max_level, having TDX's private_max_mapping_level hook check for
> > request level avoids changes to the KVM MMU core. This approach also
> > accommodates future scenarios where the requested mapping level is unknown
> > at the start of tdx_handle_ept_violation() (i.e., before invoking
> > kvm_mmu_page_fault()).
> > 
> > Signed-off-by: Xiaoyao Li <xiaoyao.li@...el.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
> > ---
> >  arch/x86/kvm/vmx/tdx.c      | 36 +++++++++++++++++++++++++++++++++++-
> >  arch/x86/kvm/vmx/tdx.h      |  4 ++++
> >  arch/x86/kvm/vmx/tdx_arch.h |  3 +++
> >  3 files changed, 42 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 86775af85cd8..dd63a634e633 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1859,10 +1859,34 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
> >  	return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
> >  }
> >  
> > +static inline void tdx_get_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
> > +{
> > +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> > +	int level = -1;
> > +
> > +	u64 eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
> > +
> > +	u32 eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> > +			TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> > +
> > +	if (eeq_type == TDX_EXT_EXIT_QUAL_TYPE_ACCEPT) {
> > +		level = (eeq_info & GENMASK(2, 0)) + 1;
> > +
> > +		tdx->violation_gfn_start = gfn_round_for_level(gpa_to_gfn(gpa), level);
> > +		tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
> > +		tdx->violation_request_level = level;
> > +	} else {
> > +		tdx->violation_gfn_start = -1;
> > +		tdx->violation_gfn_end = -1;
> > +		tdx->violation_request_level = -1;
> 
> We had some internal conversations on how KVM used to stuff a bunch of fault
> stuff in the vcpu so it didn't have to pass it around, but now uses the fault
> struct for this. The point was (IIRC) to prevent stale data from getting
> confused on future faults, and it being hard to track what came from where.
> 
> In the TDX case, I think the potential for confusion is still there. The MMU
> code could use stale data if an accept EPT violation happens and control returns
> to userspace, at which point userspace does a KVM_PRE_FAULT_MEMORY. Then it will
> see the stale  tdx->violation_*. Not exactly a common case, but better to not
> have loose ends if we can avoid it.
> 
> Looking more closely, I don't see why it's too hard to pass in a max_fault_level
> into the fault struct. Totally untested rough idea, what do you think?
Thanks for bringing this up and providing the idea below. There was a similar
implementation in the previous TDX huge page v8 [1][2].

This series did not adopt that approach because it requires
tdx_handle_ept_violation() to pass in max_fault_level, which is not always
available at that stage. For example:

In patch 19, when vCPU 1 faults on a GFN at 2MB level and then vCPU 2 faults on
the same GFN at 4KB level, TDX wants to ignore the demotion request caused by
vCPU 2's 4KB-level fault. So, patch 19 sets tdx->violation_request_level to 2MB
in vCPU 2's split callback and fails the split. vCPU 2's
__vmx_handle_ept_violation() will then see RET_PF_RETRY and either retry
locally or return to the guest.

If it retries locally, tdx_gmem_private_max_mapping_level() will return
tdx->violation_request_level, causing KVM to fault at 2MB level for vCPU 2,
resulting in a spurious fault, eventually returning to the guest.

As tdx->violation_request_level is per-vCPU and is reset by
tdx_get_accept_level() on each invocation of tdx_handle_ept_violation() (so it
only affects the TDX local retry loop), it should not hold any stale value.

Alternatively, instead of having tdx_gmem_private_max_mapping_level() return
tdx->violation_request_level, tdx_handle_ept_violation() could grab
tdx->violation_request_level as the max_fault_level to pass to
__vmx_handle_ept_violation().

This series chose to use tdx_gmem_private_max_mapping_level() to avoid
modification to the KVM MMU core.

[1] https://lore.kernel.org/all/4d61104bff388a081ff8f6ae4ac71e05a13e53c3.1708933624.git.isaku.yamahata@intel.com/
[2] https://lore.kernel.org/all/3d2a6bfb033ee1b51f7b875360bd295376c32b54.1708933624.git.isaku.yamahata@intel.com/

> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index faae82eefd99..3dc476da6391 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -282,7 +282,11 @@ enum x86_intercept_stage;
>   * when the guest was accessing private memory.
>   */
>  #define PFERR_PRIVATE_ACCESS   BIT_ULL(49)
> -#define PFERR_SYNTHETIC_MASK   (PFERR_IMPLICIT_ACCESS | PFERR_PRIVATE_ACCESS)
> +
> +#define PFERR_FAULT_LEVEL_MASK (BIT_ULL(50) | BIT_ULL(51) | BIT_ULL(52))
> +#define PFERR_FAULT_LEVEL_SHIFT 50
> +
> +#define PFERR_SYNTHETIC_MASK   (PFERR_IMPLICIT_ACCESS | PFERR_PRIVATE_ACCESS | PFERR_FAULT_LEVEL_MASK)
>  
>  /* apic attention bits */
>  #define KVM_APIC_CHECK_VAPIC   0
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 1c1764f46e66..bdb1b0eabd67 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -361,7 +361,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>                 .nx_huge_page_workaround_enabled =
>                         is_nx_huge_page_enabled(vcpu->kvm),
>  
> -               .max_level = KVM_MAX_HUGEPAGE_LEVEL,
> +               .max_level = (err & PFERR_FAULT_LEVEL_MASK) >> PFERR_FAULT_LEVEL_SHIFT,
>                 .req_level = PG_LEVEL_4K,
>                 .goal_level = PG_LEVEL_4K,
>                 .is_private = err & PFERR_PRIVATE_ACCESS,
> diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
> index 8f46a06e2c44..2f22b294ef8b 100644
> --- a/arch/x86/kvm/vmx/common.h
> +++ b/arch/x86/kvm/vmx/common.h
> @@ -83,7 +83,8 @@ static inline bool vt_is_tdx_private_gpa(struct kvm *kvm, gpa_t gpa)
>  }
>  
>  static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
> -                                            unsigned long exit_qualification)
> +                                            unsigned long exit_qualification,
> +                                            u8 max_fault_level)
>  {
>         u64 error_code;
>  
> @@ -107,6 +108,10 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
>         if (vt_is_tdx_private_gpa(vcpu->kvm, gpa))
>                 error_code |= PFERR_PRIVATE_ACCESS;
>  
> +       BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL >= (1 << hweight64(PFERR_FAULT_LEVEL_MASK)));
> +
> +       error_code |= (u64)max_fault_level << PFERR_FAULT_LEVEL_SHIFT;
> +
>         return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
>  }
>  
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index e994a6c08a75..19047de4d98d 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2027,7 +2027,7 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
>          * handle retries locally in their EPT violation handlers.
>          */
>         while (1) {
> -               ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> +               ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual, KVM_MAX_HUGEPAGE_LEVEL);
>  
>                 if (ret != RET_PF_RETRY || !local_retry)
>                         break;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index ef2d7208dd20..b70a2ff35884 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5782,7 +5782,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>         if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa)))
>                 return kvm_emulate_instruction(vcpu, 0);
>  
> -       return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
> +       return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification, KVM_MAX_HUGEPAGE_LEVEL);
>  }
>  
>  static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
> 
> 
