Message-ID: <7307f6308d6e506bad00749307400fc6e65bc6e4.camel@intel.com>
Date: Tue, 13 May 2025 21:20:01 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "pbonzini@...hat.com" <pbonzini@...hat.com>, "seanjc@...gle.com"
<seanjc@...gle.com>, "Zhao, Yan Y" <yan.y.zhao@...el.com>
CC: "Shutemov, Kirill" <kirill.shutemov@...el.com>, "quic_eberman@...cinc.com"
<quic_eberman@...cinc.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Hansen, Dave"
<dave.hansen@...el.com>, "david@...hat.com" <david@...hat.com>,
"thomas.lendacky@....com" <thomas.lendacky@....com>, "tabba@...gle.com"
<tabba@...gle.com>, "Li, Zhiquan1" <zhiquan1.li@...el.com>, "Du, Fan"
<fan.du@...el.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "michael.roth@....com"
<michael.roth@....com>, "Weiny, Ira" <ira.weiny@...el.com>, "vbabka@...e.cz"
<vbabka@...e.cz>, "binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>,
"ackerleytng@...gle.com" <ackerleytng@...gle.com>, "Yamahata, Isaku"
<isaku.yamahata@...el.com>, "Peng, Chao P" <chao.p.peng@...el.com>,
"Annapurve, Vishal" <vannapurve@...gle.com>, "jroedel@...e.de"
<jroedel@...e.de>, "Miao, Jun" <jun.miao@...el.com>, "pgonda@...gle.com"
<pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according
to vCPU's ACCEPT level
On Thu, 2025-04-24 at 11:07 +0800, Yan Zhao wrote:
> Determine the max mapping level of a private GFN according to the vCPU's
> ACCEPT level specified in the TDCALL TDG.MEM.PAGE.ACCEPT.
>
> When an EPT violation occurs due to a vCPU invoking TDG.MEM.PAGE.ACCEPT
> before any actual memory access, the vCPU's ACCEPT level is available in
> the extended exit qualification. Set the vCPU's ACCEPT level as the max
> mapping level for the faulting GFN. This is necessary because if KVM
> specifies a mapping level greater than the vCPU's ACCEPT level, and no
> other vCPUs are accepting at KVM's mapping level, TDG.MEM.PAGE.ACCEPT will
> produce another EPT violation on the vCPU after re-entering the TD, with
> the vCPU's ACCEPT level indicated in the extended exit qualification.
Maybe a little more info would help. It's because the TDX module wants to
"accept" at the smaller size in the real S-EPT, but KVM created a huge page,
and the TDX module can't demote it without help from KVM.
>
> Introduce "violation_gfn_start", "violation_gfn_end", and
> "violation_request_level" in "struct vcpu_tdx" to pass the vCPU's ACCEPT
> level to TDX's private_max_mapping_level hook for determining the max
> mapping level.
>
> Instead of taking some bits of the error_code passed to
> kvm_mmu_page_fault() and requiring KVM MMU core to check the error_code for
> a fault's max_level, having TDX's private_max_mapping_level hook check for
> request level avoids changes to the KVM MMU core. This approach also
> accommodates future scenarios where the requested mapping level is unknown
> at the start of tdx_handle_ept_violation() (i.e., before invoking
> kvm_mmu_page_fault()).
>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@...el.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
> ---
> arch/x86/kvm/vmx/tdx.c | 36 +++++++++++++++++++++++++++++++++++-
> arch/x86/kvm/vmx/tdx.h | 4 ++++
> arch/x86/kvm/vmx/tdx_arch.h | 3 +++
> 3 files changed, 42 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 86775af85cd8..dd63a634e633 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1859,10 +1859,34 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
> return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
> }
>
> +static inline void tdx_get_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
> +{
> + struct vcpu_tdx *tdx = to_tdx(vcpu);
> + int level = -1;
> +
> + u64 eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
> +
> + u32 eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> + TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> +
> + if (eeq_type == TDX_EXT_EXIT_QUAL_TYPE_ACCEPT) {
> + level = (eeq_info & GENMASK(2, 0)) + 1;
> +
> + tdx->violation_gfn_start = gfn_round_for_level(gpa_to_gfn(gpa), level);
> + tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
> + tdx->violation_request_level = level;
> + } else {
> + tdx->violation_gfn_start = -1;
> + tdx->violation_gfn_end = -1;
> + tdx->violation_request_level = -1;
We had some internal conversations about how KVM used to stuff a bunch of fault
state in the vcpu so it didn't have to pass it around, but now uses the fault
struct for this. The point was (IIRC) to prevent stale data from leaking into
future faults, and that it was hard to track what came from where.
In the TDX case, I think the potential for confusion is still there. The MMU
code could use stale data if an accept EPT violation happens and control returns
to userspace, at which point userspace does a KVM_PRE_FAULT_MEMORY. Then it will
see the stale tdx->violation_*. Not exactly a common case, but better to not
have loose ends if we can avoid it.
Looking more closely, I don't see why it's too hard to pass a max_fault_level
into the fault struct. Totally untested rough idea, what do you think?
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index faae82eefd99..3dc476da6391 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -282,7 +282,11 @@ enum x86_intercept_stage;
* when the guest was accessing private memory.
*/
#define PFERR_PRIVATE_ACCESS BIT_ULL(49)
-#define PFERR_SYNTHETIC_MASK (PFERR_IMPLICIT_ACCESS | PFERR_PRIVATE_ACCESS)
+
+#define PFERR_FAULT_LEVEL_MASK (BIT_ULL(50) | BIT_ULL(51) | BIT_ULL(52))
+#define PFERR_FAULT_LEVEL_SHIFT 50
+
+#define PFERR_SYNTHETIC_MASK (PFERR_IMPLICIT_ACCESS | PFERR_PRIVATE_ACCESS | PFERR_FAULT_LEVEL_MASK)
/* apic attention bits */
#define KVM_APIC_CHECK_VAPIC 0
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 1c1764f46e66..bdb1b0eabd67 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -361,7 +361,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
.nx_huge_page_workaround_enabled =
is_nx_huge_page_enabled(vcpu->kvm),
- .max_level = KVM_MAX_HUGEPAGE_LEVEL,
+ .max_level = (err & PFERR_FAULT_LEVEL_MASK) >> PFERR_FAULT_LEVEL_SHIFT,
.req_level = PG_LEVEL_4K,
.goal_level = PG_LEVEL_4K,
.is_private = err & PFERR_PRIVATE_ACCESS,
diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 8f46a06e2c44..2f22b294ef8b 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -83,7 +83,8 @@ static inline bool vt_is_tdx_private_gpa(struct kvm *kvm, gpa_t gpa)
}
static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
- unsigned long exit_qualification)
+ unsigned long exit_qualification,
+ u8 max_fault_level)
{
u64 error_code;
@@ -107,6 +108,10 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
if (vt_is_tdx_private_gpa(vcpu->kvm, gpa))
error_code |= PFERR_PRIVATE_ACCESS;
+ BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL >= (1 << hweight64(PFERR_FAULT_LEVEL_MASK)));
+
+ error_code |= (u64)max_fault_level << PFERR_FAULT_LEVEL_SHIFT;
+
return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
}
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e994a6c08a75..19047de4d98d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2027,7 +2027,7 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
* handle retries locally in their EPT violation handlers.
*/
while (1) {
- ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
+ ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual, KVM_MAX_HUGEPAGE_LEVEL);
if (ret != RET_PF_RETRY || !local_retry)
break;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ef2d7208dd20..b70a2ff35884 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5782,7 +5782,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa)))
return kvm_emulate_instruction(vcpu, 0);
- return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
- return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
+ return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification, KVM_MAX_HUGEPAGE_LEVEL);
}
static int handle_ept_misconfig(struct kvm_vcpu *vcpu)