[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f105f60b-5823-451f-ae26-8a103aa003d4@intel.com>
Date: Fri, 16 May 2025 14:12:21 +0800
From: Xiaoyao Li <xiaoyao.li@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
"pbonzini@...hat.com" <pbonzini@...hat.com>,
"seanjc@...gle.com" <seanjc@...gle.com>, "Zhao, Yan Y" <yan.y.zhao@...el.com>
Cc: "Shutemov, Kirill" <kirill.shutemov@...el.com>,
"quic_eberman@...cinc.com" <quic_eberman@...cinc.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"Hansen, Dave" <dave.hansen@...el.com>, "david@...hat.com"
<david@...hat.com>, "thomas.lendacky@....com" <thomas.lendacky@....com>,
"tabba@...gle.com" <tabba@...gle.com>, "Li, Zhiquan1"
<zhiquan1.li@...el.com>, "Du, Fan" <fan.du@...el.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"michael.roth@....com" <michael.roth@....com>,
"Weiny, Ira" <ira.weiny@...el.com>, "vbabka@...e.cz" <vbabka@...e.cz>,
"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>,
"ackerleytng@...gle.com" <ackerleytng@...gle.com>,
"Yamahata, Isaku" <isaku.yamahata@...el.com>,
"Peng, Chao P" <chao.p.peng@...el.com>,
"Annapurve, Vishal" <vannapurve@...gle.com>,
"jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun" <jun.miao@...el.com>,
"pgonda@...gle.com" <pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according
to vCPU's ACCEPT level
On 5/14/2025 5:20 AM, Edgecombe, Rick P wrote:
> On Thu, 2025-04-24 at 11:07 +0800, Yan Zhao wrote:
>> Determine the max mapping level of a private GFN according to the vCPU's
>> ACCEPT level specified in the TDCALL TDG.MEM.PAGE.ACCEPT.
>>
>> When an EPT violation occurs due to a vCPU invoking TDG.MEM.PAGE.ACCEPT
>> before any actual memory access, the vCPU's ACCEPT level is available in
>> the extended exit qualification. Set the vCPU's ACCEPT level as the max
>> mapping level for the faulting GFN. This is necessary because if KVM
>> specifies a mapping level greater than the vCPU's ACCEPT level, and no
>> other vCPUs are accepting at KVM's mapping level, TDG.MEM.PAGE.ACCEPT will
>> produce another EPT violation on the vCPU after re-entering the TD, with
>> the vCPU's ACCEPT level indicated in the extended exit qualification.
>
> Maybe a little more info would help. It's because the TDX module wants to
> "accept" the smaller size in the real S-EPT, but KVM created a huge page. It
> can't demote to do this without help from KVM.
>
>>
>> Introduce "violation_gfn_start", "violation_gfn_end", and
>> "violation_request_level" in "struct vcpu_tdx" to pass the vCPU's ACCEPT
>> level to TDX's private_max_mapping_level hook for determining the max
>> mapping level.
>>
>> Instead of taking some bits of the error_code passed to
>> kvm_mmu_page_fault() and requiring KVM MMU core to check the error_code for
>> a fault's max_level, having TDX's private_max_mapping_level hook check for
>> request level avoids changes to the KVM MMU core. This approach also
>> accommodates future scenarios where the requested mapping level is unknown
>> at the start of tdx_handle_ept_violation() (i.e., before invoking
>> kvm_mmu_page_fault()).
>>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@...el.com>
>> Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
>> ---
>> arch/x86/kvm/vmx/tdx.c | 36 +++++++++++++++++++++++++++++++++++-
>> arch/x86/kvm/vmx/tdx.h | 4 ++++
>> arch/x86/kvm/vmx/tdx_arch.h | 3 +++
>> 3 files changed, 42 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> index 86775af85cd8..dd63a634e633 100644
>> --- a/arch/x86/kvm/vmx/tdx.c
>> +++ b/arch/x86/kvm/vmx/tdx.c
>> @@ -1859,10 +1859,34 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
>> return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
>> }
>>
>> +static inline void tdx_get_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
>> +{
>> + struct vcpu_tdx *tdx = to_tdx(vcpu);
>> + int level = -1;
>> +
>> + u64 eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
>> +
>> + u32 eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
>> + TDX_EXT_EXIT_QUAL_INFO_SHIFT;
>> +
>> + if (eeq_type == TDX_EXT_EXIT_QUAL_TYPE_ACCEPT) {
>> + level = (eeq_info & GENMASK(2, 0)) + 1;
>> +
>> + tdx->violation_gfn_start = gfn_round_for_level(gpa_to_gfn(gpa), level);
>> + tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
>> + tdx->violation_request_level = level;
>> + } else {
>> + tdx->violation_gfn_start = -1;
>> + tdx->violation_gfn_end = -1;
>> + tdx->violation_request_level = -1;
>
> We had some internal conversations on how KVM used to stuff a bunch of fault
> stuff in the vcpu so it didn't have to pass it around, but now uses the fault
> struct for this. The point was (IIRC) to prevent stale data from getting
> confused on future faults, and it being hard to track what came from where.
>
> In the TDX case, I think the potential for confusion is still there. The MMU
> code could use stale data if an accept EPT violation happens and control returns
> to userspace, at which point userspace does a KVM_PRE_FAULT_MEMORY. Then it will
> see the stale tdx->violation_*. Not exactly a common case, but better to not
> have loose ends if we can avoid it.
>
> Looking more closely, I don't see why it's too hard to pass in a max_fault_level
> into the fault struct. Totally untested rough idea, what do you think?
the original huge page support patch did encode the level info in
error_code. So it has my vote.
https://lore.kernel.org/all/4d61104bff388a081ff8f6ae4ac71e05a13e53c3.1708933624.git.isaku.yamahata@intel.com/
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index faae82eefd99..3dc476da6391 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -282,7 +282,11 @@ enum x86_intercept_stage;
> * when the guest was accessing private memory.
> */
> #define PFERR_PRIVATE_ACCESS BIT_ULL(49)
> -#define PFERR_SYNTHETIC_MASK (PFERR_IMPLICIT_ACCESS | PFERR_PRIVATE_ACCESS)
> +
> +#define PFERR_FAULT_LEVEL_MASK (BIT_ULL(50) | BIT_ULL(51) | BIT_ULL(52))
> +#define PFERR_FAULT_LEVEL_SHIFT 50
> +
> +#define PFERR_SYNTHETIC_MASK (PFERR_IMPLICIT_ACCESS | PFERR_PRIVATE_ACCESS |
> PFERR_FAULT_LEVEL_MASK)
>
> /* apic attention bits */
> #define KVM_APIC_CHECK_VAPIC 0
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 1c1764f46e66..bdb1b0eabd67 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -361,7 +361,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu
> *vcpu, gpa_t cr2_or_gpa,
> .nx_huge_page_workaround_enabled =
> is_nx_huge_page_enabled(vcpu->kvm),
>
> - .max_level = KVM_MAX_HUGEPAGE_LEVEL,
> + .max_level = (err & PFERR_FAULT_LEVEL_MASK) >>
> PFERR_FAULT_LEVEL_SHIFT,
> .req_level = PG_LEVEL_4K,
> .goal_level = PG_LEVEL_4K,
> .is_private = err & PFERR_PRIVATE_ACCESS,
> diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
> index 8f46a06e2c44..2f22b294ef8b 100644
> --- a/arch/x86/kvm/vmx/common.h
> +++ b/arch/x86/kvm/vmx/common.h
> @@ -83,7 +83,8 @@ static inline bool vt_is_tdx_private_gpa(struct kvm *kvm,
> gpa_t gpa)
> }
>
> static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
> - unsigned long exit_qualification)
> + unsigned long exit_qualification,
> + u8 max_fault_level)
> {
> u64 error_code;
>
> @@ -107,6 +108,10 @@ static inline int __vmx_handle_ept_violation(struct
> kvm_vcpu *vcpu, gpa_t gpa,
> if (vt_is_tdx_private_gpa(vcpu->kvm, gpa))
> error_code |= PFERR_PRIVATE_ACCESS;
>
> + BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL >= (1 <<
> hweight64(PFERR_FAULT_LEVEL_MASK)));
> +
> + error_code |= (u64)max_fault_level << PFERR_FAULT_LEVEL_SHIFT;
> +
> return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> }
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index e994a6c08a75..19047de4d98d 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2027,7 +2027,7 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> * handle retries locally in their EPT violation handlers.
> */
> while (1) {
> - ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> + ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual,
> KVM_MAX_HUGEPAGE_LEVEL);
>
> if (ret != RET_PF_RETRY || !local_retry)
> break;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index ef2d7208dd20..b70a2ff35884 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5782,7 +5782,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu,
> gpa)))
> return kvm_emulate_instruction(vcpu, 0);
>
> - return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
> + return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification,
> KVM_MAX_HUGEPAGE_LEVEL);
> }
>
> static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
>
>
Powered by blists - more mailing lists