linux-kernel - Re: [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according to vCPU's ACCEPT level

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f105f60b-5823-451f-ae26-8a103aa003d4@intel.com>
Date: Fri, 16 May 2025 14:12:21 +0800
From: Xiaoyao Li <xiaoyao.li@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
 "pbonzini@...hat.com" <pbonzini@...hat.com>,
 "seanjc@...gle.com" <seanjc@...gle.com>, "Zhao, Yan Y" <yan.y.zhao@...el.com>
Cc: "Shutemov, Kirill" <kirill.shutemov@...el.com>,
 "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>,
 "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
 "Hansen, Dave" <dave.hansen@...el.com>, "david@...hat.com"
 <david@...hat.com>, "thomas.lendacky@....com" <thomas.lendacky@....com>,
 "tabba@...gle.com" <tabba@...gle.com>, "Li, Zhiquan1"
 <zhiquan1.li@...el.com>, "Du, Fan" <fan.du@...el.com>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 "michael.roth@....com" <michael.roth@....com>,
 "Weiny, Ira" <ira.weiny@...el.com>, "vbabka@...e.cz" <vbabka@...e.cz>,
 "binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>,
 "ackerleytng@...gle.com" <ackerleytng@...gle.com>,
 "Yamahata, Isaku" <isaku.yamahata@...el.com>,
 "Peng, Chao P" <chao.p.peng@...el.com>,
 "Annapurve, Vishal" <vannapurve@...gle.com>,
 "jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun" <jun.miao@...el.com>,
 "pgonda@...gle.com" <pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according
 to vCPU's ACCEPT level

On 5/14/2025 5:20 AM, Edgecombe, Rick P wrote:
> On Thu, 2025-04-24 at 11:07 +0800, Yan Zhao wrote:
>> Determine the max mapping level of a private GFN according to the vCPU's
>> ACCEPT level specified in the TDCALL TDG.MEM.PAGE.ACCEPT.
>>
>> When an EPT violation occurs due to a vCPU invoking TDG.MEM.PAGE.ACCEPT
>> before any actual memory access, the vCPU's ACCEPT level is available in
>> the extended exit qualification. Set the vCPU's ACCEPT level as the max
>> mapping level for the faulting GFN. This is necessary because if KVM
>> specifies a mapping level greater than the vCPU's ACCEPT level, and no
>> other vCPUs are accepting at KVM's mapping level, TDG.MEM.PAGE.ACCEPT will
>> produce another EPT violation on the vCPU after re-entering the TD, with
>> the vCPU's ACCEPT level indicated in the extended exit qualification.
> 
> Maybe a little more info would help. It's because the TDX module wants to
> "accept" the smaller size in the real S-EPT, but KVM created a huge page. It
> can't demote to do this without help from KVM.
> 
>>
>> Introduce "violation_gfn_start", "violation_gfn_end", and
>> "violation_request_level" in "struct vcpu_tdx" to pass the vCPU's ACCEPT
>> level to TDX's private_max_mapping_level hook for determining the max
>> mapping level.
>>
>> Instead of taking some bits of the error_code passed to
>> kvm_mmu_page_fault() and requiring KVM MMU core to check the error_code for
>> a fault's max_level, having TDX's private_max_mapping_level hook check for
>> request level avoids changes to the KVM MMU core. This approach also
>> accommodates future scenarios where the requested mapping level is unknown
>> at the start of tdx_handle_ept_violation() (i.e., before invoking
>> kvm_mmu_page_fault()).
>>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@...el.com>
>> Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
>> ---
>>   arch/x86/kvm/vmx/tdx.c      | 36 +++++++++++++++++++++++++++++++++++-
>>   arch/x86/kvm/vmx/tdx.h      |  4 ++++
>>   arch/x86/kvm/vmx/tdx_arch.h |  3 +++
>>   3 files changed, 42 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> index 86775af85cd8..dd63a634e633 100644
>> --- a/arch/x86/kvm/vmx/tdx.c
>> +++ b/arch/x86/kvm/vmx/tdx.c
>> @@ -1859,10 +1859,34 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
>>   	return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
>>   }
>>   
>> +static inline void tdx_get_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
>> +{
>> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
>> +	int level = -1;
>> +
>> +	u64 eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
>> +
>> +	u32 eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
>> +			TDX_EXT_EXIT_QUAL_INFO_SHIFT;
>> +
>> +	if (eeq_type == TDX_EXT_EXIT_QUAL_TYPE_ACCEPT) {
>> +		level = (eeq_info & GENMASK(2, 0)) + 1;
>> +
>> +		tdx->violation_gfn_start = gfn_round_for_level(gpa_to_gfn(gpa), level);
>> +		tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
>> +		tdx->violation_request_level = level;
>> +	} else {
>> +		tdx->violation_gfn_start = -1;
>> +		tdx->violation_gfn_end = -1;
>> +		tdx->violation_request_level = -1;
> 
> We had some internal conversations on how KVM used to stuff a bunch of fault
> stuff in the vcpu so it didn't have to pass it around, but now uses the fault
> struct for this. The point was (IIRC) to prevent stale data from getting
> confused on future faults, and it being hard to track what came from where.
> 
> In the TDX case, I think the potential for confusion is still there. The MMU
> code could use stale data if an accept EPT violation happens and control returns
> to userspace, at which point userspace does a KVM_PRE_FAULT_MEMORY. Then it will
> see the stale  tdx->violation_*. Not exactly a common case, but better to not
> have loose ends if we can avoid it.
> 
> Looking more closely, I don't see why it's too hard to pass in a max_fault_level
> into the fault struct. Totally untested rough idea, what do you think?

the original huge page support patch did encode the level info in 
error_code. So it has my vote.

https://lore.kernel.org/all/4d61104bff388a081ff8f6ae4ac71e05a13e53c3.1708933624.git.isaku.yamahata@intel.com/

> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index faae82eefd99..3dc476da6391 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -282,7 +282,11 @@ enum x86_intercept_stage;
>    * when the guest was accessing private memory.
>    */
>   #define PFERR_PRIVATE_ACCESS   BIT_ULL(49)
> -#define PFERR_SYNTHETIC_MASK   (PFERR_IMPLICIT_ACCESS | PFERR_PRIVATE_ACCESS)
> +
> +#define PFERR_FAULT_LEVEL_MASK (BIT_ULL(50) | BIT_ULL(51) | BIT_ULL(52))
> +#define PFERR_FAULT_LEVEL_SHIFT 50
> +
> +#define PFERR_SYNTHETIC_MASK   (PFERR_IMPLICIT_ACCESS | PFERR_PRIVATE_ACCESS |
> PFERR_FAULT_LEVEL_MASK)
>   
>   /* apic attention bits */
>   #define KVM_APIC_CHECK_VAPIC   0
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 1c1764f46e66..bdb1b0eabd67 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -361,7 +361,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu
> *vcpu, gpa_t cr2_or_gpa,
>                  .nx_huge_page_workaround_enabled =
>                          is_nx_huge_page_enabled(vcpu->kvm),
>   
> -               .max_level = KVM_MAX_HUGEPAGE_LEVEL,
> +               .max_level = (err & PFERR_FAULT_LEVEL_MASK) >>
> PFERR_FAULT_LEVEL_SHIFT,
>                  .req_level = PG_LEVEL_4K,
>                  .goal_level = PG_LEVEL_4K,
>                  .is_private = err & PFERR_PRIVATE_ACCESS,
> diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
> index 8f46a06e2c44..2f22b294ef8b 100644
> --- a/arch/x86/kvm/vmx/common.h
> +++ b/arch/x86/kvm/vmx/common.h
> @@ -83,7 +83,8 @@ static inline bool vt_is_tdx_private_gpa(struct kvm *kvm,
> gpa_t gpa)
>   }
>   
>   static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
> -                                            unsigned long exit_qualification)
> +                                            unsigned long exit_qualification,
> +                                            u8 max_fault_level)
>   {
>          u64 error_code;
>   
> @@ -107,6 +108,10 @@ static inline int __vmx_handle_ept_violation(struct
> kvm_vcpu *vcpu, gpa_t gpa,
>          if (vt_is_tdx_private_gpa(vcpu->kvm, gpa))
>                  error_code |= PFERR_PRIVATE_ACCESS;
>   
> +       BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL >= (1 <<
> hweight64(PFERR_FAULT_LEVEL_MASK)));
> +
> +       error_code |= (u64)max_fault_level << PFERR_FAULT_LEVEL_SHIFT;
> +
>          return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
>   }
>   
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index e994a6c08a75..19047de4d98d 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2027,7 +2027,7 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
>           * handle retries locally in their EPT violation handlers.
>           */
>          while (1) {
> -               ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> +               ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual,
> KVM_MAX_HUGEPAGE_LEVEL);
>   
>                  if (ret != RET_PF_RETRY || !local_retry)
>                          break;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index ef2d7208dd20..b70a2ff35884 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5782,7 +5782,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>          if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu,
> gpa)))
>                  return kvm_emulate_instruction(vcpu, 0);
>   
> -       return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
> +       return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification,
> KVM_MAX_HUGEPAGE_LEVEL);
>   }
>   
>   static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
> 
>