linux-kernel - Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <b50bfb56-c2f2-493e-bd87-1c5aaa8bfb59@intel.com>
Date: Tue, 17 Sep 2024 14:11:01 +1200
From: "Huang, Kai" <kai.huang@...el.com>
To: Sean Christopherson <seanjc@...gle.com>, Yan Zhao <yan.y.zhao@...el.com>
CC: Rick P Edgecombe <rick.p.edgecombe@...el.com>, "pbonzini@...hat.com"
	<pbonzini@...hat.com>, Yuan Yao <yuan.yao@...el.com>,
	"isaku.yamahata@...il.com" <isaku.yamahata@...il.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "dmatlack@...gle.com"
	<dmatlack@...gle.com>, "nik.borisov@...e.com" <nik.borisov@...e.com>
Subject: Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with
 operand SEPT



On 14/09/2024 5:23 am, Sean Christopherson wrote:
> On Fri, Sep 13, 2024, Yan Zhao wrote:
>> This is a lock status report of TDX module for current SEAMCALL retry issue
>> based on code in TDX module public repo https://github.com/intel/tdx-module.git
>> branch TDX_1.5.05.
>>
>> TL;DR:
>> - tdh_mem_track() can contend with tdh_vp_enter().
>> - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> 
> The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
> install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for
> whatever reason.
> 
> Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> hits the fault?
> 
> For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> desirable because in many cases, the winning task will install a valid mapping
> before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> instruction is re-executed.  In the happy case, that provides optimal performance
> as KVM doesn't introduce any extra delay/latency.
> 
> But for TDX, the math is different as the cost of a re-hitting a fault is much,
> much higher, especially in light of the zero-step issues.
> 
> E.g. if the TDP MMU returns a unique error code for the frozen case, and
> kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> then the TDX EPT violation path can safely retry locally, similar to the do-while
> loop in kvm_tdp_map_page().
> 
> The only part I don't like about this idea is having two "retry" return values,
> which creates the potential for bugs due to checking one but not the other.
> 
> Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
> option better even though the out-param is a bit gross, because it makes it more
> obvious that the "frozen_spte" is a special case that doesn't need attention for
> most paths.
> 

[...]

>   
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 5a475a6456d4..cbf9e46203f3 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1174,6 +1174,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>   
>   retry:
>          rcu_read_unlock();
> +       if (ret == RET_PF_RETRY && is_frozen_spte(iter.old_spte))
> +               return RET_PF_RETRY_FOZEN;

Ack the whole "retry on frozen" approach, either with RET_PF_RETRY_FOZEN 
or fault->frozen_spte.

One minor side effect:

For normal VMs, the fault handler can also see a frozen spte, e.g, when 
kvm_tdp_mmu_map() checks the middle level SPTE:

	/*
          * If SPTE has been frozen by another thread, just give up and
          * retry, avoiding unnecessary page table allocation and free.
          */
         if (is_frozen_spte(iter.old_spte))
         	goto retry;

So for normal VM this RET_PF_RETRY_FOZEN will change "go back to guest 
to retry" to "retry in KVM internally".

As you mentioned above for normal VMs we probably always want to "go 
back to guest to retry" even for FROZEN SPTE, but I guess this is a 
minor issue that we can even notice.

Or we can additionally add:

	if (ret == RET_PF_RETRY && is_frozen_spte(iter.old_spte)
			&& is_mirrored_sptep(iter.sptep))
		return RET_PF_RETRY_FOZEN;

So it only applies to TDX.