Message-ID: <ZvPrqMj1BWrkkwqN@yzhao56-desk.sh.intel.com>
Date: Wed, 25 Sep 2024 18:53:28 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>, Rick P Edgecombe
<rick.p.edgecombe@...el.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
Yuan Yao <yuan.yao@...el.com>, Kai Huang <kai.huang@...el.com>,
"isaku.yamahata@...il.com" <isaku.yamahata@...il.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "dmatlack@...gle.com"
<dmatlack@...gle.com>, "nik.borisov@...e.com" <nik.borisov@...e.com>
Subject: Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY
with operand SEPT
On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
> On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> > On Fri, Sep 13, 2024, Yan Zhao wrote:
> > > This is a lock status report of the TDX module for the current SEAMCALL
> > > retry issue, based on the code in the TDX module public repo
> > > https://github.com/intel/tdx-module.git, branch TDX_1.5.05.
> > >
> > > TL;DR:
> > > - tdh_mem_track() can contend with tdh_vp_enter().
> > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> >
> > The zero-step logic seems to be the most problematic. E.g. if KVM is trying to
> > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> > trigger the zero-step mitigation if the vCPU that "wins" gets delayed for
> > whatever reason.
> >
> > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> > hits the fault?
> >
> > For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> > desirable because in many cases, the winning task will install a valid mapping
> > before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> > instruction is re-executed. In the happy case, that provides optimal performance
> > as KVM doesn't introduce any extra delay/latency.
> >
> > But for TDX, the math is different as the cost of re-hitting a fault is much,
> > much higher, especially in light of the zero-step issues.
> >
> > E.g. if the TDP MMU returns a unique error code for the frozen case, and
> > kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> > then the TDX EPT violation path can safely retry locally, similar to the do-while
> > loop in kvm_tdp_map_page().
> >
> > The only part I don't like about this idea is having two "retry" return values,
> > which creates the potential for bugs due to checking one but not the other.
> >
> > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> > to the TDX S-EPT fault handler that the SPTE is frozen. I think I like that
> > option better even though the out-param is a bit gross, because it makes it more
> > obvious that the "frozen_spte" is a special case that doesn't need attention for
> > most paths.
> Good idea.
> But could we extend it a bit more to allow TDX's EPT violation handler to also
> retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY?
I'm asking this because merely avoiding tdh_vp_enter() on vCPUs that see
FROZEN_SPTE might not be enough to prevent the zero-step mitigation.
E.g. in the selftest below, with a TD configured with pending_ve_disable=N,
the zero-step mitigation can be triggered on a vCPU that is stuck in EPT
violation VM exits more than 6 times (because user space does not do the
memslot conversion correctly).
So, if vCPU A wins the chance to call tdh_mem_page_aug(), the SEAMCALL may
contend with the zero-step mitigation code in tdh_vp_enter() on vCPU B,
which is stuck in EPT violation VM exits.
#include <stdint.h>
#include "kvm_util.h"
#include "processor.h"
#include "tdx/tdcall.h"
#include "tdx/tdx.h"
#include "tdx/tdx_util.h"
#include "tdx/test_util.h"
#include "test_util.h"
/*
* 0x80000000 is arbitrarily selected, but it should not overlap with selftest
* code or boot page.
*/
#define ZERO_STEP_TEST_AREA_GPA (0x80000000)
/* Test area GVA is arbitrarily selected */
#define ZERO_STEP_AREA_GVA_PRIVATE (0x90000000)
/* The test area is 2MB in size */
#define ZERO_STEP_AREA_SIZE (2 << 20)
#define ZERO_STEP_ASSERT(x)				\
	do {						\
		if (!(x))				\
			tdx_test_fatal(__LINE__);	\
	} while (0)
#define ZERO_STEP_ACCEPT_PRINT_PORT 0x87
#define ZERO_STEP_THRESHOLD 6
#define TRIGGER_ZERO_STEP_MITIGATION 1
static int convert_request_cnt;
static void guest_test_zero_step(void)
{
	void *test_area_gva_private = (void *)ZERO_STEP_AREA_GVA_PRIVATE;

	memset(test_area_gva_private, 1, 8);
	tdx_test_success();
}
static void guest_ve_handler(struct ex_regs *regs)
{
	uint64_t ret;
	struct ve_info ve;

	ret = tdg_vp_veinfo_get(&ve);
	ZERO_STEP_ASSERT(!ret);

	/* For this test, we will only handle EXIT_REASON_EPT_VIOLATION */
	ZERO_STEP_ASSERT(ve.exit_reason == EXIT_REASON_EPT_VIOLATION);

	tdx_test_send_64bit(ZERO_STEP_ACCEPT_PRINT_PORT, ve.gpa);

#define MEM_PAGE_ACCEPT_LEVEL_4K 0
#define MEM_PAGE_ACCEPT_LEVEL_2M 1
	ret = tdg_mem_page_accept(ve.gpa, MEM_PAGE_ACCEPT_LEVEL_4K);
	ZERO_STEP_ASSERT(!ret);
}
static void zero_step_test(void)
{
	struct kvm_vm *vm;
	struct kvm_vcpu *vcpu;
	void *guest_code;
	uint64_t test_area_npages;
	vm_vaddr_t test_area_gva_private;

	vm = td_create();
	td_initialize(vm, VM_MEM_SRC_ANONYMOUS, 0);
	guest_code = guest_test_zero_step;
	vcpu = td_vcpu_add(vm, 0, guest_code);
	vm_install_exception_handler(vm, VE_VECTOR, guest_ve_handler);

	test_area_npages = ZERO_STEP_AREA_SIZE / vm->page_size;
	vm_userspace_mem_region_add(vm,
				    VM_MEM_SRC_ANONYMOUS, ZERO_STEP_TEST_AREA_GPA,
				    3, test_area_npages, KVM_MEM_GUEST_MEMFD);
	vm->memslots[MEM_REGION_TEST_DATA] = 3;

	test_area_gva_private = ____vm_vaddr_alloc(
			vm, ZERO_STEP_AREA_SIZE, ZERO_STEP_AREA_GVA_PRIVATE,
			ZERO_STEP_TEST_AREA_GPA, MEM_REGION_TEST_DATA, true);
	TEST_ASSERT_EQ(test_area_gva_private, ZERO_STEP_AREA_GVA_PRIVATE);

	td_finalize(vm);

	handle_memory_conversion(vm, ZERO_STEP_TEST_AREA_GPA,
				 ZERO_STEP_AREA_SIZE, false);

	for (;;) {
		vcpu_run(vcpu);
		if (vcpu->run->exit_reason == KVM_EXIT_IO &&
		    vcpu->run->io.port == ZERO_STEP_ACCEPT_PRINT_PORT) {
			uint64_t gpa = tdx_test_read_64bit(
					vcpu, ZERO_STEP_ACCEPT_PRINT_PORT);

			printf("\t ... guest accepting 1 page at GPA: 0x%lx\n", gpa);
			continue;
		} else if (vcpu->run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
			bool skip = TRIGGER_ZERO_STEP_MITIGATION &&
				    (convert_request_cnt < ZERO_STEP_THRESHOLD - 1);

			convert_request_cnt++;

			printf("guest request conversion of gpa 0x%llx - 0x%llx to %s, skip=%d\n",
			       vcpu->run->memory_fault.gpa, vcpu->run->memory_fault.size,
			       (vcpu->run->memory_fault.flags == KVM_MEMORY_EXIT_FLAG_PRIVATE) ?
			       "private" : "shared", skip);

			if (skip)
				continue;

			handle_memory_conversion(
				vm, vcpu->run->memory_fault.gpa,
				vcpu->run->memory_fault.size,
				vcpu->run->memory_fault.flags == KVM_MEMORY_EXIT_FLAG_PRIVATE);
			continue;
		}
		printf("exit reason %d\n", vcpu->run->exit_reason);
		break;
	}
	kvm_vm_free(vm);
}
int main(int argc, char **argv)
{
	/* Disable stdout buffering */
	setbuf(stdout, NULL);

	if (!is_tdx_enabled()) {
		printf("TDX is not supported by KVM\n"
		       "Skipping the TDX tests.\n");
		return 0;
	}

	run_in_new_process(&zero_step_test);
}