Message-ID: <ZvPrqMj1BWrkkwqN@yzhao56-desk.sh.intel.com>
Date: Wed, 25 Sep 2024 18:53:28 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>, Rick P Edgecombe
<rick.p.edgecombe@...el.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
Yuan Yao <yuan.yao@...el.com>, Kai Huang <kai.huang@...el.com>,
"isaku.yamahata@...il.com" <isaku.yamahata@...il.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "dmatlack@...gle.com"
<dmatlack@...gle.com>, "nik.borisov@...e.com" <nik.borisov@...e.com>
Subject: Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY
with operand SEPT
On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
> On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> > On Fri, Sep 13, 2024, Yan Zhao wrote:
> > > This is a lock status report of the TDX module for the current SEAMCALL
> > > retry issue, based on the code in the TDX module public repo
> > > https://github.com/intel/tdx-module.git, branch TDX_1.5.05.
> > >
> > > TL;DR:
> > > - tdh_mem_track() can contend with tdh_vp_enter().
> > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> >
> > The zero-step logic seems to be the most problematic. E.g. if KVM is trying to
> > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> > trigger the zero-step mitigation if the vCPU that "wins" gets delayed for
> > whatever reason.
> >
> > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> > hits the fault?
> >
> > For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> > desirable because in many cases, the winning task will install a valid mapping
> > before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> > instruction is re-executed. In the happy case, that provides optimal performance
> > as KVM doesn't introduce any extra delay/latency.
> >
> > But for TDX, the math is different as the cost of re-hitting a fault is much,
> > much higher, especially in light of the zero-step issues.
> >
> > E.g. if the TDP MMU returns a unique error code for the frozen case, and
> > kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> > then the TDX EPT violation path can safely retry locally, similar to the do-while
> > loop in kvm_tdp_map_page().
> >
> > The only part I don't like about this idea is having two "retry" return values,
> > which creates the potential for bugs due to checking one but not the other.
> >
> > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> > to the TDX S-EPT fault handler that the SPTE is frozen. I think I like that
> > option better even though the out-param is a bit gross, because it makes it more
> > obvious that the "frozen_spte" is a special case that doesn't need attention for
> > most paths.
> Good idea.
> But could we extend it a bit more to allow TDX's EPT violation handler to also
> retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY?
I'm asking this because merely avoiding tdh_vp_enter() on vCPUs that see
FROZEN_SPTE might not be enough to prevent the zero-step mitigation.
E.g. in the selftest below, with a TD configured with pending_ve_disable=N,
the zero-step mitigation can be triggered on a vCPU that is stuck in EPT
violation VM exits more than 6 times (because user space does not do the
memslot conversion correctly).
So, if vCPU A wins the chance to call tdh_mem_page_aug(), the SEAMCALL may
contend with the zero-step mitigation code in tdh_vp_enter() on vCPU B,
which is stuck in EPT violation VM exits.
#include <stdint.h>
#include "kvm_util.h"
#include "processor.h"
#include "tdx/tdcall.h"
#include "tdx/tdx.h"
#include "tdx/tdx_util.h"
#include "tdx/test_util.h"
#include "test_util.h"
/*
* 0x80000000 is arbitrarily selected, but it should not overlap with selftest
* code or boot page.
*/
#define ZERO_STEP_TEST_AREA_GPA (0x80000000)
/* Test area GVA is arbitrarily selected */
#define ZERO_STEP_AREA_GVA_PRIVATE (0x90000000)
/* The test area is 2MB in size */
#define ZERO_STEP_AREA_SIZE (2 << 20)
#define ZERO_STEP_ASSERT(x)				\
	do {						\
		if (!(x))				\
			tdx_test_fatal(__LINE__);	\
	} while (0)
#define ZERO_STEP_ACCEPT_PRINT_PORT 0x87
#define ZERO_STEP_THRESHOLD 6
#define TRIGGER_ZERO_STEP_MITIGATION 1
static int convert_request_cnt;
static void guest_test_zero_step(void)
{
	void *test_area_gva_private = (void *)ZERO_STEP_AREA_GVA_PRIVATE;

	memset(test_area_gva_private, 1, 8);
	tdx_test_success();
}
static void guest_ve_handler(struct ex_regs *regs)
{
	uint64_t ret;
	struct ve_info ve;

	ret = tdg_vp_veinfo_get(&ve);
	ZERO_STEP_ASSERT(!ret);

	/* For this test, we will only handle EXIT_REASON_EPT_VIOLATION */
	ZERO_STEP_ASSERT(ve.exit_reason == EXIT_REASON_EPT_VIOLATION);

	tdx_test_send_64bit(ZERO_STEP_ACCEPT_PRINT_PORT, ve.gpa);

#define MEM_PAGE_ACCEPT_LEVEL_4K 0
#define MEM_PAGE_ACCEPT_LEVEL_2M 1
	ret = tdg_mem_page_accept(ve.gpa, MEM_PAGE_ACCEPT_LEVEL_4K);
	ZERO_STEP_ASSERT(!ret);
}
static void zero_step_test(void)
{
	struct kvm_vm *vm;
	struct kvm_vcpu *vcpu;
	void *guest_code;
	uint64_t test_area_npages;
	vm_vaddr_t test_area_gva_private;

	vm = td_create();
	td_initialize(vm, VM_MEM_SRC_ANONYMOUS, 0);
	guest_code = guest_test_zero_step;
	vcpu = td_vcpu_add(vm, 0, guest_code);
	vm_install_exception_handler(vm, VE_VECTOR, guest_ve_handler);

	test_area_npages = ZERO_STEP_AREA_SIZE / vm->page_size;
	vm_userspace_mem_region_add(vm,
				    VM_MEM_SRC_ANONYMOUS, ZERO_STEP_TEST_AREA_GPA,
				    3, test_area_npages, KVM_MEM_GUEST_MEMFD);
	vm->memslots[MEM_REGION_TEST_DATA] = 3;

	test_area_gva_private = ____vm_vaddr_alloc(
			vm, ZERO_STEP_AREA_SIZE, ZERO_STEP_AREA_GVA_PRIVATE,
			ZERO_STEP_TEST_AREA_GPA, MEM_REGION_TEST_DATA, true);
	TEST_ASSERT_EQ(test_area_gva_private, ZERO_STEP_AREA_GVA_PRIVATE);

	td_finalize(vm);

	handle_memory_conversion(vm, ZERO_STEP_TEST_AREA_GPA,
				 ZERO_STEP_AREA_SIZE, false);

	for (;;) {
		vcpu_run(vcpu);
		if (vcpu->run->exit_reason == KVM_EXIT_IO &&
		    vcpu->run->io.port == ZERO_STEP_ACCEPT_PRINT_PORT) {
			uint64_t gpa = tdx_test_read_64bit(
					vcpu, ZERO_STEP_ACCEPT_PRINT_PORT);

			printf("\t ... guest accepting 1 page at GPA: 0x%lx\n", gpa);
			continue;
		} else if (vcpu->run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
			bool skip = TRIGGER_ZERO_STEP_MITIGATION &&
				    (convert_request_cnt < ZERO_STEP_THRESHOLD - 1);

			convert_request_cnt++;

			printf("guest request conversion of gpa 0x%llx - 0x%llx to %s, skip=%d\n",
			       vcpu->run->memory_fault.gpa, vcpu->run->memory_fault.size,
			       (vcpu->run->memory_fault.flags == KVM_MEMORY_EXIT_FLAG_PRIVATE) ?
			       "private" : "shared", skip);

			if (skip)
				continue;

			handle_memory_conversion(
				vm, vcpu->run->memory_fault.gpa,
				vcpu->run->memory_fault.size,
				vcpu->run->memory_fault.flags == KVM_MEMORY_EXIT_FLAG_PRIVATE);
			continue;
		}
		printf("exit reason %d\n", vcpu->run->exit_reason);
		break;
	}
	kvm_vm_free(vm);
}
int main(int argc, char **argv)
{
	/* Disable stdout buffering */
	setbuf(stdout, NULL);

	if (!is_tdx_enabled()) {
		printf("TDX is not supported by KVM\n"
		       "Skipping the TDX tests.\n");
		return 0;
	}

	run_in_new_process(&zero_step_test);
}