Message-ID: <Z76FxYfZlhDG/J3s@yzhao56-desk.sh.intel.com>
Date: Wed, 26 Feb 2025 11:08:53 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: <pbonzini@...hat.com>, <rick.p.edgecombe@...el.com>,
<linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>
Subject: Re: [PATCH] KVM: selftests: Wait mprotect_ro_done before write to RO
in mmu_stress_test
On Tue, Feb 25, 2025 at 05:48:39PM -0800, Sean Christopherson wrote:
> On Sat, Feb 08, 2025, Yan Zhao wrote:
> > In the read-only mprotect() phase of mmu_stress_test, ensure that
> > mprotect(PROT_READ) has completed before the guest starts writing to the
> > read-only mprotect() memory.
> >
> > Without waiting for mprotect_ro_done before the guest starts writing in
> > stage 3 (the stage for read-only mprotect()), the host's assertion of stage
> > 3 could fail if mprotect_ro_done is set to true in the window between the
> > guest finishing writes to all GPAs and executing GUEST_SYNC(3).
> >
> > This scenario occurs easily, especially when there are hundreds of vCPUs.
> >
> >  CPU 0                  CPU 1 guest                CPU 1 host
> >                                                    enter stage 3's 1st loop
> >                         //in stage 3
> >                         write all GPAs
> >                         @rip 0x4025f0
> >
> >  mprotect(PROT_READ)
> >  mprotect_ro_done=true
> >                         GUEST_SYNC(3)
> >                                                    r=0, continue stage 3's 1st loop
> >
> >                         //in stage 4
> >                         write GPA
> >                         @rip 0x402635
> >
> >                                                    -EFAULT, jump out stage 3's 1st loop
> >                                                    enter stage 3's 2nd loop
> >                         write GPA
> >                         @rip 0x402635
> >                                                    -EFAULT, continue stage 3's 2nd loop
> >                                                    guest rip += 3
> >
> > The test then fails and reports "Unhandled exception '0xe' at guest RIP
> > '0x402638'", since the next valid guest rip address is 0x402639, i.e. the
> > "(mem) = val" in vcpu_arch_put_guest() is compiled into a mov instruction
> > of length 4.
>
> This shouldn't happen. On x86, stage 3 is a hand-coded "mov %rax, (%rax)", not
> vcpu_arch_put_guest(). Either something else is going on, or __x86_64__ isn't
> defined?
Stage 3 is the hand-coded "mov %rax, (%rax)", but stage 4 uses
vcpu_arch_put_guest().

The original code expects the "mov %rax, (%rax)" in stage 3 to produce
-EFAULT, so that the host thread can jump out of stage 3's first vcpu_run()
loop.
	/*
	 * Stage 3, write guest memory and verify KVM returns -EFAULT for once
	 * the mprotect(PROT_READ) lands. Only architectures that support
	 * validating *all* of guest memory sync for this stage, as vCPUs will
	 * be stuck on the faulting instruction for other architectures. Go to
	 * stage 3 without a rendezvous
	 */
	do {
		r = _vcpu_run(vcpu);
	} while (!r);

	TEST_ASSERT(r == -1 && errno == EFAULT,
		    "Expected EFAULT on write to RO memory, got r = %d, errno = %d", r, errno);
Then, in host stage 3's 2nd vcpu_run() loop, rip += 3 is performed to skip
"mov %rax, (%rax)" in the guest.
	for (;;) {
		r = _vcpu_run(vcpu);
		if (!r)
			break;
		TEST_ASSERT_EQ(errno, EFAULT);
#if defined(__x86_64__)
		WRITE_ONCE(vcpu->run->kvm_dirty_regs, KVM_SYNC_X86_REGS);
		vcpu->run->s.regs.regs.rip += 3;
#elif defined(__aarch64__)
		vcpu_set_reg(vcpu, ARM64_CORE_REG(regs.pc),
			     vcpu_get_reg(vcpu, ARM64_CORE_REG(regs.pc)) + 4);
#endif
	}
Finally, the guest calls GUEST_SYNC(3) to let the host jump out of the 2nd
vcpu_run() loop and run assert_sync_stage(vcpu, 3).

However, there is a chance that none of the "mov %rax, (%rax)" writes in
stage 3 causes an -EFAULT. The guest then leaves stage 3 after finding
mprotect_ro_done=true. GUEST_SYNC(3) causes r=0, so the host stays in stage
3's first vcpu_run() loop. mprotect(PROT_READ) then takes effect after the
guest enters stage 4. vcpu_arch_put_guest() in guest stage 4 produces
-EFAULT, and the host jumps out of stage 3's first vcpu_run() loop. The
rip += 3 fixup in host stage 3's second vcpu_run() loop does not match the
length of the store emitted by vcpu_arch_put_guest(), producing the
"Unhandled exception '0xe'" error.
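
For reference, a minimal sketch of the idea in the subject line (my
paraphrase for illustration, x86 path only, not the literal patch): have the
guest wait for mprotect_ro_done before writing, so that the write which
-EFAULTs is always the hand-coded 3-byte store the host knows how to skip.

	/*
	 * Illustrative sketch only: spin until the main thread signals that
	 * mprotect(PROT_READ) has completed, then write the (now RO) range.
	 * Every write is then guaranteed to -EFAULT on the 3-byte store that
	 * the host's rip += 3 fixup expects.
	 */
	while (!READ_ONCE(mprotect_ro_done))
		;

	for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
		asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */

	GUEST_SYNC(3);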
>
> 	do {
> 		for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
> #ifdef __x86_64__
> 			asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
> #elif defined(__aarch64__)
> 			asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
> #else
> 			vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
> #endif
> 	} while (!READ_ONCE(mprotect_ro_done));
>
> 	/*
> 	 * Only architectures that write the entire range can explicitly sync,
> 	 * as other architectures will be stuck on the write fault.
> 	 */
> #if defined(__x86_64__) || defined(__aarch64__)
> 	GUEST_SYNC(3);
> #endif
> 
> 	for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
> 		vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
> 	GUEST_SYNC(4);
>
>
>
> > Even if it were compiled into a mov instruction of length 3, the
> > subsequent execution of GUEST_SYNC(4) in the guest could still cause the
> > host's assertion of stage 3 to fail.
>
> Sorry, I don't follow. The guest doesn't get "released" from GUEST_SYNC(3) until
> the host runs the vCPU again, and that happens after asserting on stage 3 and
> synchronizing with the main thread.
//guest stage 3
do {
	for (...)
		mov %rax, (%rax)                ==> 3.1 host mprotect(PROT_READ) does not yet
		                                        take effect, mprotect_ro_done=false
} while (!READ_ONCE(mprotect_ro_done));         ==> 3.2 host mprotect(PROT_READ) takes
                                                        effect here, mprotect_ro_done=true

GUEST_SYNC(3);                                  ==> host stage 3's vcpu_run() returns 0, so the
                                                        host is still in stage 3's first
                                                        vcpu_run() loop

//guest stage 4. With host mprotect(PROT_READ) in effect, vcpu_arch_put_guest()
//will cause -EFAULT. The host enters host stage 3's second vcpu_run() loop.
for (...)
	vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);  ==> wrong for rip += 3

GUEST_SYNC(4);                                  ==> since the host is still in stage 3, even if
                                                        we change vcpu_arch_put_guest() in guest
                                                        stage 4 to "mov %rax, (%rax)", this
                                                        cannot pass the host's
                                                        assert_sync_stage(vcpu, 3);
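
To make the length mismatch concrete, here are hypothetical encodings (for
illustration only; the actual instruction the compiler emits for
"(mem) = val" may differ):

	48 89 00       mov %rax,(%rax)        /* hand-coded 3-byte store; rip += 3 skips it */
	48 89 3c 08    mov %rdi,(%rax,%rcx)   /* one possible 4-byte store for "(mem) = val";
	                                          rip += 3 then lands mid-instruction, e.g.
	                                          0x402638 instead of 0x402639 */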
>
> 	assert_sync_stage(vcpu, 3);
> #endif /* __x86_64__ || __aarch64__ */
> 	rendezvous_with_boss();
> 
> 	/*
> 	 * Stage 4. Run to completion, waiting for mprotect(PROT_WRITE) to
> 	 * make the memory writable again.
> 	 */
> 	do {
> 		r = _vcpu_run(vcpu);
> 	} while (r && errno == EFAULT);