linux-kernel - Re: [PATCH] KVM: selftests: Wait mprotect_ro_done before write to RO in mmu_stress

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z76FxYfZlhDG/J3s@yzhao56-desk.sh.intel.com>
Date: Wed, 26 Feb 2025 11:08:53 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: <pbonzini@...hat.com>, <rick.p.edgecombe@...el.com>,
	<linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>
Subject: Re: [PATCH] KVM: selftests: Wait mprotect_ro_done before write to RO
 in mmu_stress_test

On Tue, Feb 25, 2025 at 05:48:39PM -0800, Sean Christopherson wrote:
> On Sat, Feb 08, 2025, Yan Zhao wrote:
> > In the read-only mprotect() phase of mmu_stress_test, ensure that
> > mprotect(PROT_READ) has completed before the guest starts writing to the
> > read-only mprotect() memory.
> > 
> > Without waiting for mprotect_ro_done before the guest starts writing in
> > stage 3 (the stage for read-only mprotect()), the host's assertion of stage
> > 3 could fail if mprotect_ro_done is set to true in the window between the
> > guest finishing writes to all GPAs and executing GUEST_SYNC(3).
> > 
> > This scenario is easy to occur especially when there are hundred of vCPUs.
> > 
> > CPU 0                  CPU 1 guest     CPU 1 host
> >                                        enter stage 3's 1st loop
> >                        //in stage 3
> >                        write all GPAs
> >                        @rip 0x4025f0
> > 
> > mprotect(PROT_READ)
> > mprotect_ro_done=true
> >                        GUEST_SYNC(3)
> >                                        r=0, continue stage 3's 1st loop
> > 
> >                        //in stage 4
> >                        write GPA
> >                        @rip 0x402635
> > 
> >                                        -EFAULT, jump out stage 3's 1st loop
> >                                        enter stage 3's 2nd loop
> >                        write GPA
> >                        @rip 0x402635
> >                                        -EFAULT, continue stage 3's 2nd loop
> >                                        guest rip += 3
> > 
> > The test then fails and reports "Unhandled exception '0xe' at guest RIP
> > '0x402638'", since the next valid guest rip address is 0x402639, i.e. the
> > "(mem) = val" in vcpu_arch_put_guest() is compiled into a mov instruction
> > of length 4.
> 
> This shouldn't happen.  On x86, stage 3 is a hand-coded "mov %rax, (%rax)", not
> vcpu_arch_put_guest().  Either something else is going on, or __x86_64__ isn't
> defined?
stage 3 is hand-coded "mov %rax, (%rax)", but stage 4 is with
vcpu_arch_put_guest().

The original code expects that "mov %rax, (%rax)" in stage 3 can produce
-EFAULT, so that in the host thread can jump out of stage 3's 1st vcpu_run()
loop.

	/*
         * Stage 3, write guest memory and verify KVM returns -EFAULT for once
         * the mprotect(PROT_READ) lands.  Only architectures that support
         * validating *all* of guest memory sync for this stage, as vCPUs will
         * be stuck on the faulting instruction for other architectures.  Go to
         * stage 3 without a rendezvous
         */
        do {
                r = _vcpu_run(vcpu);
        } while (!r);
        TEST_ASSERT(r == -1 && errno == EFAULT,
                    "Expected EFAULT on write to RO memory, got r = %d, errno = %d", r, errno);


Then, in host stage 3's 2st vcpu_run() loop, rip += 3 is performed to skip
"mov %rax, (%rax)" in guest.


        for (;;) {                                                                             
                r = _vcpu_run(vcpu);                                                           
                if (!r)                                                                        
                        break;                                                   
                TEST_ASSERT_EQ(errno, EFAULT);                                   
#if defined(__x86_64__)                                                          
                WRITE_ONCE(vcpu->run->kvm_dirty_regs, KVM_SYNC_X86_REGS);        
                vcpu->run->s.regs.regs.rip += 3;                                 
#elif defined(__aarch64__)                                                       
                vcpu_set_reg(vcpu, ARM64_CORE_REG(regs.pc),                      
                             vcpu_get_reg(vcpu, ARM64_CORE_REG(regs.pc)) + 4);   
#endif                                                                           
                                                                                 
        } 

Finally, guest can call GUEST_SYNC(3) to let host jump out of the 2nd vcpu_run()
loop and host assert_sync_stage(vcpu, 3).


However, there're chances that all "mov %rax, (%rax)" in stage 3 does not cause
any -EFAULT. The guest then leaves stage 3 after finding mprotect_ro_done=true.

GUEST_SYNC(3) causes r=0, so host continues stage 3's first vcpu_run() loop.

Then mprotect(PROT_READ) takes effect after the guest enters stage 4.

vcpu_arch_put_guest() in guest stage 4 produces -EFAULT and host jumps out of
stage 3's first vcpu_run() loop.
The rip+=3 in host stage 3's second vcpu_run() loop does not match
vcpu_arch_put_guest(), producing the "Unhandled exception '0xe'" error.


> 
> 	do {
> 		for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
> #ifdef __x86_64__
> 			asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
> #elif defined(__aarch64__)
> 			asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
> #else
> 			vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
> #endif
> 	} while (!READ_ONCE(mprotect_ro_done));
> 
> 	/*
> 	 * Only architectures that write the entire range can explicitly sync,
> 	 * as other architectures will be stuck on the write fault.
> 	 */
> #if defined(__x86_64__) || defined(__aarch64__)
> 	GUEST_SYNC(3);
> #endif
> 
> 	for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
> 		vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
> 	GUEST_SYNC(4);
> 
> 
> 
> > Even if it could be compiled into a mov instruction of length 3, the
> > following execution of GUEST_SYNC(4) in guest could still cause the host
> > failure of the assertion of stage 3.
> 
> Sorry, I don't follow.  The guest doesn't get "released" from GUEST_SYNC(3) until
> the host runs the vCPU again, and that happens after asserting on stage 3 and
> synchronizing with the main thread.


//guest stage 3 
do {
    for (...)  
    	mov %rax, (%rax)  3.1 ==> host mprotect(PROT_READ) does not yet
	                          take effect, mprotect_ro_done=false

} while (!READ_ONCE(mprotect_ro_done)); ==> 3.2 host mprotect(PROT_READ) takes
                                            effect here. mprotect_ro_done=true

GUEST_SYNC(3);                ==> host stage 3's vcpu_run() returns 0. so host
                                  is still in stage 3's first vcpu_run() loop


//guest stage 4. with host mprotect(PROT_READ) in effect, vcpu_arch_put_guest()
  will cause -EFAULT. host enters host stage 3's second vcpu_run() loop.

for (...)
    vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa); ==> wrong for rip += 3

GUEST_SYNC(4); ==> since host still in stage 3, even if we change
                   vcpu_arch_put_guest() in guest stage 4 to "mov %rax, (%rax)",
		   this cannot pass host's assert_sync_stage(vcpu, 3);

> 
> 	assert_sync_stage(vcpu, 3);
> #endif /* __x86_64__ || __aarch64__ */
> 	rendezvous_with_boss();
> 
> 	/*
> 	 * Stage 4.  Run to completion, waiting for mprotect(PROT_WRITE) to
> 	 * make the memory writable again.
> 	 */
> 	do {
> 		r = _vcpu_run(vcpu);
> 	} while (r && errno == EFAULT);