Message-ID: <Z8GWHkpSt+zPf+SQ@yzhao56-desk.sh.intel.com>
Date: Fri, 28 Feb 2025 18:55:26 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: <pbonzini@...hat.com>, <rick.p.edgecombe@...el.com>,
<linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>
Subject: Re: [PATCH] KVM: selftests: Wait mprotect_ro_done before write to RO
in mmu_stress_test
On Thu, Feb 27, 2025 at 02:18:02PM -0800, Sean Christopherson wrote:
> On Thu, Feb 27, 2025, Yan Zhao wrote:
> > On Wed, Feb 26, 2025 at 11:30:15AM -0800, Sean Christopherson wrote:
> > > On Wed, Feb 26, 2025, Yan Zhao wrote:
> > > > On Tue, Feb 25, 2025 at 05:48:39PM -0800, Sean Christopherson wrote:
> > > > > On Sat, Feb 08, 2025, Yan Zhao wrote:
> > > > > > The test then fails and reports "Unhandled exception '0xe' at guest RIP
> > > > > > '0x402638'", since the next valid guest rip address is 0x402639, i.e. the
> > > > > > "(mem) = val" in vcpu_arch_put_guest() is compiled into a mov instruction
> > > > > > of length 4.
> > > > >
> > > > > This shouldn't happen. On x86, stage 3 is a hand-coded "mov %rax, (%rax)", not
> > > > > vcpu_arch_put_guest(). Either something else is going on, or __x86_64__ isn't
> > > > > defined?
> > > > stage 3 is hand-coded "mov %rax, (%rax)", but stage 4 is with
> > > > vcpu_arch_put_guest().
> > > >
> > > > The original code expects that "mov %rax, (%rax)" in stage 3 can produce
> > > > -EFAULT, so that the host thread can jump out of stage 3's 1st vcpu_run()
> > > > loop.
> > >
> > > Ugh, I forgot that there are two loops in stage-3. I tried to prevent this race,
> > > but violated my own rule of not using arbitrary delays to avoid races.
> > >
> > > Completely untested, but I think this should address the problem (I'll test
> > > later today; you already did the hard work of debugging). The only thing I'm
> > > not positive is correct is making the first _vcpu_run() a one-off instead of a
> > > loop.
> > Right, making the first _vcpu_run() a one-off could produce below error:
> > "Expected EFAULT on write to RO memory, got r = 0, errno = 4".
>
> /facepalm
>
> There are multiple vCPUs; using a single flag isn't sufficient. I also remembered
> (well, re-discovered) why I added the weird looping on "!":
>
> do {
> r = _vcpu_run(vcpu);
> } while (!r);
>
> On x86, with forced emulation, the vcpu_arch_put_guest() path hits an MMIO exit
> due to a longstanding (like, forever longstanding) bug in KVM's emulator. Given
> that the vcpu_arch_put_guest() path is only reachable by disabling the x86 specific
> code (which I did for testing those paths), and that the bug only manifests on x86,
> I think it makes sense to drop that code as it's super confusing, gets in the way,
> and is unreachable unless the user is going way out of their way to hit it.
Thanks for this background.
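For reference, the retry loop in question (as it appears before your fix) is
roughly:

	do {
		r = _vcpu_run(vcpu);	/* r == 0: clean exit, rerun the vCPU */
	} while (!r);
	TEST_ASSERT(r == -1 && errno == EFAULT,
		    "Expected EFAULT on write to RO memory, got r = %d, errno = %d", r, errno);

i.e. a clean exit (r == 0), such as the MMIO exit you mention under forced
emulation, just reruns the vCPU, and only a faulting exit reaches the EFAULT
assertion.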
> I still haven't reproduced the failure without "help", but I was able to force
> failure by doing a single write and dropping the mprotect_ro_done check:
>
> diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
> index a1f3f6d83134..3524dcc0dfcf 100644
> --- a/tools/testing/selftests/kvm/mmu_stress_test.c
> +++ b/tools/testing/selftests/kvm/mmu_stress_test.c
> @@ -50,15 +50,15 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
> */
> GUEST_ASSERT(!READ_ONCE(all_vcpus_hit_ro_fault));
> do {
> - for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
> + // for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
> #ifdef __x86_64__
> - asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
> + asm volatile(".byte 0x48,0x89,0x00" :: "a"(end_gpa - stride) : "memory"); /* mov %rax, (%rax) */
> #elif defined(__aarch64__)
> asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
> #else
> vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
> #endif
> - } while (!READ_ONCE(mprotect_ro_done) && !READ_ONCE(all_vcpus_hit_ro_fault));
> + } while (!READ_ONCE(all_vcpus_hit_ro_fault));
>
> /*
> * Only architectures that write the entire range can explicitly sync,
>
> The below makes everything happy, can you verify the fix on your end?
This fix makes the issue disappear on my end. However, the issue is also not
reproducible with just the following change...
diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index d9c76b4c0d88..e664713d2a2c 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -18,6 +18,7 @@
#include "ucall_common.h"
static bool mprotect_ro_done;
+static bool all_vcpus_hit_ro_fault;
static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
{
@@ -34,6 +35,7 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
*((volatile uint64_t *)gpa);
GUEST_SYNC(2);
+ GUEST_ASSERT(!READ_ONCE(all_vcpus_hit_ro_fault));
/*
* Write to the region while mprotect(PROT_READ) is underway. Keep
* looping until the memory is guaranteed to be read-only, otherwise
I think it's due to the extra delay (the assert) in the guest, as I previously
mentioned at
https://lore.kernel.org/kvm/Z6xGwnFR9cFg%2FTOL@yzhao56-desk.sh.intel.com .
If I apply your fix with the guest delay dropped, the issue re-appears.
diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index a1f3f6d83134..f87fd40dbed3 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -48,7 +48,6 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
* is low in this case). For x86, hand-code the exact opcode so that
* there is no room for variability in the generated instruction.
*/
- GUEST_ASSERT(!READ_ONCE(all_vcpus_hit_ro_fault));
do {
for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
> ---
> tools/testing/selftests/kvm/mmu_stress_test.c | 22 ++++++++++++-------
> 1 file changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
> index d9c76b4c0d88..a1f3f6d83134 100644
> --- a/tools/testing/selftests/kvm/mmu_stress_test.c
> +++ b/tools/testing/selftests/kvm/mmu_stress_test.c
> @@ -18,6 +18,7 @@
> #include "ucall_common.h"
>
> static bool mprotect_ro_done;
> +static bool all_vcpus_hit_ro_fault;
>
> static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
> {
> @@ -36,9 +37,9 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
>
> /*
> * Write to the region while mprotect(PROT_READ) is underway. Keep
> - * looping until the memory is guaranteed to be read-only, otherwise
> - * vCPUs may complete their writes and advance to the next stage
> - * prematurely.
> + * looping until the memory is guaranteed to be read-only and a fault
> + * has occured, otherwise vCPUs may complete their writes and advance
> + * to the next stage prematurely.
> *
> * For architectures that support skipping the faulting instruction,
> * generate the store via inline assembly to ensure the exact length
> @@ -47,6 +48,7 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
> * is low in this case). For x86, hand-code the exact opcode so that
> * there is no room for variability in the generated instruction.
> */
> + GUEST_ASSERT(!READ_ONCE(all_vcpus_hit_ro_fault));
> do {
> for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
> #ifdef __x86_64__
> @@ -56,7 +58,7 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
> #else
> vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
> #endif
> - } while (!READ_ONCE(mprotect_ro_done));
> + } while (!READ_ONCE(mprotect_ro_done) && !READ_ONCE(all_vcpus_hit_ro_fault));
This does not look correct.
The while loop stops when

mprotect_ro_done | all_vcpus_hit_ro_fault
-----------------|----------------------
true             | false  ==> produces "Expected EFAULT on write to RO memory"
true             | true
false            | true   (invalid case)
So, I think the right one is:
- } while (!READ_ONCE(mprotect_ro_done));
+ } while (!READ_ONCE(mprotect_ro_done) || !READ_ONCE(all_vcpus_hit_ro_fault));
Then the while loop stops only when

mprotect_ro_done | all_vcpus_hit_ro_fault
-----------------|----------------------
true             | true
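In C terms, a sketch of the intended exit logic (using the names from your
patch):

	/*
	 * Keep writing until BOTH flags are set, i.e. loop while
	 * !(mprotect_ro_done && all_vcpus_hit_ro_fault), which is
	 * !mprotect_ro_done || !all_vcpus_hit_ro_fault.  The "&&" form
	 * instead exits as soon as either flag is set, which is the
	 * (true, false) row above.
	 */
	do {
		/* ... the hand-coded store(s) to the region ... */
	} while (!READ_ONCE(mprotect_ro_done) ||
		 !READ_ONCE(all_vcpus_hit_ro_fault));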
> /*
> * Only architectures that write the entire range can explicitly sync,
> @@ -81,6 +83,7 @@ struct vcpu_info {
>
> static int nr_vcpus;
> static atomic_t rendezvous;
> +static atomic_t nr_ro_faults;
>
> static void rendezvous_with_boss(void)
> {
> @@ -148,12 +151,16 @@ static void *vcpu_worker(void *data)
> * be stuck on the faulting instruction for other architectures. Go to
> * stage 3 without a rendezvous
> */
> - do {
> - r = _vcpu_run(vcpu);
> - } while (!r);
> + r = _vcpu_run(vcpu);
> TEST_ASSERT(r == -1 && errno == EFAULT,
> "Expected EFAULT on write to RO memory, got r = %d, errno = %d", r, errno);
>
> + atomic_inc(&nr_ro_faults);
> + if (atomic_read(&nr_ro_faults) == nr_vcpus) {
> + WRITE_ONCE(all_vcpus_hit_ro_fault, true);
> + sync_global_to_guest(vm, all_vcpus_hit_ro_fault);
> + }
> +
> #if defined(__x86_64__) || defined(__aarch64__)
> /*
> * Verify *all* writes from the guest hit EFAULT due to the VMA now
> @@ -378,7 +385,6 @@ int main(int argc, char *argv[])
> rendezvous_with_vcpus(&time_run2, "run 2");
>
> mprotect(mem, slot_size, PROT_READ);
> - usleep(10);
> mprotect_ro_done = true;
> sync_global_to_guest(vm, mprotect_ro_done);
>
>
> base-commit: 557953f8b75fce49dc65f9b0f7e811c060fc7860
> --
So, with the below change based on your fix above, the pass rate on my end is
100%. (With only the first hunk below, the pass rate is 10%.)
diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index a1f3f6d83134..1c65c9c3f41f 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -48,7 +48,6 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
* is low in this case). For x86, hand-code the exact opcode so that
* there is no room for variability in the generated instruction.
*/
- GUEST_ASSERT(!READ_ONCE(all_vcpus_hit_ro_fault));
do {
for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
#ifdef __x86_64__
@@ -58,7 +57,7 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
#else
vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
#endif
- } while (!READ_ONCE(mprotect_ro_done) && !READ_ONCE(all_vcpus_hit_ro_fault));
+ } while (!READ_ONCE(mprotect_ro_done) || !READ_ONCE(all_vcpus_hit_ro_fault));
/*
* Only architectures that write the entire range can explicitly sync,
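Putting the two hunks together, the guest-side write loop ends up roughly as
below (a sketch assembled from the diffs above, not a verbatim copy of the
resulting file):

	do {
		for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
#ifdef __x86_64__
			asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
#elif defined(__aarch64__)
			asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
#else
			vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
#endif
	} while (!READ_ONCE(mprotect_ro_done) || !READ_ONCE(all_vcpus_hit_ro_fault));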