Message-ID: <Zb1bUU4V6TYFZ972@google.com>
Date: Fri, 2 Feb 2024 13:14:57 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Shaoqin Huang <shahuang@...hat.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, Peter Xu <peterx@...hat.com>, 
	Shuah Khan <shuah@...nel.org>, kvm@...r.kernel.org, linux-kselftest@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3] KVM: selftests: Fix the dirty_log_test semaphore imbalance

On Fri, Feb 02, 2024, Sean Christopherson wrote:
> On Fri, Feb 02, 2024, Shaoqin Huang wrote:
> > ---
> > v2->v3:
> >   - Rebase to v6.8-rc2.
> >   - Use TEST_ASSERT().
> 
> Patch says otherwise.
> 
> > @@ -726,6 +728,11 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> >  		return;
> >  	}
> >  
> > +	sem_getvalue(&sem_vcpu_stop, &sem_val);
> > +	assert(sem_val == 0);
> > +	sem_getvalue(&sem_vcpu_cont, &sem_val);
> > +	assert(sem_val == 0);
> > +
> >  	/*
> >  	 * We reserve page table for 2 times of extra dirty mem which
> >  	 * will definitely cover the original (1G+) test range.  Here
> > @@ -825,6 +832,13 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> >  		sync_global_to_guest(vm, iteration);
> >  	}
> >  
> > +	/*
> > +	 *
> > +	 * Before we set host_quit, give the vCPU time to run, so that we
> > +	 * consume sem_vcpu_stop and the vCPU consumes sem_vcpu_cont,
> > +	 * keeping the semaphores balanced.
> > +	 */
> > +	usleep(p->interval * 1000);
> 
> Sorry, I wasn't as explicit as I should have been.  When I said I don't have a
> better solution, I did not mean to imply that I am ok with busy waiting as a
> hack-around.
> 
> Against my better judgment, I spent half an hour slogging through this test to
> figure out what's going wrong.  IIUC, the problem is essentially that the test
> instructs the vCPU worker to continue _after_ the last iteration, and _then_ sets
> host_quit, which results in the vCPU running one extra (unvalidated) iteration.
> 
> For the other modes, which stop if and only if vcpu_sync_stop_requested is set,
> the extra iteration is a non-issue.  But because the dirty ring variant stops
> after every exit (to purge the ring), it hangs without an extra "continue".
> 
> So rather than blindly fire off an extra sem_vcpu_cont that may or may not be
> consumed, I believe the test can simply set host_quit _before_ the final "continue"
> so that the vCPU worker doesn't run an extra iteration.
> 
> I ran the below with 1000 loops of "for (i = 0; i < LOG_MODE_NUM; i++)" and saw
> no issues.  I didn't see the assert you hit, but without the fix, I did see this
> fire within a few loops (fewer than 5, I think):
> 
> 	assert(host_log_mode == LOG_MODE_DIRTY_RING ||
> 	       atomic_read(&vcpu_sync_stop_requested) == false);
> 
> I'll post this as two patches: one to fix the bug, and a second to have the
> LOG_MODE_DIRTY_RING variant clear vcpu_sync_stop_requested so that the above
> assert() can be modified as below.
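
For reference, a rough sketch of the reordering described in the quoted text
above (illustrative only; the surrounding loop details and names below are
assumed, this is not the actual patch):

	/*
	 * Illustrative sketch, assuming the vCPU worker loops on
	 * "while (!READ_ONCE(host_quit))" and blocks on sem_vcpu_cont after
	 * each stop: flag host_quit before posting the final "continue" so
	 * the worker exits on its last wakeup instead of running one more,
	 * unvalidated iteration.
	 */
	WRITE_ONCE(host_quit, true);
	sem_post(&sem_vcpu_cont);

	/* Reap the worker; the thread variable name is assumed. */
	pthread_join(vcpu_thread, NULL);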

Scratch patch 2, I'm pretty sure the vCPU worker can race with the main task and
clear vcpu_sync_stop_requested just before the main task sets it, which would
result in a false positive.  I didn't see any failures, but I'm 99% certain the
race exists.  I suspect there are some other warts in the test that would
complicate attempts to clean things up, so for now I'll just post the fix for
the imbalance bug.
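
For posterity, one way to read the suspected race, reconstructed purely from
the description above and assuming the tightened assert from the scrapped
patch 2 (speculative; the actual code paths may differ):

	/*
	 * main task                            vCPU worker
	 * ---------                            -----------
	 *                                      clears vcpu_sync_stop_requested
	 *                                      (for the stop it just finished)
	 * sets vcpu_sync_stop_requested
	 * ...                                  (the new request is never cleared)
	 * assert(vcpu_sync_stop_requested == false)  <- fires with nothing wrong
	 */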
