[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Zb1bUU4V6TYFZ972@google.com>
Date: Fri, 2 Feb 2024 13:14:57 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Shaoqin Huang <shahuang@...hat.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, Peter Xu <peterx@...hat.com>,
Shuah Khan <shuah@...nel.org>, kvm@...r.kernel.org, linux-kselftest@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3] KVM: selftests: Fix the dirty_log_test semaphore imbalance
On Fri, Feb 02, 2024, Sean Christopherson wrote:
> On Fri, Feb 02, 2024, Shaoqin Huang wrote:
> > ---
> > v2->v3:
> > - Rebase to v6.8-rc2.
> > - Use TEST_ASSERT().
>
> Patch says otherwise.
>
> > @@ -726,6 +728,11 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> > return;
> > }
> >
> > + sem_getvalue(&sem_vcpu_stop, &sem_val);
> > + assert(sem_val == 0);
> > + sem_getvalue(&sem_vcpu_cont, &sem_val);
> > + assert(sem_val == 0);
> > +
> > /*
> > * We reserve page table for 2 times of extra dirty mem which
> > * will definitely cover the original (1G+) test range. Here
> > @@ -825,6 +832,13 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> > sync_global_to_guest(vm, iteration);
> > }
> >
> > + /*
> > + *
> > + * Before we set the host_quit, let the vcpu has time to run, to make
> > + * sure we consume the sem_vcpu_stop and the vcpu consume the
> > + * sem_vcpu_cont, to keep the semaphore balance.
> > + */
> > + usleep(p->interval * 1000);
>
> Sorry, I wasn't as explicit as I should have been. When I said I don't have a
> better solution, I did not mean to imply that I am ok with busy waiting as a
> hack-a-around.
>
> Against my better judgment, I spent half an hour slogging through this test to
> figure out what's going wrong. IIUC, the problem is essentially that the test
> instructs the vCPU worker to continue _after_ the last iteration, and _then_ sets
> host_quit, which results in the vCPU running one extra (unvalidated) iteration.
>
> For the other modes, which stop if and only if vcpu_sync_stop_requested is set,
> the extra iteration is a non-issue. But because the dirty ring variant stops
> after every exit (to purge the ring), it hangs without an extra "continue".
>
> So rather than blindly fire off an extra sem_vcpu_cont that may or may not be
> consumed, I believe the test can simply set host_quit _before_ the final "continue"
> so that the vCPU worker doesn't run an extra iteration.
>
> I ran the below with 1000 loops of "for (i = 0; i < LOG_MODE_NUM; i++)" and so
> no issues. I didn't see the assert you hit, but without the fix, I did see this
> fire within a few loops (less than 5 I think)l
>
> assert(host_log_mode == LOG_MODE_DIRTY_RING ||
> atomic_read(&vcpu_sync_stop_requested) == false);
>
> I'll post this as two patches: one to fix the bug, and a second to have the
> LOG_MODE_DIRTY_RING variant clear vcpu_sync_stop_requested so that the above
> assert() can be modified as below.
Scratch patch 2, I'm pretty sure the vCPU worker can race with the main task and
clear vcpu_sync_stop_requested just before the main task sets it, which would
result in a false positive. I didn't see any failures, but I'm 99% certain the
race exists. I suspect there are some other warts in the test that would
complicate attempts to clean things up, so for now I'l just post the fix for
the imbalance bug.
Powered by blists - more mailing lists