Date:	Mon, 03 Jan 2011 14:06:36 -0500
From:	Steven Rostedt <rostedt@...dmis.org>
To:	Gregory Haskins <ghaskins@...ell.com>
Cc:	Lai Jiangshan <laijs@...fujitsu.com>, Ingo Molnar <mingo@...e.hu>,
	Peter Zijlstra <peterz@...radead.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Peter Morreale <PMorreale@...ell.com>,
	linux-kernel@...r.kernel.org
Subject: Re: [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup

On Tue, 2010-12-28 at 07:06 -0700, Gregory Haskins wrote:
> >>> On 12/23/2010 at 11:54 PM, in message

> > Sure, but as I said, it is mostly broken anyway. I could even insert
> > some tracepoints to show that this is always missed (heck I'll add an
> > unlikely and do the branch profiler ;-)
> 
> Well, I think that would be a good datapoint, and it is one of the things I'd like to see.

OK, here it is, after running a simple "dbench 10":

 correct incorrect  %        Function                  File              Line
 ------- ---------  -        --------                  ----              ----
     150   726979  99 wakeup_next_waiter             rtmutex.c            581

And the patch I added:

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 318d7ed..dacdcb6 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -578,7 +578,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 		smp_mb();
 
 		/* If !RUNNING && !RUNNING_MUTEX */
-		if (pendowner->state & ~TASK_RUNNING_MUTEX)
+		if (unlikely(pendowner->state & ~TASK_RUNNING_MUTEX))
 			wake_up_process_mutex(pendowner);
 	}
 

Interesting that we hit 150 cases where the new owner was already awake.
Perhaps that was because the new owner was about to spin on a spinlock
after it had put itself into the TASK_INTERRUPTIBLE state, and it was
woken up by someone else?


> 
> > 
> > The reason is that adaptive spinners spin in some other state than
> > TASK_RUNNING, thus it does not help adaptive spinners at all. I first
> > tried to fix that, but it made dbench run even slower.
> 
> This is why I am skeptical.  You are essentially asserting there are two issues here, IIUC:
> 
> 1) The intent of avoiding a wakeup is broken and we take the double whammy of a mb()
> plus the wakeup() anyway.

Yep, this is what I believe is happening.

> 
> 2) mb() is apparently slower than wakeup().

Forget this point, as #1 seems to be the main issue. Also, I'm not sure
it is safe to "fix" this, as I found when I started to try.


> 
> I agree (1) is plausible, though I would like to see the traces to confirm.  It's been a long time
> since I looked at that code, but I think the original code either ran in RUNNING_MUTEX and was
> inadvertently broken in the meantime, or the other cpu would have transitioned to RUNNING on
> its own when we flipped the owner before the release-side check was performed.  Or perhaps
> we just plain screwed this up and it was racy ;)  I'm not sure.  But as Peter (M) stated, it seems
> like a shame to walk away from the concept without further investigation.  I think everyone can
> agree that at the very least, if it is in fact taking a double whammy, we should fix that.

[ cut out all the #2 discussion since I'm not worried about that now,
  even if it is not a problem. ]


Now, the way I was going to fix this is to have the adaptive wait be in
TASK_RUNNING state, and then change the state iff we are going to sleep.

But this can be a bit more racy, especially with Lai's new changes. With
the new changes, the waiter->task does not get nulled anymore.

The test to take the lock now needs to be done under the lock->wait_lock
spinlock. We have to grab that lock and check that the owner is NULL and
that we are the next waiter in the queue. Thus, on taking the lock we
need something like this:


	if (adaptive_wait(&waiter, orig_owner))
		sleep = 1;
	else
		sleep = 0;

	if (sleep) {
		raw_spin_lock(&lock->wait_lock);
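		/* must go to a sleepable state before re-checking the lock */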
		saved_state = rt_set_current_block_state(saved_state);
		if (!lock->owner && &waiter == rt_mutex_top_waiter(lock))
			sleep = 0;
		raw_spin_unlock(&lock->wait_lock);
		if (sleep)
			schedule_rt_mutex(lock);
		saved_state = rt_restore_current_blocked_state(saved_state);
	}

Otherwise we can risk the wakeup_next_waiter() missing the wakeup.

To clarify: we want adaptive_wait() to run in TASK_RUNNING. Then, if we
must sleep, we set the state to TASK_UNINTERRUPTIBLE, test again whether
we can still take the lock, and only if we cannot do we sleep. Otherwise,
if a wakeup happens just before we set the state to TASK_UNINTERRUPTIBLE,
we miss the wakeup altogether.
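
Roughly, the two state helpers in the snippet above would just do the
usual set/restore dance. Something like this (untested, the names and
bodies are only my guess from the pseudo code above, just to show the
ordering):

static inline unsigned long rt_set_current_block_state(unsigned long saved_state)
{
	/*
	 * The state must be set *before* the re-check under
	 * lock->wait_lock. If the owner released the lock while we were
	 * still TASK_RUNNING, its wakeup was skipped, and the re-check
	 * (done with the state already set) is what keeps us from
	 * sleeping forever.
	 */
	set_current_state(TASK_UNINTERRUPTIBLE);

	return saved_state;
}

static inline unsigned long rt_restore_current_blocked_state(unsigned long saved_state)
{
	/* Put back whatever state we came into the slow path with. */
	set_current_state(saved_state);

	return saved_state;
}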

I can do this change, and see what impact it makes.

I'm also curious whether this ever worked. If it did not, are you sure
the tests that showed its benefit were valid? I don't have a large-scale
box at my disposal ATM, so I can only see what this does on 4-way
machines.

-- Steve


