Date:	Mon, 10 Jan 2011 07:49:20 -0700
From:	"Gregory Haskins" <ghaskins@...ell.com>
To:	"Steven Rostedt" <rostedt@...dmis.org>
Cc:	"Lai Jiangshan" <laijs@...fujitsu.com>,
	"Ingo Molnar" <mingo@...e.hu>,
	"Peter Zijlstra" <peterz@...radead.org>,
	"ThomasGleixner" <tglx@...utronix.de>,
	"Peter Morreale" <PMorreale@...ell.com>,
	<linux-kernel@...r.kernel.org>
Subject:	Re: [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup

Hey Steve,

Just getting back online now....

>>> On 1/3/2011 at 02:06 PM, in message
<1294081596.3948.192.camel@...dalf.stny.rr.com>, Steven Rostedt
<rostedt@...dmis.org> wrote: 
> On Tue, 2010-12-28 at 07:06 -0700, Gregory Haskins wrote:
>> >>> On 12/23/2010 at 11:54 PM, in message
> 
>> > Sure, but as I said, it is mostly broken anyway. I could even insert
>> > some tracepoints to show that this is always missed (heck I'll add an
>> > unlikely and do the branch profiler ;-)
>> 
>> Well, I think that would be a good datapoint and is one of the things
>> I'd like to see.
> 
> OK, here it is, after running a simple "dbench 10":
> 
>  correct incorrect  %        Function                  File       Line
>  ------- ---------  -        --------                  ----       ----
>      150   726979  99        wakeup_next_waiter        rtmutex.c  581
> 

Interesting, thanks for gathering the info.
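
For reference, the check being profiled above is the wakeup-avoidance
path in wakeup_next_waiter().  From memory it was shaped roughly like
this (a sketch only; the names follow the old -rt code as best I
recall, so treat them as approximate rather than an exact quote):

	/*
	 * Sketch of the reverted optimization: skip the wakeup if the
	 * pending owner already woke itself up (e.g. an adaptive
	 * spinner).  The smp_mb() is meant to pair with the waiter
	 * setting its own state.
	 */
	smp_mb();
	if (unlikely(pendowner->state != TASK_RUNNING &&
		     pendowner->state != TASK_RUNNING_MUTEX))
		wake_up_process_mutex(pendowner);

So your numbers say the "skip" case only fired ~150 times against
~727k wakeups; the branch the optimization bets on almost never
happens.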

 
> 
> Interesting that we hit 150 cases where the new owner was already awake.
> Perhaps it was because the new owner was about to spin on a spinlock
> after it had put itself into the TASK_INTERRUPTIBLE state, and it was
> woken up by someone else?

I speculate that what you are seeing is that the waiter was an adaptive spinner, and it beat the releaser
by changing its own state from TASK_INTERRUPTIBLE to TASK_RUNNING in parallel, before the releaser
checked the condition.  It's essentially a race that apparently only hits ~1% of the time.  Without further
instrumentation, though, that's just a guess.
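
In interleaving terms, I am picturing something like this (pure
speculation on my part, not taken from a trace):

	/*
	 *  releaser (CPU0)                adaptive waiter (CPU1)
	 *  -----------------------------  -----------------------------
	 *  wakeup_next_waiter()           spinning, state still
	 *                                 TASK_INTERRUPTIBLE
	 *                                 flips itself back to
	 *                                 TASK_RUNNING
	 *  smp_mb()
	 *  sees waiter already RUNNING
	 *  skips the wakeup               (harmless: already awake)
	 */

which would account for the ~150 "correct" hits without the check
actually buying us anything.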



> 
> 
>> 
>> > 
>> > The reason is that adaptive spinners spin in some other state than
>> > TASK_RUNNING, thus it does not help adaptive spinners at all. I first
>> > tried to fix that, but it made dbench run even slower.
>> 
>> This is why I am skeptical.  You are essentially asserting there are
>> two issues here, IIUC:
>> 
>> 1) The intent of avoiding a wakeup is broken and we take the double
>>    whammy of a mb() plus the wakeup() anyway.
> 
> Yep, this is what I believe is happening.
> 
>> 
>> 2) mb() is apparently slower than wakeup().
> 
> Forget this point, as #1 seems to be the main issue. Also, I'm not sure
> it is safe to "fix" this, as I found when I started to try.

ok

> 
> 
>> 
>> I agree (1) is plausible, though I would like to see the traces to
>> confirm.  It's been a long time since I looked at that code, but I
>> think the original code either ran in RUNNING_MUTEX and was
>> inadvertently broken in the meantime, or the other cpu would have
>> transitioned to RUNNING on its own when we flipped the owner before
>> the release-side check was performed.  Or perhaps we just plain
>> screwed this up and it was racy ;)  I'm not sure.  But as Peter (M)
>> stated, it seems like a shame to walk away from the concept without
>> further investigation.  I think everyone can agree that at the very
>> least, if it is in fact taking a double whammy we should fix that.
> 
> [ cut out all the 2's since I'm not worried about that now, even if it 
>   is not a problem. ]
> 
> 
> Now, the way I was going to fix this is to have the adaptive wait be in
> TASK_RUNNING state, and then change the state iff we are going to sleep.
> 
> But this can be a bit more racy. Especially with Lai's new changes. With
> the new changes, the waiter->task does not get nulled anymore.
> 
> The test to take the lock now needs to be under the lock->wait_lock
> spinlock. We have to grab that lock, and see if the owner is null, and
> that we are the next waiter in the queue. Thus, on taking the lock we
> need to have something like this:
> 
> 
> 	if (adaptive_wait(&waiter, orig_owner))
> 		sleep = 1;
> 	else
> 		sleep = 0;
> 
> 	if (sleep) {
> 		raw_spin_lock(&lock->wait_lock);
> 		saved_state = rt_set_current_block_state(saved_state);
> 		if (!lock->owner && &waiter == rt_mutex_top_waiter(lock))
> 			sleep = 0;
> 		raw_spin_unlock(&lock->wait_lock);
> 		if (sleep)
> 			schedule_rt_mutex(lock);
> 		saved_state = rt_restore_current_blocked_state(saved_state);
> 	}
> 
> Otherwise we can risk the wakeup_next_waiter() missing the wakeup.

Yep, if you want to mess with the state, you definitely need something like your proposal above; I agree.
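
Just to spell out the window for the archives (a hypothetical
interleaving, assuming the re-test under lock->wait_lock is omitted):

	/*
	 *  waiter                         releaser
	 *  -----------------------------  -----------------------------
	 *  adaptive_wait() says "sleep"
	 *  (still TASK_RUNNING)
	 *                                 releases the lock, sees the
	 *                                 waiter in TASK_RUNNING, and
	 *                                 skips the wakeup
	 *  state = TASK_UNINTERRUPTIBLE
	 *  schedule_rt_mutex(lock)        <-- nobody left to wake us
	 */

The re-test of lock->owner under the wait_lock is what closes that
window.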

> 
> To clarify, we want the adaptive_wait() to run as TASK_RUNNING. Then if
> we must sleep, we set the state to TASK_UNINTERRUPTIBLE, test again
> whether we can still take the lock, and if not, then sleep. Otherwise,
> if a wakeup happens just before we set the state to
> TASK_UNINTERRUPTIBLE, we miss the wakeup altogether.
> 
> I can do this change, and see what impact it makes.
> 
> I'm also curious if this ever worked?

Well, I _thought_ so, but it was so long ago that I don't remember the unit tests at that level of granularity.

> If it did not, then are you sure the tests that showed its benefit
> were valid?  I don't have a large-scale box at my disposal ATM, so I
> can only see what this does on 4-way machines.

Yeah, me neither.  At the time we had 16- and 32-core boxes, plus your 64-core box.  There were certainly
substantial improvements throughout the series on those machines (which I think you also verified
independently, or you wouldn't have accepted them ;).  Given that what you found amounts to a race,
I suppose the code, even if it was racy from day 1, may still have had a positive impact for certain
workloads, since the race will be environment-specific.

I digress: whether it worked once and broke in the meantime, or was always broken, is purely an exercise
in evaluating my stupidity ;)  I'll leave that as an exercise for the community.  The important thing is to
either fix the optimization (e.g. with a patch like yours above) or remove it outright.
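
(And just to be explicit, "remove it outright" means the release path
goes back to doing the wakeup unconditionally, i.e. roughly

	/* plain release path, no state peeking at all */
	wake_up_process(pendowner);

which is effectively what your revert in this patch restores.)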

Bottom line: Nice find, and let me know if you need me to do anything.

-Greg


