Message-ID: <alpine.LFD.2.00.1001051120430.3630@localhost.localdomain>
Date: Tue, 5 Jan 2010 11:28:57 -0800 (PST)
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Christoph Lameter <cl@...ux-foundation.org>
cc: Andi Kleen <andi@...stfloor.org>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
Minchan Kim <minchan.kim@...il.com>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
Peter Zijlstra <peterz@...radead.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"hugh.dickins" <hugh.dickins@...cali.co.uk>,
Nick Piggin <nickpiggin@...oo.com.au>,
Ingo Molnar <mingo@...e.hu>
Subject: Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
On Tue, 5 Jan 2010, Christoph Lameter wrote:
>
> The wait state is the processor being stopped due to not being able to
> access the cacheline. Not the processor spinning in the xadd loop. That
> only occurs if the critical section is longer than the timeout.
You don't know what you're talking about, do you?
Just go and read the source code.
The processor is currently spinning in the spin_lock loop. Here, I'll
quote it to you:
LOCK_PREFIX "xaddw %w0, %1\n"
"1:\t"
"cmpb %h0, %b0\n\t"
"je 2f\n\t"
"rep ; nop\n\t"
"movb %1, %b0\n\t"
/* don't need lfence here, because loads are in-order */
"jmp 1b\n"
Note the loop that spins - re-reading the lock word over and over -
waiting for _that_ CPU's ticket to come up as the owner?
That's the one you have now, only because x86-64 uses the STUPID FALLBACK
CODE for the rwsemaphores!
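If the asm is hard to read, here's a rough C11 model of that fast path
(my own sketch, with made-up names like "ticket_lock" - the real code is
the asm above, this is just the same access pattern spelled out):

#include <stdatomic.h>

/* toy ticket lock, same access pattern as the xadd loop above */
struct ticket_lock {
        atomic_ushort next;     /* next ticket to hand out (the xadd) */
        atomic_ushort owner;    /* ticket currently holding the lock */
};

static void ticket_lock_acquire(struct ticket_lock *lock)
{
        /* the LOCK XADD: take a ticket - one exclusive cacheline grab */
        unsigned short me = atomic_fetch_add(&lock->next, 1);

        /* the spin: every waiter re-reads 'owner' from the SAME
         * cacheline, over and over, until its number comes up */
        while (atomic_load_explicit(&lock->owner,
                                    memory_order_acquire) != me)
                ;       /* cpu_relax() / "rep; nop" would go here */
}

static void ticket_lock_release(struct ticket_lock *lock)
{
        atomic_fetch_add_explicit(&lock->owner, 1,
                                  memory_order_release);
}

Every waiter in that while loop is hammering reads on the one cacheline
that the current owner has to write in order to release the lock.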
In contrast, look at what the non-stupid rwsemaphore code does (which
triggers on x86-32):
LOCK_PREFIX " incl (%%eax)\n\t"
/* adds 0x00000001, returns the old value */
" jns 1f\n"
" call call_rwsem_down_read_failed\n"
(That's a "down_read()", which happens to be the op we care most about.)
See? That's a single locked "inc" (it avoids the xadd on the read side
because of how we've biased things). In particular, notice how this means
that we do NOT have fifty million CPU's all trying to read the same
location while one writes to it successfully.
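And in case the bias trick isn't obvious, here it is as a C11 sketch
(again my own toy, not the kernel's rwsem - the bias value and the names
are made up):

#include <stdatomic.h>

/* biased count: readers add 1, a writer adds a bias so large that
 * the count goes negative for as long as the writer holds the lock */
#define RWSEM_WRITER_BIAS (-0x40000000)

/* hypothetical slow path - the real one queues the reader and
 * sleeps, it does NOT sit there re-reading the lock word in a loop */
static void rwsem_down_read_failed(void) { /* block here */ }

static void down_read_sketch(atomic_int *count)
{
        /* one locked increment, one exclusive cacheline grab, done */
        if (atomic_fetch_add(count, 1) + 1 >= 0)
                return;                 /* fast path: no writer */
        rwsem_down_read_failed();       /* writer present: go sleep */
}

static void up_read_sketch(atomic_int *count)
{
        atomic_fetch_sub(count, 1);
}

That "+ 1 >= 0" is the "jns": the sign of the *new* value tells you
whether a writer is in there. No loop anywhere on the fast path.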
Spot the difference?
Here's putting it another way. Which of these scenarios do you think
should result in less cross-node traffic:
- multiple CPU's that - one by one - get the cacheline for exclusive
access.
- multiple CPU's that - one by one - get the cacheline for exclusive
access, while other CPU's are all trying to read the same cacheline at
the same time, over and over again, in a loop.
See the shared part? See the difference? If you look at just a single lock
acquire, it boils down to these two scenarios:
- one CPU gets the cacheline exclusively
- one CPU gets the cacheline exclusively while <n> other CPU's are all
trying to read the old and the new value.
It really is that simple.
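If you don't believe it, it's trivial to fake up with a toy program
(hypothetical microbenchmark, pthreads, compile with -pthread - this
isn't measuring rwsems, just the raw cacheline behavior):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* N waiters spin-reading the word that one owner keeps writing.
 * Every owner write invalidates the line in every waiter's cache;
 * every waiter read drags it back shared. That's the second
 * scenario. Don't start the waiters and you have the first. */
#define NWAITERS 4

static atomic_int line;         /* stand-in for the lock word */
static atomic_int stop;

static void *waiter(void *arg)
{
        (void)arg;
        while (!atomic_load(&stop))
                atomic_load(&line);     /* shared read, over and over */
        return NULL;
}

int main(void)
{
        pthread_t t[NWAITERS];
        int i;

        for (i = 0; i < NWAITERS; i++)
                pthread_create(&t[i], NULL, waiter, NULL);

        /* the "owner" side: exclusive writes to the same line */
        for (i = 0; i < 10 * 1000 * 1000; i++)
                atomic_fetch_add(&line, 1);

        atomic_store(&stop, 1);
        for (i = 0; i < NWAITERS; i++)
                pthread_join(t[i], NULL);
        printf("%d\n", atomic_load(&line));
        return 0;
}

Time the write loop with and without the waiter threads started and
watch it crawl. Same single writer, same line - the only difference is
the readers.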
Linus