Message-ID: <alpine.LFD.1.10.0806041956030.3473@woody.linux-foundation.org>
Date: Wed, 4 Jun 2008 20:08:37 -0700 (PDT)
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Nick Piggin <npiggin@...e.de>
cc: Ingo Molnar <mingo@...e.hu>, David Howells <dhowells@...hat.com>,
Ulrich Drepper <drepper@...hat.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH 0/3] 64-bit futexes: Intro
On Thu, 5 Jun 2008, Nick Piggin wrote:
>
> I'd have thought that for a case like this, you'd simply hit the store
> alias logic and store forwarding would stall because it doesn't have
> the full data.
That's _one_ possible implementation.
Quite frankly, I think it's the less likely one. It's much more likely
that the cache read access and the store buffer probe happen in parallel
(this is a really important hotpath for any CPU, but even more so on x86,
where more of the loads and stores are spills). And then the store buffer
logic would return the data and a byte mask (where the mask would be all
zeroes for a miss), and the returned value is just the appropriate mix of
the two.
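Something like the sketch below, in other words. This is purely
illustrative C with made-up names (merge_load etc), not a claim about how
any particular core actually wires it up:

/*
 * Illustrative only: the store buffer probe returns forwarded data
 * plus a per-byte valid mask, and the load result is just the blend
 * of that with whatever came back from the cache.  A store buffer
 * miss is simply an all-zero mask.
 */
#include <stdint.h>

static inline uint64_t merge_load(uint64_t cache_data,
				  uint64_t sb_data,
				  uint8_t sb_bytemask)
{
	uint64_t mask = 0;
	int i;

	/* expand the per-byte valid mask into a full bit mask */
	for (i = 0; i < 8; i++)
		if (sb_bytemask & (1u << i))
			mask |= 0xffull << (8 * i);

	/* forwarded bytes from the store buffer, the rest from the cache */
	return (sb_data & mask) | (cache_data & ~mask);
}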
> I'd like to know for sure.
You'd have to ask somebody very knowledgeable inside Intel and AMD, and it
is quite likely that different microarchitectures have different
approaches...
> The other thing that could be possible, and I'd imagine maybe more likely
> to be implemented in a real CPU because it should give more improvement
> (and which does break my algorithm) is just that the load to the cacheline
> may get to execute first, then if the cacheline gets evicted and
> modified by another CPU before our store completes, we effectively see
> store/load reordering again.
Oh, absolutely, the perfect algorithm would actually get the right answer
and notice that the cacheline got evicted, and retry the whole sequence
so that it is coherent.
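Purely as a thought experiment (made-up names, and obviously this lives
in the load/store unit rather than in software), that "notice and retry"
amounts to something like:

/*
 * Thought-experiment sketch: a load that forwarded from the store
 * buffer is only allowed to complete if the line is still owned when
 * the store retires; a snoop invalidation in between forces a replay
 * of the whole sequence so the result stays coherent.
 */
struct pending_load {
	unsigned long addr;
	int line_still_owned;	/* cleared by a snoop invalidation */
};

enum load_result { LOAD_OK, LOAD_REPLAY };

static enum load_result complete_load(const struct pending_load *l)
{
	if (!l->line_still_owned)
		return LOAD_REPLAY;	/* redo the load (and the forwarding) */
	return LOAD_OK;
}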
But we do know that Intel expressly documents that loads and stores can
pass each other, and documents the fact that the store buffer is there. So
I bet that this is visible in some microarchitecture, even if it's not
necessarily visible in _all_ of them.
The recent Intel memory ordering whitepaper makes it very clear that loads
can pass earlier stores and in particular that the store buffer allows
intra-processor forwarding to subsequent loads (2.4 in their whitepaper).
It _could_ be just a "for future CPUs" statement, but quite frankly, I'm
100% sure it isn't. Store->load forwarding is such a critical performance
issue that I can pretty much guarantee that it doesn't always hit the
cacheline.
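That's just the classic store-buffer litmus test, by the way. The sketch
below is ordinary pthreads code with made-up names, and you have to run
it in a tight loop a lot of times to actually catch it, but on x86 it can
end with both r0 and r1 being zero, which is only possible if each load
passed the earlier store:

/*
 * Store->load reordering litmus test (sketch - run it many times in
 * a loop to actually observe the reordering).
 */
#include <pthread.h>
#include <stdio.h>

static volatile int x, y, r0, r1;

static void *t0(void *arg)
{
	x = 1;		/* store goes into the store buffer */
	r0 = y;		/* the later load can pass it */
	return NULL;
}

static void *t1(void *arg)
{
	y = 1;
	r1 = x;
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, t0, NULL);
	pthread_create(&b, NULL, t1, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	/* r0 == 0 && r1 == 0 is allowed on x86 */
	printf("r0=%d r1=%d\n", r0, r1);
	return 0;
}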
Of course, the partial store forwarding case is not nearly as important,
and stalling is quite a reasonable implementation approach. I just
personally suspect that doing the unconditional byte-masking is actually
_simpler_ to implement than the stall, so..
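For concreteness, the partial forwarding pattern is a narrow store
followed by a wider load that overlaps it, something like the sketch
below (with a made-up futex_word union, nothing to do with the actual
patches). The store buffer only has four of the eight bytes, so the CPU
either stalls until the store drains or byte-masks the forwarded half
together with the cache data:

#include <stdint.h>

union futex_word {
	uint32_t half[2];
	uint64_t whole;
};

uint64_t narrow_store_wide_load(union futex_word *w)
{
	w->half[0] = 1;		/* 32-bit store sits in the store buffer */
	return w->whole;	/* 64-bit load only partially overlaps it */
}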
Linus