Date:	Wed, 4 Jun 2008 20:08:37 -0700 (PDT)
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Nick Piggin <npiggin@...e.de>
cc:	Ingo Molnar <mingo@...e.hu>, David Howells <dhowells@...hat.com>,
	Ulrich Drepper <drepper@...hat.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH 0/3] 64-bit futexes: Intro



On Thu, 5 Jun 2008, Nick Piggin wrote:
> 
> I'd have thought that for a case like this, you'd simply hit the store
> alias logic and store forwarding would stall because it doesn't have
> the full data.

That's _one_ possible implementation. 

Quite frankly, I think it's the less likely one. It's much more likely 
that the cache read access and the store buffer probe happen in parallel 
(this is a really important hotpath for any CPU, but even more so on x86, 
where more of the loads and stores are spills). And then the store buffer 
logic would return the data and a byte mask (where the mask would be all 
zeroes for a miss), and the returned value is just the appropriate mix of 
the two.
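
To make the "mix of the two" concrete, here is a rough software sketch of 
that byte-masked merge for a single 8-byte load. It is only a model of the 
idea, not a description of any real CPU: the names (struct sb_probe, 
probe_one_store, merge_load), the 8-byte load width, the single pending 
store and the little-endian byte numbering are all assumptions made for 
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result of probing the store buffer for one 8-byte load. */
struct sb_probe {
    uint64_t data;     /* bytes supplied by a pending store */
    uint8_t  bytemask; /* bit i set => byte i comes from the store buffer */
};

/* Model one pending store of `size` bytes at byte `offset` within the
 * 8-byte window being loaded (little-endian byte numbering assumed). */
struct sb_probe probe_one_store(unsigned offset, unsigned size, uint64_t value)
{
    struct sb_probe p = { 0, 0 };

    assert(size >= 1 && offset + size <= 8);
    for (unsigned i = 0; i < size; i++) {
        uint64_t byte = (value >> (8 * i)) & 0xff;

        p.data |= byte << (8 * (offset + i));
        p.bytemask |= (uint8_t)(1u << (offset + i));
    }
    return p;
}

/* Expand the 8-bit byte mask into a 64-bit mask, 0xff per selected byte. */
uint64_t expand_bytemask(uint8_t bytemask)
{
    uint64_t bits = 0;

    for (int i = 0; i < 8; i++)
        if (bytemask & (1u << i))
            bits |= (uint64_t)0xff << (8 * i);
    return bits;
}

/* Mix the cache data with the store buffer data. An all-zero mask
 * (a store buffer miss) returns the cache value unchanged. */
uint64_t merge_load(uint64_t cache_data, struct sb_probe probe)
{
    uint64_t bits = expand_bytemask(probe.bytemask);

    return (probe.data & bits) | (cache_data & ~bits);
}
```

In this model a 4-byte store followed by an 8-byte load of the same 
address takes its low four bytes from the store buffer and its high four 
bytes from the cache, with no stall anywhere.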

> I'd like to know for sure.

You'd have to ask somebody very knowledgeable inside Intel and AMD, and it 
is quite likely that different microarchitectures have different 
approaches...

> The other thing that could be possible, and I'd imagine maybe more likely
> to be implemented in a real CPU because it should give more improvement
> (and which does break my algorithm) is just that the load to the cacheline
> may get to execute first, then if the cacheline gets evicted and
> modified by another CPU before our store completes, we effectively see
> store/load reordering again.

Oh, absolutely, the perfect algorithm would actually get the right answer: 
it would notice that the cacheline got evicted and retry the whole 
sequence so that the result is coherent. 

But we do know that Intel expressly documents that loads and stores can 
pass each other, and documents the fact that the store buffer is there. So 
I bet that this is visible in some micro-architecture, even if it's not 
necessarily visible in _all_ of them.

The recent Intel memory ordering whitepaper makes it very clear that loads 
can pass earlier stores and in particular that the store buffer allows 
intra-processor forwarding to subsequent loads (2.4 in their whitepaper). 
It _could_ be just a "for future CPU's", but quite frankly, I'm 100% sure 
it isn't. The store->load forwarding is such a critical performance issue 
that I can pretty much guarantee that it doesn't always hit the cacheline.
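
As a rough illustration of what that documented behaviour means, here is a 
small deterministic software model of the usual two-CPU store buffer 
example. It is a sketch under stated assumptions, not real hardware and 
not code from the whitepaper: the one-entry store buffer per CPU, the 
struct and function names, and the fixed order in which the steps run are 
all made up for the example.

```c
#include <assert.h>
#include <stdbool.h>

enum { X, Y, NVARS };

static int memory[NVARS];        /* shared, coherent memory */

/* Hypothetical CPU with a one-entry store buffer. */
struct cpu {
    bool pending;
    int  addr;
    int  value;
};

/* A store only enters the local store buffer; other CPUs can't see it. */
void cpu_store(struct cpu *c, int addr, int value)
{
    assert(!c->pending);         /* the model has room for one store only */
    c->pending = true;
    c->addr = addr;
    c->value = value;
}

/* A load probes the local store buffer first (intra-processor
 * forwarding) and otherwise reads shared memory. */
int cpu_load(const struct cpu *c, int addr)
{
    if (c->pending && c->addr == addr)
        return c->value;
    return memory[addr];
}

/* Drain the store buffer, making the store globally visible. */
void cpu_drain(struct cpu *c)
{
    if (c->pending) {
        memory[c->addr] = c->value;
        c->pending = false;
    }
}

struct litmus_result {
    int cpu0_saw_own_x;          /* CPU0 loading its own buffered store */
    int cpu0_saw_y;
    int cpu1_saw_x;
    int final_x, final_y;
};

/* CPU0: x = 1; load x; load y.    CPU1: y = 1; load x. */
struct litmus_result run_store_buffer_litmus(void)
{
    struct cpu cpu0 = { false, 0, 0 }, cpu1 = { false, 0, 0 };
    struct litmus_result r;

    memory[X] = memory[Y] = 0;
    cpu_store(&cpu0, X, 1);
    cpu_store(&cpu1, Y, 1);
    r.cpu0_saw_own_x = cpu_load(&cpu0, X);  /* forwarded from the buffer */
    r.cpu0_saw_y = cpu_load(&cpu0, Y);      /* passes the earlier store */
    r.cpu1_saw_x = cpu_load(&cpu1, X);      /* passes the earlier store */
    cpu_drain(&cpu0);
    cpu_drain(&cpu1);
    r.final_x = memory[X];
    r.final_y = memory[Y];
    return r;
}
```

In the model, CPU0 sees its own x == 1 through forwarding while both CPUs 
still read 0 for the other's variable, which is the store/load reordering 
being discussed; once the buffers drain, memory holds x == 1 and y == 1.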

Of course, the partial store forwarding case is not nearly as important, 
and stalling is quite a reasonable implementation approach. I just 
personally suspect that doing the unconditional byte-masking is actually 
_simpler_ to implement than the stall, so..

			Linus