Date:	Wed, 4 Mar 2009 14:37:15 +1100
From:	Nick Piggin <nickpiggin@...oo.com.au>
To:	Ingo Molnar <mingo@...e.hu>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	"H. Peter Anvin" <hpa@...or.com>,
	Arjan van de Ven <arjan@...radead.org>,
	Andi Kleen <andi@...stfloor.org>,
	David Miller <davem@...emloft.net>, sqazi@...gle.com,
	linux-kernel@...r.kernel.org, tglx@...utronix.de
Subject: Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()

On Tuesday 03 March 2009 20:02:52 Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@...oo.com.au> wrote:
> > On Tuesday 03 March 2009 08:16:23 Linus Torvalds wrote:
> > > On Mon, 2 Mar 2009, Nick Piggin wrote:
> > > > I would expect any high performance CPU these days to combine entries
> > > > in the store queue, even for normal store instructions (especially
> > > > for linear memcpy patterns). Isn't this likely to be the case?
> > >
> > > None of this really matters.
> >
> > Well that's just what I was replying to. Of course
> > nontemporal/uncached stores can't avoid cc operations either,
> > but somebody was hoping that they would avoid the
> > write-allocate / RMW behaviour. I just replied because I think
> > that modern CPUs can combine stores in their store queues to
> > get the same result for cacheable stores.
> >
> > Of course it doesn't make it free especially if it is a cc
> > protocol that has to go on the interconnect anyway. But
> > avoiding the RAM read is a good thing anyway.
>
> Hm, why do you assume that there is a RAM read?

I don't ;) Re-read back a few posts. I thought that nontemporal stores
would not necessarily have an advantage in avoiding write-allocate
behaviour, because I thought CPUs should combine stores in their store
buffer.

Some simple tests show that nontemporal stores take about 0.7 times as
long as rep stosq here when the destination is much larger than cache.
So the CPU isn't quite as clever as I assumed.

I can't find any references to back up my assumption, but I thought I
heard it somewhere. It might have been in relation to some powerpc CPUs
not requiring their cacheline clear instruction because they combine
store buffer entries. But I could be way off.


> A sufficiently
> advanced x86 CPU will have good string moves with full cacheline
> transfers - removing partial cachelines and removing the need
> for the physical read.

I thought this should be the case even with a plain sequence of normal
stores. But that takes about 1.4 times as long as rep stosq, so again
maybe I overestimated. I don't know.
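
For anyone who wants to repeat the comparison, a rough sketch of such a
test might look like the following. This is not the code behind the
numbers above, just one way to time the three store patterns mentioned
(plain stores, rep stosq, nontemporal stores) on x86-64 with GCC or
Clang. All function names here (fill_plain, fill_rep_stosq,
fill_nontemporal, time_fill, compare_fills) are made up for
illustration, and the ratios will vary with the CPU and buffer size.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() */
#endif
#include <assert.h>
#include <emmintrin.h>   /* _mm_stream_si64, _mm_sfence (x86-64 only) */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef void (*fill_fn)(uint64_t *dst, size_t n, uint64_t v);

/* Plain cacheable 8-byte stores. The volatile pointer keeps the
 * compiler from replacing the loop with memset or vector stores. */
static void fill_plain(uint64_t *dst, size_t n, uint64_t v)
{
    volatile uint64_t *p = dst;
    for (size_t i = 0; i < n; i++)
        p[i] = v;
}

/* String store: rep stosq writes rcx quadwords of rax starting at rdi.
 * Relies on the direction flag being clear, as the ABI guarantees. */
static void fill_rep_stosq(uint64_t *dst, size_t n, uint64_t v)
{
    __asm__ volatile("rep stosq"
                     : "+D"(dst), "+c"(n)
                     : "a"(v)
                     : "memory");
}

/* Nontemporal 8-byte stores (movnti), fenced at the end so the
 * write-combining buffers are drained before timing stops. */
static void fill_nontemporal(uint64_t *dst, size_t n, uint64_t v)
{
    for (size_t i = 0; i < n; i++)
        _mm_stream_si64((long long *)&dst[i], (long long)v);
    _mm_sfence();
}

/* Time one fill of n quadwords (n must be nonzero); returns seconds. */
static double time_fill(fill_fn fn, uint64_t *buf, size_t n, uint64_t v)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn(buf, n, v);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    assert(buf[0] == v && buf[n - 1] == v);   /* sanity check */
    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

/* Fill a buffer of the given size with each method and print the
 * times relative to rep stosq. Returns 0 on success, -1 on failure. */
int compare_fills(size_t bytes)
{
    size_t n = bytes / sizeof(uint64_t);
    uint64_t *buf;

    if (n == 0 || (buf = malloc(n * sizeof *buf)) == NULL)
        return -1;

    fill_plain(buf, n, 0);   /* touch every page so faults are not timed */

    double t_stos  = time_fill(fill_rep_stosq, buf, n, 1);
    double t_nt    = time_fill(fill_nontemporal, buf, n, 2);
    double t_plain = time_fill(fill_plain, buf, n, 3);

    printf("rep stosq   %.6f s\n", t_stos);
    printf("nontemporal %.6f s (%.2f x rep stosq)\n", t_nt, t_nt / t_stos);
    printf("plain       %.6f s (%.2f x rep stosq)\n", t_plain, t_plain / t_stos);

    free(buf);
    return 0;
}
```

Calling compare_fills() from main() with a size well beyond the last-level
cache (a few hundred MB, say) matches the "destination much larger than
cache" case; a single pass per method is noisy, so repeating the runs
and taking the best time would be a sensible refinement.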
