Message-ID: <alpine.LFD.2.00.0902280904271.3111@localhost.localdomain>
Date: Sat, 28 Feb 2009 09:16:21 -0800 (PST)
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Ingo Molnar <mingo@...e.hu>
cc: Nick Piggin <nickpiggin@...oo.com.au>,
Salman Qazi <sqazi@...gle.com>, davem@...emloft.net,
linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
"H. Peter Anvin" <hpa@...or.com>, Andi Kleen <andi@...stfloor.org>
Subject: Re: [patch] x86, mm: pass in 'total' to
__copy_from_user_*nocache()
On Sat, 28 Feb 2009, Ingo Molnar wrote:
>
> Can you suggest some other workload that should show sensitivity
> to this detail too? Like a simple write() loop of non-4K-sized
> files or so?
I bet you can find it, but I also suspect that it will depend quite a bit
on the microarchitecture. What does 'movntq' actually _do_ on different
CPUs (bypass L1 or L2 or just turn the L1 cache policy to "write through
and invalidate")? How expensive is the sfence when there are still stores
in the write buffer? Does 'movntq' even use the write buffer for cached
stores, or does it take some special path to the last-level cache?
If you want to be really subtle, ask questions like what are the
implications for last-level caches that are inclusive? The last-level
cache would take not just the new write, but it also has logic to make
sure that it's a superset of the inner caches, so what does that do to
replacement policy for that cache? Or does it cause invalidations in the
inner caches?
Non-temporal stores are really quite different from normal stores.
Depending on microarchitecture, that may be totally a non-issue (bypassing
the L1 may be trivial and have no impact on anything else at all). Or it
could be that a movntq is really expensive because it needs to do odd
things.
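To make the distinction concrete, here is a minimal user-space sketch of a copy that uses non-temporal stores followed by an sfence. It is an illustration only, not the kernel's __copy_from_user_*nocache() code: copy_nocache() is a hypothetical name, and it uses the SSE2 intrinsic _mm_stream_si128 (which compiles to movntdq rather than movntq), falling back to memcpy() when SSE2 is unavailable or the destination is not 16-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si128, _mm_loadu_si128, _mm_sfence */
#endif

/*
 * Hypothetical helper (not a kernel function): copy n bytes from src to dst.
 * If SSE2 is available, dst is 16-byte aligned and n is a multiple of 16,
 * the data is written with non-temporal stores; otherwise plain memcpy().
 */
static void copy_nocache(void *dst, const void *src, size_t n)
{
#if defined(__SSE2__)
    if (((uintptr_t)dst & 15) == 0 && (n & 15) == 0) {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        size_t i;

        for (i = 0; i < n / 16; i++) {
            /* Unaligned load from src, non-temporal store to dst. */
            _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));
        }
        /* Order the streaming stores before any later stores. */
        _mm_sfence();
        return;
    }
#endif
    /* Fallback: ordinary cached stores. */
    memcpy(dst, src, n);
}
```

Timing this against a plain memcpy() for the same sizes on each machine is one way to see how much the two store types differ there; the sketch itself says nothing about which one wins.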
So if you want to test this, I'd suggest using the same program that did
the 256-byte writes (Unixbench's fstime thing), but just change the
numbers, and just try different things. But I'd _also_ suggest that if
you're going for anything more complicated (ie if you really want to
have a good argument for that 'total_size' thing), then you should try out
at least three different microarchitectures.
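As a rough stand-in for that kind of test, the sketch below times a simple write() loop with a configurable chunk size. It is not Unixbench's fstime; time_writes() and its parameters are hypothetical names chosen for illustration, and the result includes everything the loop does, not just the user copy.

```c
#define _POSIX_C_SOURCE 200809L

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `total` bytes to `path` using write() calls of
 * at most `chunk` bytes each. Returns elapsed wall-clock seconds, or -1.0
 * on error. The file is created or truncated.
 */
static double time_writes(const char *path, size_t chunk, size_t total)
{
    struct timespec t0, t1;
    size_t done = 0;
    int failed = 0;
    char *buf;
    int fd;

    if (chunk == 0)
        return -1.0;
    buf = malloc(chunk);
    if (!buf)
        return -1.0;
    memset(buf, 'x', chunk);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        free(buf);
        return -1.0;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while (done < total) {
        size_t want = (total - done < chunk) ? total - done : chunk;
        ssize_t n = write(fd, buf, want);

        if (n <= 0) {           /* error or no progress */
            failed = 1;
            break;
        }
        done += (size_t)n;      /* short writes are simply continued */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    close(fd);
    free(buf);
    if (failed)
        return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling it with chunk sizes such as 256, 1000, 4096 and 5000 for the same total, on each of the machines being compared, is the "just change the numbers" part.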
The "different" ones would be at a minimum P4, Core2 and Opteron. They
really could have very different behavior.
I suspect Core2 and Core i7 are fairly similar, but at the same time Ci7
has that L3 cache thing, so it's quite possible that movntq is actually
fundamentally different (does it bypass both L1 and L2? If so, latencies
to the L3 on Ci7 are _much_ longer than the very cheap L2 latencies on
C2).
Linus