Message-ID: <alpine.LFD.1.10.0808131119290.3462@nehalem.linux-foundation.org>
Date: Wed, 13 Aug 2008 11:27:14 -0700 (PDT)
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
cc: Steven Rostedt <rostedt@...dmis.org>,
Jeremy Fitzhardinge <jeremy@...p.org>,
Andi Kleen <andi@...stfloor.org>,
LKML <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...e.hu>,
Thomas Gleixner <tglx@...utronix.de>,
Peter Zijlstra <peterz@...radead.org>,
Andrew Morton <akpm@...ux-foundation.org>,
David Miller <davem@...emloft.net>,
Roland McGrath <roland@...hat.com>,
Ulrich Drepper <drepper@...hat.com>,
Rusty Russell <rusty@...tcorp.com.au>,
Gregory Haskins <ghaskins@...ell.com>,
Arnaldo Carvalho de Melo <acme@...hat.com>,
"Luis Claudio R. Goncalves" <lclaudio@...g.org>,
Clark Williams <williams@...hat.com>
Subject: Re: Efficient x86 and x86_64 NOP microbenchmarks
On Wed, 13 Aug 2008, Mathieu Desnoyers wrote:
>
> I also did some microbenchmarks on my Intel Xeon 64 bits, AMD64 and
> Intel Pentium 4 boxes to compare a baseline
Note that the biggest problems of a jump-based nop are likely to happen
when there are I$ misses and/or when there are other jumps involved. Ie
some microarchitectures tend to have issues with jumps to jumps, or when
there are multiple control changes in the same (possibly partial)
cacheline, because the instruction stream prediction may be predecoded in
the L1 I$, and multiple branches in the same cacheline - or in the same
execution cycle - can pollute that kind of thing.
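To make the comparison concrete, here is a minimal sketch of the two 5-byte sequences being discussed: a jump-based nop (a near jmp with a zero displacement, so it "falls through" to the next instruction) versus the architectural multi-byte NOP from the recommended encodings in the Intel manuals. The byte values are from the x86 ISA; the `is_branch5` helper is purely illustrative and not anything from the patches under discussion.

```c
#include <stddef.h>

/* A 5-byte "jump-based nop": jmp rel32 with a zero displacement,
 * i.e. a branch to the very next instruction.  0xe9 is the near-jmp
 * opcode, so the predecode/branch machinery still sees a control
 * transfer here. */
static const unsigned char jmp_nop5[5]  = { 0xe9, 0x00, 0x00, 0x00, 0x00 };

/* The architectural 5-byte NOP, NOPL 0x0(%rax,%rax,1), from the
 * recommended multi-byte NOP sequences in the Intel manuals: same
 * length, so it can occupy the same patch site, but it is not a
 * branch at all. */
static const unsigned char real_nop5[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/* Illustrative helper: true if the 5-byte sequence is a near jmp,
 * i.e. a control transfer the branch predictors have to track. */
static int is_branch5(const unsigned char *insn)
{
	return insn[0] == 0xe9;
}
```

Both sequences patch into the same 5 bytes; the difference the mail is pointing at is that only the first one adds a branch to the cacheline's predecoded prediction state.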
So microbenchmarking this way will probably make some things look
unrealistically good.
On the P4, the trace cache makes things even more interesting, since it's
another level of I$ entirely, with very different behavior for the hit
case vs the miss case.
And I$ misses for the kernel are actually fairly high. Not in
microbenchmarks that tend to have very repetitive behavior and a small I$
footprint, but in a lot of real-life loads the *bulk* of all action is in
user space, and then the kernel side is often invoked with few loops (the
kernel has very few loops indeed) and a cold I$.
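The hot-I$ effect being described is easy to reproduce in miniature: a microbenchmark that hammers the same few instructions millions of times keeps its whole footprint resident in the L1 I$ with fully warmed branch predictors, which is exactly what a cold kernel entry path does not look like. A hedged sketch, where `timed_nop_loop` is a hypothetical name and the loop body is just a stand-in for a patched nop site:

```c
/* Sketch of a typical microbenchmark loop: the same handful of
 * instructions executed back to back, so the I$ and predictors are
 * always hot.  The volatile sink keeps the compiler from deleting
 * the loop; real harnesses would time this with rdtsc or
 * clock_gettime() around the call. */
static unsigned long timed_nop_loop(unsigned long iters)
{
	volatile unsigned long sink = 0;
	unsigned long i;

	for (i = 0; i < iters; i++)
		sink += i;	/* stand-in for the patched nop site */
	return sink;
}
```

Numbers from a loop like this mostly measure the hit case; the cold-I$ miss case that dominates real kernel invocations never shows up.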
So your numbers are interesting, but it would be really good to also get
some info from Intel/AMD who may know about microarchitectural issues for
the cases that don't show up in the hot-I$-cache environment.
Linus