Message-ID: <CA+55aFykHb8oyZquBq53MowOLzc5XXtNW4tad+4cTbO0YYFYNQ@mail.gmail.com>
Date: Tue, 28 Apr 2015 09:28:52 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Borislav Petkov <bp@...en8.de>
Cc: "H. Peter Anvin" <hpa@...or.com>,
Andy Lutomirski <luto@...capital.net>,
Andy Lutomirski <luto@...nel.org>, X86 ML <x86@...nel.org>,
Denys Vlasenko <vda.linux@...glemail.com>,
Brian Gerst <brgerst@...il.com>,
Denys Vlasenko <dvlasenk@...hat.com>,
Ingo Molnar <mingo@...nel.org>,
Steven Rostedt <rostedt@...dmis.org>,
Oleg Nesterov <oleg@...hat.com>,
Frederic Weisbecker <fweisbec@...il.com>,
Alexei Starovoitov <ast@...mgrid.com>,
Will Drewry <wad@...omium.org>,
Kees Cook <keescook@...omium.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Mel Gorman <mgorman@...e.com>
Subject: Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue
On Tue, Apr 28, 2015 at 8:55 AM, Borislav Petkov <bp@...en8.de> wrote:
>
> Provided it is correct, it shows that the 0x66-prefixed 3-byte NOPs are
> better than the 0F 1F 00 suggested by the manual (Haha!):
That's which AMD CPU?
On my Intel i7-4770S, they cost the same (I cut down your loop
numbers by an order of magnitude each because I couldn't be arsed to
wait for it, so it might be off by a cycle or two):
Running 60 times, 1000000 loops per run.
nop_0x90 average: 81.065681
nop_3_byte average: 80.230101
That said, I think your benchmark tests the speed of "rdtsc" rather
than the no-ops. Putting the read_tsc inside the inner loop basically
makes it swamp everything else.
> $ taskset -c 3 ./nops
> Running 600 times, 10000000 loops per run.
> nop_0x90 average: 439.805220
> nop_3_byte average: 442.412915
I think that's in the noise, and could be explained by random
alignment of the loop too, or even random factors like "the CPU heated
up, so the later run was slightly slower". The difference between 439
and 442 doesn't strike me as all that significant.
It might be better to *not* inline, and instead make a real function
call to something that has a lot of no-ops (do some preprocessor magic
to make more no-ops in one go). At least that way the alignment is
likely the same for the two cases.
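A minimal sketch of that "preprocessor magic", assuming GCC or Clang on x86. The macro names, the function names, and the count of 64 no-ops per call are all made up for illustration and are not taken from the attached t.c. The two byte sequences are the ones quoted above (the 0x66-prefixed NOP and 0F 1F 00).

```c
/*
 * Sketch only: macro and function names are hypothetical, and 64 no-ops
 * per call is an arbitrary choice. Assumes GCC or Clang (inline asm and
 * the noinline attribute).
 */
#if defined(__x86_64__) || defined(__i386__)
#define NOP_66_66_90 ".byte 0x66, 0x66, 0x90\n\t"  /* 0x66-prefixed 3-byte NOP */
#define NOP_0F_1F_00 ".byte 0x0f, 0x1f, 0x00\n\t"  /* 3-byte NOP from the manual */
#else
/* Not x86: empty placeholders so the file still builds; timings mean nothing. */
#define NOP_66_66_90 ""
#define NOP_0F_1F_00 ""
#endif

/* Repeat a string literal; adjacent literals are concatenated by the compiler. */
#define REP4(x)  x x x x
#define REP16(x) REP4(REP4(x))
#define REP64(x) REP4(REP16(x))
#define NOPS_PER_CALL 64

/* Real (non-inlined) functions, so both variants are reached by a call. */
__attribute__((noinline)) void nops_66_66_90(void)
{
    __asm__ volatile(REP64(NOP_66_66_90));
}

__attribute__((noinline)) void nops_0f_1f_00(void)
{
    __asm__ volatile(REP64(NOP_0F_1F_00));
}
```

Dividing the measured cycles for one call by NOPS_PER_CALL gives an approximate per-nop cost, with the call and return overhead spread over all 64 no-ops.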
Or if not that, then I think you're better off with something like
    p1 = read_tsc();
    for (i = 0; i < LOOPS; i++) {
        nop_0x90();
    }
    p2 = read_tsc();
    r = (p2 - p1);
because while you're now measuring the loop overhead too, that's
*much* smaller than the rdtsc overhead. So I get something like
Running 600 times, 1000000 loops per run.
nop_0x90 average: 3.786935
nop_3_byte average: 3.677228
and notice the difference between "~80 cycles" and "~3.7 cycles".
Yeah, that's rdtsc. I bet your 440 is about the same thing too.
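The same idea as a self-contained sketch: read the counter once before and once after the whole loop, then divide by the loop count. This is an illustration, not the attached t.c. The helper names `ticks_per_call` and `sample_work` are hypothetical, `__rdtsc()` is assumed to come from the compiler's `<x86intrin.h>`, and calling through a function pointer adds call overhead that the inlined version above does not have.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Read the time stamp counter (GCC/Clang intrinsic). */
static inline uint64_t read_tsc(void) { return __rdtsc(); }
#else
/* Not x86: fall back to a nanosecond clock so the sketch still builds. */
static inline uint64_t read_tsc(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

/* Hypothetical stand-in for the code under test; the empty asm is only a
 * compiler barrier so the call is not optimized into nothing. */
void sample_work(void) { __asm__ volatile(""); }

/*
 * Ticks per call of fn, loop overhead included. The counter is read only
 * twice, so its cost is amortized over `loops` iterations instead of
 * being paid on every iteration.
 */
double ticks_per_call(void (*fn)(void), unsigned long loops)
{
    assert(loops > 0);
    uint64_t p1 = read_tsc();
    for (unsigned long i = 0; i < loops; i++)
        fn();
    uint64_t p2 = read_tsc();
    return (double)(p2 - p1) / (double)loops;
}
```

A call such as `ticks_per_call(sample_work, 1000000)` then reports the per-iteration cost with the counter read amortized away.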
Btw, the whole thing about "averaging cycles" is not the right thing
to do either. You should probably take the *minimum* cycle count, not
the average, because anything non-minimal means "some perturbation"
(i.e. an interrupt etc.).
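To make the minimum-versus-average point concrete, here is a small sketch with hypothetical helpers `min_cycles` and `mean_cycles` operating on an array of per-run cycle counts. The sample values in the note below are invented for illustration and are not measurements from this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest sample: the run that was perturbed the least. */
uint64_t min_cycles(const uint64_t *samples, size_t n)
{
    assert(n > 0);
    uint64_t best = samples[0];
    for (size_t i = 1; i < n; i++)
        if (samples[i] < best)
            best = samples[i];
    return best;
}

/* Arithmetic mean: every interrupted run drags this upwards. */
double mean_cycles(const uint64_t *samples, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += (double)samples[i];
    return sum / (double)n;
}
```

With made-up samples {80, 81, 80, 5000, 80}, where the 5000 stands for a run that took an interrupt, `min_cycles` returns 80 while `mean_cycles` returns 1064.2, so a single perturbed run dominates the average but leaves the minimum untouched.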
So I think something like the attached would be better. It gives an
approximate "cycles per one four-byte nop", and I get
[torvalds@i7 ~]$ taskset -c 3 ./a.out
Running 60 times, 1000000 loops per run.
nop_0x90 average: 0.200479
nop_3_byte average: 0.199694
which sounds suspiciously good to me (5 nops per cycle? uop cache and
nop compression, I guess).
Linus
View attachment "t.c" of type "text/x-csrc" (1893 bytes)