Message-ID: <45075829.701@goop.org>
Date: Tue, 12 Sep 2006 18:00:25 -0700
From: Jeremy Fitzhardinge <jeremy@...p.org>
To: Arjan van de Ven <arjan@...radead.org>
CC: akpm@...l.org, ak@...e.de, mingo@...e.hu,
linux-kernel@...r.kernel.org, Michael.Fetterman@...cam.ac.uk,
Ian Campbell <Ian.Campbell@...Source.com>
Subject: Re: i386 PDA patches use of %gs
Arjan van de Ven wrote:
> The advantage of this is very simple: %fs will be 0 for userspace most
> of the time. Putting 0 in a segment register is cheap for the cpu,
> putting anything else in is quite expensive (a LOT of security checks
> need to happen). As such I would MUCH rather see that the i386 PDA
> patches use %fs and not %gs...
Hi Arjan,
I spent some time trying to measure this, to see if there really is a
difference between loading a null selector and a non-null one.
The short answer is no: I couldn't measure any difference at all on any
CPU, going back to a P166 and up to a current Core Duo machine.
I used a usermode test model of the entry.S code in order to make it
easier to test on more machines. The basic inner loop is:
    push %segreg
    mov  %selectorreg, %segreg
    add  $1, %segreg:offset        # use the segment register
    pop  %segreg
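For concreteness, here's a minimal sketch (an illustration, not the code
from the attached time-segops.c) of how one such iteration might be
written with gcc inline asm, using %fs and assuming the selector passed
in describes writable scratch memory:

    /* One reload-and-use iteration through %fs.  'sel' is assumed to
     * describe writable scratch memory (e.g. a TLS entry pointing at a
     * scratch variable). */
    static inline void fs_reload_and_use(unsigned short sel)
    {
            asm volatile("push %%fs\n\t"
                         "mov  %w0, %%fs\n\t"   /* load the test selector */
                         "addl $1, %%fs:0\n\t"  /* use the segment register */
                         "pop  %%fs"            /* restore the previous %fs */
                         : : "r"(sel) : "memory", "cc");
    }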
I also unrolled the loop to minimize the overhead from anything else.
This is clearly much more segment-register intensive than any real use, so
I'm hoping it will amplify any performance differences. I also tried
putting cpuid in the loop in order to approximate the synchronizing
effects of taking an exception, but it didn't seem to make much difference
other than slowing everything down by a constant amount (the cpuid
slowdown swamped pretty much everything else on Intel CPUs, but was much
less intrusive on the Athlon64).
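The timing itself can be done with rdtsc; here's a rough sketch (again,
not the attached program) of how the loop could be timed, with cpuid
optionally used as the serializing instruction mentioned above, and using
the fs_reload_and_use() helper sketched earlier:

    #include <stdint.h>

    static inline uint64_t rdtsc(void)
    {
            uint32_t lo, hi;
            asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
            return ((uint64_t)hi << 32) | lo;
    }

    static inline void serialize(void)
    {
            unsigned int eax = 0;
            /* cpuid is a serializing instruction; it clobbers eax-edx */
            asm volatile("cpuid" : "+a"(eax) : : "ebx", "ecx", "edx", "memory");
    }

    /* Time 'iters' reload-and-use sequences through %fs; if 'sync' is
     * set, force serialization before each one. */
    static uint64_t time_fs_loads(unsigned short sel, unsigned long iters,
                                  int sync)
    {
            uint64_t start, end;
            unsigned long i;

            start = rdtsc();
            for (i = 0; i < iters; i++) {
                    if (sync)
                            serialize();
                    fs_reload_and_use(sel);
            }
            end = rdtsc();
            return end - start;
    }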
I tried the push/load/pop sequence with both %fs and %gs, where pop %fs
would result in a null selector load, and pop %gs would load the normal
userspace TLS selector.
I also tried loading 3 types of selector after the push (a rough sketch
of how each can be set up from userspace follows the list):
    * the normal usermode %ds selector, on the grounds that the CPU might
      be more efficient in reloading a selector which is already in use
    * an LDT selector, which I thought might be slower since (at least
      conceptually) there's an indirection into a different descriptor table
    * and a GDT selector (the normally unused second TLS selector)
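Here's how the three selector flavours can be obtained from userspace on
i386 (the entry numbers, the scratch variable and the helper names are
illustrative assumptions, not taken from the attached program):

    #define _GNU_SOURCE
    #include <asm/ldt.h>            /* struct user_desc */
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>

    static unsigned int scratch;    /* the test segments' base points here */

    static void fill_desc(struct user_desc *d, int entry)
    {
            memset(d, 0, sizeof(*d));
            d->entry_number = entry;
            d->base_addr    = (unsigned long)&scratch;
            d->limit        = sizeof(scratch) - 1;
            d->seg_32bit    = 1;
            d->useable      = 1;
    }

    static void setup_selectors(unsigned short *ds_sel,
                                unsigned short *ldt_sel,
                                unsigned short *gdt_sel)
    {
            struct user_desc d;

            /* 1. the normal usermode data selector, straight out of %ds */
            asm("mov %%ds, %0" : "=r"(*ds_sel));

            /* 2. an LDT selector: install descriptor 0 with modify_ldt();
             *    selector = index<<3 | TI (4) | RPL (3) */
            fill_desc(&d, 0);
            syscall(SYS_modify_ldt, 1, &d, sizeof(d));
            *ldt_sel = (0 << 3) | 4 | 3;

            /* 3. a GDT selector: have set_thread_area() pick a free TLS
             *    entry; the selector is entry<<3 | RPL (3) */
            fill_desc(&d, -1);      /* -1: kernel chooses the entry */
            syscall(SYS_set_thread_area, &d);
            *gdt_sel = (d.entry_number << 3) | 3;
    }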
In general, I got identical results for all of these. There were two
exceptions:
    * The 1.8 GHz P4 Northwood was slower loading the LDT selector, as
      expected, and pop %fs was faster than pop %gs. The GDT and data
      selector results were the same regardless of whether %fs or %gs
      was used.
    * The AMD K6 was consistently *slower* with pop %fs; pop %gs was
      faster. I didn't try reversing the uses of %fs and %gs to see
      whether it was the null selector being slower or some inherent
      slowness in using %fs.
It's possible I got something wrong, and I'm not really measuring what I
think I'm measuring. The main thing that worries me about the results
is that they don't scale much at all in proportion to the clock speed.
Otherwise the results look sensible to me. I'd appreciate it if people
could review the test program to see if I've overlooked something.
So, in summary, I don't think there's much point in switching to %fs. I
may get around to confirming this by doing a %gs->%fs conversion patch,
but given these results that's at a fairly low priority.
I've attached my test program and results.
J
Attachments:
    time-segops.c       (text/x-csrc, 5235 bytes)
    results-nosync.txt  (text/plain, 3165 bytes)