Message-ID: <20081104163636.GA20534@elte.hu>
Date: Tue, 4 Nov 2008 17:36:36 +0100
From: Ingo Molnar <mingo@...e.hu>
To: Alexander van Heukelum <heukelum@...tmail.fm>
Cc: Cyrill Gorcunov <gorcunov@...il.com>,
Alexander van Heukelum <heukelum@...lshack.com>,
LKML <linux-kernel@...r.kernel.org>,
Thomas Gleixner <tglx@...utronix.de>,
"H. Peter Anvin" <hpa@...or.com>, lguest@...abs.org,
jeremy@...source.com, Steven Rostedt <srostedt@...hat.com>,
Mike Travis <travis@....com>, Andi Kleen <andi@...stfloor.org>
Subject: Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes
* Alexander van Heukelum <heukelum@...tmail.fm> wrote:
> I wonder how the time needed for reading the GDT segments balances
> against the time needed due to the extra redirection due to running
> the stubs. I'd be interested if the difference can be measured with
> the current implementation. (I really need to hijack a machine to
> do some measurements; I hoped someone would do it before I got to it
> ;) )
>
> Even if some CPUs have some internal optimization for the case
> where the gate segment is the same as the current one, I wonder if
> it is really important... Interrupts that occur while the processor
> is running userspace already cause changing segments. They are more
> likely to be in cache, maybe.
there are three main factors:
- Same-value segment loads are optimized on most modern CPUs and can
give a few cycles (2-3) advantage. That might or might not apply to
the microcode that does IRQ entry processing. (A cache miss will
increase the cost much more but that is true in general as well)
- A second effect is the changed data structure layout: a more
compressed GDT entry (6 bytes) against a more spread out (~7 bytes,
not aligned) interrupt trampoline. Note that the first one is data
cache the second one is instruction cache - the two have different
sizes, different implementations and different hit/miss pressures.
Generally the instruction-cache is the more precious resource and we
optimize for that first, for data cache second.
- A third effect is branch prediction: currently we are fanning
out all the vectors into ~240 branches just to recover a single
constant in essence. That is quite wasteful of instruction cache
resources, because from the logic side it's a data constant, not a
control flow difference. (we demultiplex that number into an
interrupt handler later on, but the CPU has no knowledge of that
relationship)
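To make the third point concrete, here is a minimal user-space sketch in C, not kernel code. It models the difference between per-vector stubs, where each entry point exists only to supply one constant, and a single entry point that receives the vector as data. All names (handler_table, register_handler, common_interrupt, DEFINE_STUB, single_entry) are hypothetical and chosen for illustration. How the real entry path recovers the vector is not modelled; it is simply passed in as an argument.

```c
#include <assert.h>
#include <stddef.h>

#define NR_VECTORS 256          /* x86 has 256 interrupt vectors */

typedef void (*irq_handler_t)(int vector);

/* Hypothetical handler table; stands in for the later demultiplexing step. */
static irq_handler_t handler_table[NR_VECTORS];

static int last_vector = -1;    /* records which vector the demo handler saw */

static void demo_handler(int vector)
{
    last_vector = vector;
}

static void register_handler(int vector, irq_handler_t fn)
{
    assert(vector >= 0 && vector < NR_VECTORS);
    handler_table[vector] = fn;
}

/* Common entry path: from here on the vector is plain data. */
static void common_interrupt(int vector)
{
    if (vector < 0 || vector >= NR_VECTORS || handler_table[vector] == NULL)
        return;                 /* out-of-range or unhandled vector */
    handler_table[vector](vector);
}

/*
 * Stub style: one tiny entry point per vector. The stubs differ only in
 * the constant they pass on, yet each one is a separate piece of code.
 */
#define DEFINE_STUB(n) static void stub_##n(void) { common_interrupt(n); }
DEFINE_STUB(32)
DEFINE_STUB(33)

/*
 * Data style: a single entry point. The vector arrives as a value
 * (modelled here as an ordinary argument) instead of being encoded
 * in which code address was entered.
 */
static void single_entry(int vector_from_hw)
{
    common_interrupt(vector_from_hw);
}
```

In the stub style the number of distinct code entry points grows with the number of vectors; in the data style it stays at one and only the table lookup depends on the vector. Which one is faster on real hardware is exactly what the sketch cannot tell.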
... all in all, the situation is complex enough on the CPU
architecture side for it to really necessitate measurements in
practice, and that's why i have asked you to do them: the numbers need
to go hand in hand with the patch submission.
My estimation is that if we do it right, your approach will behave
better on modern CPUs (which is what matters most for such things),
especially on real workloads where there's a considerable
instruction-cache pressure. But it should be measured in any case.
Ingo