linux-kernel - Re: [Lguest] [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20081129154516.GA26579@mailshack.com>
Date:	Sat, 29 Nov 2008 16:45:16 +0100
From:	Alexander van Heukelum <heukelum@...lshack.com>
To:	Avi Kivity <avi@...hat.com>, "H. Peter Anvin" <hpa@...or.com>
Cc:	Alexander van Heukelum <heukelum@...tmail.fm>, lguest@...abs.org,
	jeremy@...source.com, LKML <linux-kernel@...r.kernel.org>,
	Mike Travis <travis@....com>,
	Cyrill Gorcunov <gorcunov@...il.com>,
	Steven Rostedt <srostedt@...hat.com>,
	Andi Kleen <andi@...stfloor.org>, Ingo Molnar <mingo@...e.hu>,
	Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [Lguest] [PATCH RFC/RFB] x86_64,	i386: interrupt dispatch changes

On Fri, Nov 28, 2008 at 09:48:10PM +0100, Alexander van Heukelum wrote:
> On Thu, Nov 27, 2008 at 12:13:43PM +0200, Avi Kivity wrote:
> > H. Peter Anvin wrote:
> > >
> > >>I suspect we could get it down to three bytes, by sharing the last 
> > >>byte of the four-byte call sequence with the first byte of the next:
> > >>
> > >> 66 e8 ff 66 e8 fc 66 e8 f9 66 e8 f6 ...
> > >>
> > >>Every three bytes a new stub begins; it's a four-byte call to offset 
> > >>0x6703 relative to the beginning of the first stub.
> > >>
> > >>Can anyone better 24 bits/stub?
> > >
> > >On the entirely silly level...
> > >
> > >CC xx
> > 
> > Nice.  Can actually go to zero, by pointing the IDT at (unmapped_area + 
> > vector), and deducing the vector in the page fault handler from cr2.
> 
> Hi all,
> 
> We started the discussion with doing away with the whole jump
> array entirely, by changing the value of the CS index in the
> IDT. This needs the GDT to be extended with 256 entries, but an
> entire page (space for 512 entries) was already reserved anyhow!
> I think there is still some problem with the patch I sent due to
> some code depending on certain values of the CS index, but the
> system I've benchmarked on seemed to behave.
> 
> I did a set of benchmarks on an 8-way Xeon in 64-bit mode. The
> system was loaded with an instance of bonnie++ pinned to processor
> 0, and all 8 processors were running a program doing (almost)
> adjacent rdtsc's. Bonnie++ causes interrupts and the latencies
> due to these show up as larger time intervals. Complete runs of
> bonnie++ in fast mode were sampled this way for a current -rc6
> kernel and an -rc6 kernel plus my patch. The total sampling time
> was 30 minutes for each run. Per kernel I did one run as a warm-up
> and another two runs to measure the latencies. The results for
> measured latencies between 5 and 1000 microseconds are shown in
> the attached graph. Above 1000 microseconds there is only one big
> contribution: at 40000 microseconds ;). The surface below the graph
> is a measure of time.
> 
> Observations (for this test load!):
> 
> Near 200, 250 and 350 microseconds, the peaks shift to longer
> latencies for the cs-changing code by about 10 microseconds,
> but the total time spent is pretty much constant.
> 
> The highest latencies for the cs-changing code are near 600
> and 650 microseconds. The highest latencies for the current
> code are near 800 and 850 microseconds.
> 
> The total surface of the graphs between 5 and 1000 microseconds
> is within an error estimate of 1% equal for both cases, and is
> about 0.69% of the total time.
> 
> Most time is spent measuring 'latencies' of less than 5 micro-
> seconds, since bonnie++ is taking only about 5% cpu time on a
> single cpu most of the time, and only up to 50% on a single cpu
> during a short time in the file creation benchmark.

I now did the benchmarks for the same -rc6 with hpa's 4-byte stubs
too. Same machine. It's significantly better than the other two
options in terms of speed. It takes about 7% less cpu to handle
the interrupts. (0.64% cpu instead of 0.69%.) I have to run now,
I'll let interpreting the histogram to someone else ;).

Greetings,
	Alexander



Download attachment "load.png" of type "image/png" (11850 bytes)