Open Source and information security mailing list archives
Date:	Thu, 21 May 2015 13:36:17 +0200
From:	Ingo Molnar <mingo@...nel.org>
To:	Denys Vlasenko <dvlasenk@...hat.com>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andy Lutomirski <luto@...capital.net>,
	Davidlohr Bueso <dave@...olabs.net>,
	Peter Anvin <hpa@...or.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Tim Chen <tim.c.chen@...ux.intel.com>,
	Borislav Petkov <bp@...en8.de>,
	Peter Zijlstra <peterz@...radead.org>,
	"Chandramouleeswaran, Aswin" <aswin@...com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Brian Gerst <brgerst@...il.com>,
	Paul McKenney <paulmck@...ux.vnet.ibm.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Jason Low <jason.low2@...com>,
	"linux-tip-commits@...r.kernel.org" 
	<linux-tip-commits@...r.kernel.org>,
	Arjan van de Ven <arjan@...radead.org>,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [RFC PATCH] x86/64: Optimize the effective instruction cache
 footprint of kernel functions


* Denys Vlasenko <dvlasenk@...hat.com> wrote:

> I was thinking about Ingo's AMD results:
> 
> linux-falign-functions=_64-bytes/res-amd.txt:        1.928409143 seconds time elapsed
> linux-falign-functions=__8-bytes/res-amd.txt:        1.940703051 seconds time elapsed
> linux-falign-functions=__1-bytes/res-amd.txt:        1.940744001 seconds time elapsed
> 
> AMD is almost perfect. Having no alignment at all still works very 
> well. [...]

Not quite. As I mentioned in my post, the 'time elapsed' numbers 
were very noisy in the AMD case - and you've cut off the stddev column 
that shows this. Here is the full data:

 linux-falign-functions=_64-bytes/res-amd.txt:        1.928409143 seconds time elapsed                                          ( +-  2.74% )
 linux-falign-functions=__8-bytes/res-amd.txt:        1.940703051 seconds time elapsed                                          ( +-  1.84% )
 linux-falign-functions=__1-bytes/res-amd.txt:        1.940744001 seconds time elapsed                                          ( +-  2.15% )

A 2-3% stddev against a 3.7% speedup is not conclusive.
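That claim can be sanity-checked directly: with these stddevs, the +-1-stddev intervals of the 64-byte and 1-byte runs overlap, so the elapsed-time difference is within the noise. A minimal sketch in plain Python (the numbers are copied from the listing above; the helper name is mine):

```python
# Elapsed-time means (seconds) and stddevs (percent) from the AMD runs above
runs = {
    "64-byte": (1.928409143, 2.74),
    "8-byte":  (1.940703051, 1.84),
    "1-byte":  (1.940744001, 2.15),
}

def overlaps(a, b):
    """True if the +-1-stddev intervals of two (mean, stddev%) runs overlap."""
    (m1, s1), (m2, s2) = a, b
    lo1, hi1 = m1 * (1 - s1 / 100), m1 * (1 + s1 / 100)
    lo2, hi2 = m2 * (1 - s2 / 100), m2 * (1 + s2 / 100)
    return max(lo1, lo2) <= min(hi1, hi2)

# The fastest and slowest configurations are statistically indistinguishable:
print(overlaps(runs["64-byte"], runs["1-byte"]))   # True
```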

What you should use instead are the cache-miss counts, which are a 
good proxy and a lot more statistically stable:

 linux-falign-functions=_64-bytes/res-amd.txt:        108,886,550      L1-icache-load-misses                                         ( +-  0.10% )  (100.00%)
 linux-falign-functions=__8-bytes/res-amd.txt:        123,810,566      L1-icache-load-misses                                         ( +-  0.18% )  (100.00%)
 linux-falign-functions=__1-bytes/res-amd.txt:        113,623,200      L1-icache-load-misses                                         ( +-  0.17% )  (100.00%)

which shows that 64-byte alignment still generates a better I$ layout 
than tight packing, resulting in ~4.3% fewer I$ misses.

On Intel it's more pronounced:

 linux-falign-functions=_64-bytes/res.txt:        647,853,942      L1-icache-load-misses                                         ( +-  0.07% )  (100.00%)
 linux-falign-functions=__1-bytes/res.txt:        724,539,055      L1-icache-load-misses                                         ( +-  0.31% )  (100.00%)

A 12% difference. Note that the Intel workload runs on SSDs, which 
makes the cache footprint several times larger, and it is also more 
realistic than the AMD test, which ran in tmpfs.
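For reference, both deltas fall out of the perf counts above (a quick check in plain Python; the exact percentage depends on which count is taken as the baseline - here it is the 64-byte-aligned run):

```python
# L1-icache-load-miss counts from the perf listings above
amd_aligned,   amd_tight   = 108_886_550, 113_623_200   # 64-byte vs 1-byte
intel_aligned, intel_tight = 647_853_942, 724_539_055

def extra_misses_pct(aligned, tight):
    """Extra I$ misses of tight packing, relative to the aligned baseline."""
    return (tight - aligned) / aligned * 100

print(f"AMD:   {extra_misses_pct(amd_aligned, amd_tight):.1f}%")     # ~4.4%
print(f"Intel: {extra_misses_pct(intel_aligned, intel_tight):.1f}%") # ~11.8%
```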

I think it's a fair bet that the AMD system would show a similar 
difference if it ran the same workload.

Allowing smaller functions to be cut in half by cacheline boundaries 
looks like a losing strategy, especially with larger workloads.

The modified scheme I suggested (64-byte alignment plus intelligent 
packing) might do even better than dumb 64-byte alignment.
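The packing idea can be illustrated with a toy placement model (hypothetical plain-Python sketch; the function sizes, helper names, and metrics are mine, not GCC's actual behavior): align a function to the next 64-byte boundary only when it would not fit in the remainder of the current cacheline.

```python
CACHELINE = 64

def place_tight(sizes):
    """-falign-functions=1: pack functions back to back."""
    offs, off = [], 0
    for s in sizes:
        offs.append(off)
        off += s
    return offs

def place_aligned(sizes, align=CACHELINE):
    """-falign-functions=64: every function starts on a cacheline boundary."""
    offs, off = [], 0
    for s in sizes:
        off = (off + align - 1) // align * align
        offs.append(off)
        off += s
    return offs

def place_smart(sizes):
    """Hypothetical 'intelligent packing': start at the current offset only
    if the function fits in the rest of the cacheline, else align up."""
    offs, off = [], 0
    for s in sizes:
        rem = -off % CACHELINE
        if 0 < rem < s:
            off += rem          # would cross the boundary: round up instead
        offs.append(off)
        off += s
    return offs

def lines_touched(offs, sizes):
    """Total number of distinct cachelines covered by function bodies."""
    lines = set()
    for o, s in zip(offs, sizes):
        lines.update(range(o // CACHELINE, (o + s - 1) // CACHELINE + 1))
    return len(lines)

def small_funcs_split(offs, sizes):
    """Functions that fit in one cacheline but straddle a boundary anyway."""
    return sum(1 for o, s in zip(offs, sizes)
               if s <= CACHELINE and o // CACHELINE != (o + s - 1) // CACHELINE)

sizes = [40, 24, 80, 16, 56, 120, 8]    # made-up function sizes, in bytes
for name, place in [("tight", place_tight),
                    ("64-byte", place_aligned),
                    ("smart", place_smart)]:
    offs = place(sizes)
    print(name, lines_touched(offs, sizes), small_funcs_split(offs, sizes))
```

On this made-up set, tight packing covers 6 cachelines but splits a small function across a boundary; dumb 64-byte alignment splits nothing but bloats the footprint to 9 lines; the smart scheme gets both: 6 lines and no split small functions.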

Thanks,

	Ingo