Message-ID: <CAOGi=dNXH2=dnqpjUEnLE8wkzQLdkEkBfQ3fBeq1jY71TFL8Gg@mail.gmail.com>
Date: Mon, 31 Dec 2012 15:52:49 +0800
From: Ling Ma <ling.ma.program@...il.com>
To: mingo@...e.hu
Cc: hpa@...or.com, tglx@...utronix.de, linux-kernel@...r.kernel.org
Subject: Re: [Suggestion] [x86]: Compiler Option Os is better on latest x86
Hi Ingo,
With netperf we also double-checked on an older Nehalem platform, as below:
O2 NHM
Performance counter stats for 'netperf' (3 runs):
      3779.262214 task-clock                #    0.378 CPUs utilized            ( +-  0.37% )
           47,580 context-switches          #    0.013 M/sec                    ( +-  0.59% )
                0 cpu-migrations            #    0.000 K/sec
              321 page-faults               #    0.085 K/sec                    ( +-  0.18% )
    8,885,976,365 cycles                    #    2.351 GHz                      ( +-  0.37% )
    4,572,094,199 stalled-cycles-frontend   #   51.45% frontend cycles idle     ( +-  1.27% )
    1,347,935,497 stalled-cycles-backend    #   15.17% backend cycles idle      ( +-  2.02% )
    6,564,928,770 instructions              #    0.74  insns per cycle
                                            #    0.70  stalled cycles per insn  ( +-  0.33% )
    1,196,254,990 branches                  #  316.531 M/sec                    ( +-  0.33% )
        6,434,145 branch-misses             #    0.54% of all branches          ( +-  0.42% )

     10.009993130 seconds time elapsed                                          ( +-  0.04% )
87380 16384 16384 10.00 16727.94
Os NHM
Performance counter stats for 'netperf' (3 runs):
      3793.965782 task-clock                #    0.379 CPUs utilized            ( +-  0.24% )
           59,124 context-switches          #    0.016 M/sec                    ( +-  0.02% )
                0 cpu-migrations            #    0.000 K/sec
              321 page-faults               #    0.085 K/sec                    ( +-  0.21% )
    8,878,307,926 cycles                    #    2.340 GHz                      ( +-  0.25% )
    4,717,512,228 stalled-cycles-frontend   #   53.14% frontend cycles idle     ( +-  0.56% )
    1,612,028,376 stalled-cycles-backend    #   18.16% backend cycles idle      ( +-  0.58% )
    6,273,760,790 instructions              #    0.71  insns per cycle
                                            #    0.75  stalled cycles per insn  ( +-  0.02% )
    1,144,007,254 branches                  #  301.533 M/sec                    ( +-  0.02% )
       11,348,742 branch-misses             #    0.99% of all branches          ( +-  0.66% )

     10.006341837 seconds time elapsed                                          ( +-  0.00% )
During the same elapsed time, IPC with O2 is 0.74 versus 0.71 with Os, so on
this Nehalem machine O2 performs about 4% better.
The above result confirms our expectation: O2 is better than Os on Nehalem
because Nehalem still fetches instructions through the legacy front end, where
-falign-functions, -falign-jumps, -falign-loops and -falign-labels help
front-end throughput, whereas Os wins on Sandy Bridge thanks to its decoded
(uop) cache.
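For reference, this is the kind of measurement used above; a minimal sketch
only, the target host address and run length are assumed for illustration (the
original runs only show the perf output):

    # Repeat the benchmark 3 times; perf stat's default counter set already
    # includes cycles, instructions, stalled-cycles-frontend/backend, branches.
    # Host 192.168.0.2 and the 10-second length are illustrative values.
    perf stat -r 3 netperf -H 192.168.0.2 -l 10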
Any comments are appreciated.
Thanks & best wishes for the coming year!
Ling
2012/12/31, ling.ma.program@...il.com <ling.ma.program@...il.com>:
> From: Ma Ling <ling.ml@...pay.com>
>
> Currently we use O2 as the compiler option for better performance, although
> it enlarges code size. On modern CPUs the larger instruction and unified
> caches and sophisticated instruction prefetch reduce the cost of instruction
> cache misses, and flags such as -falign-functions, -falign-jumps,
> -falign-loops and -falign-labels are very helpful for CPU front-end
> throughput, because the CPU fetches instructions in aligned 16-byte code
> blocks each cycle.
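Which -falign-* options each optimization level actually enables can be
listed with gcc itself; a minimal sketch, the exact set depends on the gcc
version and target:

    # Compare the alignment-related optimizer flags turned on at -O2 vs -Os
    gcc -O2 -Q --help=optimizers | grep falign
    gcc -Os -Q --help=optimizers | grep falign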
>
> In order to save power and get higher performance, Sandy Bridge
> starts to introduce decoded-cache, instructions will be kept in it
> after decode stage. When CPU refetches the instruction, decoded cache could
> provide 32 aligned-bytes instruction block, instead of 16 bytes from
> I-cache,
> fewer branch miss penalty resulted from shorter pipeline. It requires hot
> code should be put into decoded cache as possible we can. Sandy Bridge,
> Ivy Bridge, and Haswell all implemented this feature, Os-Optimize for size
> should be better than O2 on them.
>
> Based on the above reasoning, we compiled Linux kernel 3.6.9 with O2 and Os
> respectively. The results show that Os improves netperf performance by 4.8%
> and volano by 2.7%, as shown below.
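The two builds differ only in the optimization level; a minimal sketch of how
that is selected in kbuild, assuming a stock 3.6.9 tree (the -Os build is
controlled by CONFIG_CC_OPTIMIZE_FOR_SIZE):

    # -Os build: enable CC_OPTIMIZE_FOR_SIZE; leave it disabled for the -O2 build.
    make defconfig
    ./scripts/config --enable CC_OPTIMIZE_FOR_SIZE
    make oldconfig
    make -j"$(nproc)"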
>
> O2 + netperf
> Performance counter stats for 'netperf' (3 runs):
>
> 5416.157986 task-clock # 0.541 CPUs utilized
> ( +- 0.19% )
> 348,249 context-switches # 0.064 M/sec
> ( +- 0.17% )
> 0 CPU-migrations # 0.000 M/sec
> ( +- 0.00% )
> 353 page-faults # 0.000 M/sec
> ( +- 0.16% )
> 13,166,254,384 cycles # 2.431 GHz
> ( +- 0.18% )
> 8,827,499,807 stalled-cycles-frontend # 67.05% frontend cycles idle
> ( +- 0.29% )
> 5,951,234,060 stalled-cycles-backend # 45.20% backend cycles idle
> ( +- 0.44% )
> 8,122,481,914 instructions # 0.62 insns per cycle
> # 1.09 stalled cycles per insn ( +- 0.17% )
> 1,415,864,138 branches # 261.415 M/sec
> ( +- 0.17% )
> 16,975,308 branch-misses # 1.20% of all branches
> ( +- 0.61% )
>
> 10.007215371 seconds time elapsed
> ( +- 0.03% )
>
> Os + netperf
>
> Performance counter stats for 'netperf' (3 runs):
>
> 5395.386704 task-clock # 0.539 CPUs utilized
> ( +- 0.14% )
> 345,880 context-switches # 0.064 M/sec
> ( +- 0.25% )
> 0 CPU-migrations # 0.000 M/sec
> ( +- 0.00% )
> 354 page-faults # 0.000 M/sec
> ( +- 0.00% )
> 13,142,706,297 cycles # 2.436 GHz
> ( +- 0.23% )
> 8,379,382,641 stalled-cycles-frontend # 63.76% frontend cycles idle
> ( +- 0.50% )
> 5,513,722,219 stalled-cycles-backend # 41.95% backend cycles idle
> ( +- 0.71% )
> 8,554,202,795 instructions # 0.65 insns per cycle
> # 0.98 stalled cycles per insn ( +- 0.25% )
> 1,530,020,505 branches # 283.579 M/sec
> ( +- 0.25% )
> 17,710,406 branch-misses # 1.16% of all branches
> ( +- 1.00% )
>
> 10.004859867 seconds time elapsed
>
> During the same time (10.004859867 seconds) IPC with Os is 0.65 and with O2
> is 0.62, so Os improves performance by about 4.8%.
>
> O2 + volano
> Performance counter stats for './loopclient.sh openjdk' (3 runs):
>
> 210627.115313 task-clock # 0.781 CPUs utilized
> ( +- 0.92% )
> 13,812,610 context-switches # 0.066 M/sec
> ( +- 0.17% )
> 2,352,755 CPU-migrations # 0.011 M/sec
> ( +- 0.84% )
> 208,333 page-faults # 0.001 M/sec
> ( +- 1.58% )
> 525,627,073,405 cycles # 2.496 GHz
> ( +- 0.96% )
> 428,177,571,365 stalled-cycles-frontend # 81.46% frontend cycles idle
> ( +- 1.09% )
> 370,885,224,739 stalled-cycles-backend # 70.56% backend cycles idle
> ( +- 1.18% )
> 187,662,577,544 instructions # 0.36 insns per cycle
> # 2.28 stalled cycles per insn ( +- 0.31% )
> 35,684,976,425 branches # 169.423 M/sec
> ( +- 0.45% )
> 1,062,086,942 branch-misses # 2.98% of all branches
> ( +- 0.08% )
>
> 269.764578435 seconds time elapsed
>
> Os + volano
> Performance counter stats for './loopclient.sh openjdk' (3 runs):
>
> 209545.786941 task-clock # 0.778 CPUs utilized
> ( +- 0.66% )
> 13,864,142 context-switches # 0.066 M/sec
> ( +- 0.29% )
> 2,326,826 CPU-migrations # 0.011 M/sec
> ( +- 0.83% )
> 205,575 page-faults # 0.001 M/sec
> ( +- 2.63% )
> 523,366,588,452 cycles # 2.498 GHz
> ( +- 0.75% )
> 419,200,472,430 stalled-cycles-frontend # 80.10% frontend cycles idle
> ( +- 0.86% )
> 362,044,374,737 stalled-cycles-backend # 69.18% backend cycles idle
> ( +- 0.96% )
> 193,274,857,837 instructions # 0.37 insns per cycle
> # 2.17 stalled cycles per insn ( +- 0.51% )
> 37,657,832,686 branches # 179.712 M/sec
> ( +- 0.42% )
> 1,061,005,300 branch-misses # 2.82% of all branches
> ( +- 0.86% )
>
> 269.410275674 seconds time elapsed
> ( +- 0.06% )
>
> During the same time (269.410275674 seconds) IPC with Os is 0.37 and with O2
> is 0.36, so Os improves performance by about 2.7%.
>
> So our initial conclusion is that Os is better than O2 for current and
> upcoming x86 CPUs.
> If I am wrong, please correct me.
>
> Thanks
> Ling
>