Message-ID: <20080813153800.GF5853@Krystal>
Date: Wed, 13 Aug 2008 11:38:00 -0400
From: Mathieu Desnoyers <compudj@...stal.dyndns.org>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Jeremy Fitzhardinge <jeremy@...p.org>,
Andi Kleen <andi@...stfloor.org>,
LKML <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...e.hu>,
Thomas Gleixner <tglx@...utronix.de>,
Peter Zijlstra <peterz@...radead.org>,
Andrew Morton <akpm@...ux-foundation.org>,
David Miller <davem@...emloft.net>,
Roland McGrath <roland@...hat.com>,
Ulrich Drepper <drepper@...hat.com>,
Rusty Russell <rusty@...tcorp.com.au>,
Gregory Haskins <ghaskins@...ell.com>,
Arnaldo Carvalho de Melo <acme@...hat.com>,
"Luis Claudio R. Goncalves" <lclaudio@...g.org>,
Clark Williams <williams@...hat.com>
Subject: Re: [PATCH 0/5] ftrace: to kill a daemon
* Mathieu Desnoyers (mathieu.desnoyers@...ymtl.ca) wrote:
> * Steven Rostedt (rostedt@...dmis.org) wrote:
> >
> > On Fri, 8 Aug 2008, Linus Torvalds wrote:
> > >
> > >
> > > On Fri, 8 Aug 2008, Jeremy Fitzhardinge wrote:
> > > >
> > > > Steven Rostedt wrote:
> > > > > I wish we had a true 5 byte nop.
> > > >
> > > > 0x66 0x66 0x66 0x66 0x90
> > >
> > > I don't think so. Multiple redundant prefixes can be really expensive on
> > > some uarchs.
> > >
> > > A no-op that isn't cheap isn't a no-op at all, it's a slow-op.
> >
> >
> > A quick meaningless benchmark showed a slight performance hit.
> >
>
> Hi Steven,
>
> I tried to run my own tests to see whether these numbers are actually
> meaningful at all. My results seem to show that there is no significant
> difference between the various configurations; the only tendency I see
> is that the 2-byte jump (offset 0x03) would be slightly faster than the
> 3/2 nop on Intel Xeon. But we would have to run these a few more times
> to confirm that, I guess.
>
> I am just trying to get a sense of whether we are really trying hard to
> optimize something worthless in practice, and to me it looks like it.
> But it could be the architecture I am using that brings these results.
>
> Mathieu
>
> Intel Xeon dual quad-core
> Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
>
> 3/2 nop used :
> K8_NOP3 K8_NOP2
> #define K8_NOP2 ".byte 0x66,0x90\n"
> #define K8_NOP3 ".byte 0x66,0x66,0x90\n"
>
Small correction : my architecture uses P6_NOP5, which is an atomic
5-byte nop (I just looked at the runtime output of find_nop_table()).
5: nopl 0x00(%eax,%eax,1)
#define P6_NOP5 ".byte 0x0f,0x1f,0x44,0x00,0\n"
But the results stand. Maybe I should try to force a test with a
K8_NOP3 K8_NOP2 nop.
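
For reference, the byte sequences under discussion can be laid out side
by side. This is only an illustrative sketch: the nop bytes are copied
from the defines quoted in this mail, the jump bytes assume the standard
x86 encodings (0xEB for jmp rel8, 0xE9 for jmp rel32), and the names
CALL_SITE_SIZE and fills_call_site() are hypothetical helpers, not
kernel symbols.

```python
# Nop encodings, copied from the defines quoted in this thread.
K8_NOP2 = bytes([0x66, 0x90])
K8_NOP3 = bytes([0x66, 0x66, 0x90])
P6_NOP5 = bytes([0x0F, 0x1F, 0x44, 0x00, 0x00])  # nopl 0x00(%eax,%eax,1)

# Assumption: standard x86 jump encodings (EB = jmp rel8, E9 = jmp rel32).
JMP_SHORT_OVER_3 = bytes([0xEB, 0x03])                     # 2-byte jump, offset 0x03
JMP_NEAR_OFFSET_0 = bytes([0xE9, 0x00, 0x00, 0x00, 0x00])  # 5-byte jump, offset 0x00

# Hypothetical constant: size in bytes of the call instruction being patched out.
CALL_SITE_SIZE = 5


def fills_call_site(*pieces):
    """Return True if the concatenated pieces are exactly CALL_SITE_SIZE bytes."""
    return sum(len(piece) for piece in pieces) == CALL_SITE_SIZE


# The "3/2 nop" is two instructions; P6_NOP5 is a single instruction.
print("3/2 nop  :", (K8_NOP3 + K8_NOP2).hex(" "), "(2 instructions)")
print("P6_NOP5  :", P6_NOP5.hex(" "), "(1 instruction)")
print("jmp short:", JMP_SHORT_OVER_3.hex(" "), "+ 3 skipped bytes")
print("jmp near :", JMP_NEAR_OFFSET_0.hex(" "))
```

The point of the comparison is that only the two-part 3/2 nop spans an
instruction boundary inside the 5 bytes; the other three candidates
start with a single instruction at the patched address.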
Mathieu
> ** Summary **
>
> Test A : make -j20 2.6.27-rc2 kernel (real time)
>                                         Avg.      std. dev.
> Case 1 : ftrace not compiled-in.        1m9.76s   0.41s
> Case 2 : 3/2 nops                       1m9.95s   0.36s
> Case 3 : 2-bytes jump, offset 0x03      1m9.10s   0.40s
> Case 4 : 5-bytes jump, offset 0x00      1m9.25s   0.34s
>
> Test B : hackbench 15
>                                         Avg.      std. dev.
> Case 1 : ftrace not compiled-in.        0.349s    0.007s
> Case 2 : 3/2 nops                       0.351s    0.014s
> Case 3 : 2-bytes jump, offset 0x03      0.350s    0.007s
> Case 4 : 5-bytes jump, offset 0x00      0.351s    0.010s
>
>
>
> ** Detail **
>
> * Test A
>
> benchmark : make -j20 2.6.27-rc2 kernel
> make clean; make -j20; make clean done before the tests to prime caches.
> Same .config used.
>
>
> Case 1 : ftrace not compiled-in.
>
> real 1m9.980s
> user 7m27.664s
> sys 0m48.771s
>
> real 1m9.330s
> user 7m27.244s
> sys 0m50.567s
>
> real 1m9.393s
> user 7m27.408s
> sys 0m50.511s
>
> real 1m9.674s
> user 7m28.088s
> sys 0m50.327s
>
> real 1m10.441s
> user 7m27.736s
> sys 0m49.687s
>
> real time
> average : 1m9.76s
> std. dev. : 0.41s
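
The average and standard deviation above can be recomputed from the five
"real" times. A minimal sketch, assuming the times are parsed from the
`time` output format shown above (to_seconds() is a hypothetical helper).
The reported 0.41s appears to correspond to the population formula
(divide by N); the sample formula (divide by N-1) gives a slightly
larger value.

```python
import statistics


def to_seconds(stamp):
    """Convert a `time`-style stamp such as '1m9.980s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# "real" times of the five Case 1 runs listed above.
case1_real = ["1m9.980s", "1m9.330s", "1m9.393s", "1m9.674s", "1m10.441s"]
samples = [to_seconds(stamp) for stamp in case1_real]

mean = statistics.mean(samples)
pstdev = statistics.pstdev(samples)  # population std. dev. (divide by N)
stdev = statistics.stdev(samples)    # sample std. dev. (divide by N - 1)

print(f"average               : {mean:.2f}s")
print(f"std. dev. (population): {pstdev:.2f}s")
print(f"std. dev. (sample)    : {stdev:.2f}s")
```

With only five runs per case, the choice between the two formulas
changes the spread by roughly 10%, which is worth keeping in mind when
comparing averages that differ by less than one standard deviation.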
>
> after a reboot with the same kernel :
>
> real 1m8.758s
> user 7m26.012s
> sys 0m48.835s
>
> real 1m11.035s
> user 7m26.432s
> sys 0m49.171s
>
> real 1m9.834s
> user 7m25.768s
> sys 0m49.167s
>
>
> Case 2 : 3/2 nops
>
> real 1m9.713s
> user 7m27.524s
> sys 0m48.315s
>
> real 1m9.481s
> user 7m27.144s
> sys 0m48.587s
>
> real 1m10.565s
> user 7m27.048s
> sys 0m48.715s
>
> real 1m10.008s
> user 7m26.436s
> sys 0m49.295s
>
> real 1m9.982s
> user 7m27.160s
> sys 0m48.667s
>
> real time
> avg : 1m9.95s
> std. dev. : 0.36s
>
>
> Case 3 : 2-bytes jump, offset 0x03
>
> real 1m9.158s
> user 7m27.108s
> sys 0m48.775s
>
> real 1m9.159s
> user 7m27.320s
> sys 0m48.659s
>
> real 1m8.390s
> user 7m27.976s
> sys 0m48.359s
>
> real 1m9.143s
> user 7m26.624s
> sys 0m48.719s
>
> real 1m9.642s
> user 7m26.228s
> sys 0m49.483s
>
> real time
> avg : 1m9.10s
> std. dev. : 0.40s
>
> one extra after reboot with same kernel :
>
> real 1m8.855s
> user 7m27.372s
> sys 0m48.543s
>
>
> Case 4 : 5-bytes jump, offset 0x00
>
> real 1m9.173s
> user 7m27.228s
> sys 0m48.151s
>
> real 1m9.735s
> user 7m26.852s
> sys 0m48.499s
>
> real 1m9.502s
> user 7m27.148s
> sys 0m48.107s
>
> real 1m8.727s
> user 7m27.416s
> sys 0m48.071s
>
> real 1m9.115s
> user 7m26.932s
> sys 0m48.727s
>
> real time
> avg : 1m9.25s
> std. dev. : 0.34s
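
One way to put a number on "no significant difference" for the Test A
runs is Welch's t statistic, which compares two averages without
assuming equal variances. This is a sketch under the assumption that the
five runs of each case are independent samples; welch_t() is a
hypothetical helper written for this mail, and the run times (in
seconds) are the "real" values of Case 1 and Case 2 above.

```python
import math
import statistics


def welch_t(a, b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    n_a, n_b = len(a), len(b)
    se2 = var_a / n_a + var_b / n_b  # squared standard error of the difference
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se2 ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    return t, df


# "real" times in seconds, taken from the Test A runs above.
case1 = [69.980, 69.330, 69.393, 69.674, 70.441]  # ftrace not compiled-in
case2 = [69.713, 69.481, 70.565, 70.008, 69.982]  # 3/2 nops

t, df = welch_t(case2, case1)
print(f"t = {t:.2f}, df = {df:.1f}")
```

For about 8 degrees of freedom, |t| has to exceed roughly 2.3 before a
difference is usually called significant at the 5% level. The same
function can be applied to any other pair of cases, keeping in mind that
five runs per case and the spread seen after a reboot make any such
figure a rough indication only.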
>
>
> * Test B
>
> Hackbench
>
> Case 1 : ftrace not compiled-in.
>
> ./hackbench 15
> Time: 0.358
> ./hackbench 15
> Time: 0.342
> ./hackbench 15
> Time: 0.354
> ./hackbench 15
> Time: 0.338
> ./hackbench 15
> Time: 0.347
>
> Average : 0.349
> std. dev. : 0.007
>
> Case 2 : 3/2 nops
>
> ./hackbench 15
> Time: 0.328
> ./hackbench 15
> Time: 0.368
> ./hackbench 15
> Time: 0.351
> ./hackbench 15
> Time: 0.343
> ./hackbench 15
> Time: 0.366
>
> Average : 0.351
> std. dev. : 0.014
>
> Case 3 : 2-bytes jump, offset 0x03
>
> ./hackbench 15
> Time: 0.346
> ./hackbench 15
> Time: 0.359
> ./hackbench 15
> Time: 0.356
> ./hackbench 15
> Time: 0.350
> ./hackbench 15
> Time: 0.340
>
> Average : 0.350
> std. dev. : 0.007
>
> Case 4 : 5-bytes jump, offset 0x00
>
> ./hackbench 15
> Time: 0.346
> ./hackbench 15
> Time: 0.346
> ./hackbench 15
> Time: 0.364
> ./hackbench 15
> Time: 0.362
> ./hackbench 15
> Time: 0.338
>
> Average : 0.351
> std. dev. : 0.010
>
>
> Hardware used :
>
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 23
> model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
> stepping : 6
> cpu MHz : 2000.114
> cache size : 6144 KB
> physical id : 0
> siblings : 4
> core id : 0
> cpu cores : 4
> apicid : 0
> initial apicid : 0
> fpu : yes
> fpu_exception : yes
> cpuid level : 10
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm
> constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx tm2 ssse3
> cx16 xtpr dca sse4_1 lahf_lm
> bogomips : 4000.22
> clflush size : 64
> cache_alignment : 64
> address sizes : 38 bits physical, 48 bits virtual
> power management:
>
> (7 other similar cpus)
>
>
> > Here's 10 runs of "hackbench 50" using the two part 5 byte nop:
> >
> > run 1
> > Time: 4.501
> > run 2
> > Time: 4.855
> > run 3
> > Time: 4.198
> > run 4
> > Time: 4.587
> > run 5
> > Time: 5.016
> > run 6
> > Time: 4.757
> > run 7
> > Time: 4.477
> > run 8
> > Time: 4.693
> > run 9
> > Time: 4.710
> > run 10
> > Time: 4.715
> > avg = 4.6509
> >
> >
> > And 10 runs using the above 5 byte nop:
> >
> > run 1
> > Time: 4.832
> > run 2
> > Time: 5.319
> > run 3
> > Time: 5.213
> > run 4
> > Time: 4.830
> > run 5
> > Time: 4.363
> > run 6
> > Time: 4.391
> > run 7
> > Time: 4.772
> > run 8
> > Time: 4.992
> > run 9
> > Time: 4.727
> > run 10
> > Time: 4.825
> > avg = 4.8264
> >
> > # cat /proc/cpuinfo
> > processor : 0
> > vendor_id : AuthenticAMD
> > cpu family : 15
> > model : 65
> > model name : Dual-Core AMD Opteron(tm) Processor 2220
> > stepping : 3
> > cpu MHz : 2799.992
> > cache size : 1024 KB
> > physical id : 0
> > siblings : 2
> > core id : 0
> > cpu cores : 2
> > apicid : 0
> > initial apicid : 0
> > fdiv_bug : no
> > hlt_bug : no
> > f00f_bug : no
> > coma_bug : no
> > fpu : yes
> > fpu_exception : yes
> > cpuid level : 1
> > wp : yes
> > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> > cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> > rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic
> > cr8_legacy
> > bogomips : 5599.98
> > clflush size : 64
> > power management: ts fid vid ttp tm stc
> >
> > There's 4 of these.
> >
> > Just to make sure, I ran the above nop test again:
> >
> > [ this is reverse from the above runs ]
> >
> > run 1
> > Time: 4.723
> > run 2
> > Time: 5.080
> > run 3
> > Time: 4.521
> > run 4
> > Time: 4.841
> > run 5
> > Time: 4.696
> > run 6
> > Time: 4.946
> > run 7
> > Time: 4.754
> > run 8
> > Time: 4.717
> > run 9
> > Time: 4.905
> > run 10
> > Time: 4.814
> > avg = 4.7997
> >
> > And again the two part nop:
> >
> > run 1
> > Time: 4.434
> > run 2
> > Time: 4.496
> > run 3
> > Time: 4.801
> > run 4
> > Time: 4.714
> > run 5
> > Time: 4.631
> > run 6
> > Time: 5.178
> > run 7
> > Time: 4.728
> > run 8
> > Time: 4.920
> > run 9
> > Time: 4.898
> > run 10
> > Time: 4.770
> > avg = 4.757
> >
> >
> > This time it was close, but still seems to have some difference.
> >
> > heh, perhaps it's just noise.
> >
> > -- Steve
> >
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68