lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20070601155413.GA1216@Krystal>
Date:	Fri, 1 Jun 2007 11:54:13 -0400
From:	Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	linux-kernel@...r.kernel.org
Subject: Re: [patch 9/9] Scheduler profiling - Use conditional calls

* Andrew Morton (akpm@...ux-foundation.org) wrote:
> On Wed, 30 May 2007 10:00:34 -0400
> Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca> wrote:
> 
> > @@ -2990,7 +2991,8 @@
> >  			print_irqtrace_events(prev);
> >  		dump_stack();
> >  	}
> > -	profile_hit(SCHED_PROFILING, __builtin_return_address(0));
> > +	cond_call(profile_on,
> > +		profile_hit(SCHED_PROFILING, __builtin_return_address(0)));
> >  
> 
> That's looking pretty neat.  Do you have any before-and-after performance
> figures for i386 and for a non-optimised architecture?

Sure, here is the result of a small test comparing:
1 - Branch depending on a cache miss (has to fetch in memory, caused by a 128
    bytes stride)). This is the test that is likely to look like what
    side-effect the original profile_hit code was causing, under the
    assumption that the kernel is already using L1 and L2 caches at
    their full capacity and that a supplementary data load would cause
    cache trashing.
2 - Branch depending on L1 cache hit. Just for comparison.
3 - Branch depending on a load immediate in the instruction stream.

It has been compiled with gcc -O2. Tests done on a 3GHz P4.

In the first test series, the branch is not taken:

number of tests : 1000
number of branches per test : 81920
memory hit cycles per iteration (mean) : 48.252
L1 cache hit cycles per iteration (mean) : 16.1693
instruction stream based test, cycles per iteration (mean) : 16.0432


In the second test series, the branch is taken and an integer is
incremented within the block:

number of tests : 1000
number of branches per test : 81920
memory hit cycles per iteration (mean) : 48.2691
L1 cache hit cycles per iteration (mean) : 16.396
instruction stream based test, cycles per iteration (mean) : 16.0441

Therefore, the memory fetch based test seems to be 200% slower than the
load immediate based test.

(I am adding these results to the documentation)

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ