Date:	Sun, 19 Apr 2009 16:38:49 -0700
From:	Jeremy Fitzhardinge <jeremy@...p.org>
To:	Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
CC:	Steven Rostedt <rostedt@...dmis.org>, Ingo Molnar <mingo@...e.hu>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Jeremy Fitzhardinge <jeremy.fitzhardinge@...rix.com>,
	Christoph Hellwig <hch@....de>,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH 1/4] tracing: move __DO_TRACE out of line

Mathieu Desnoyers wrote:
> Here are the conclusions I gather from the following tbench tests on the LTTng
> tree:
>
> - Dormant tracepoints, when sprinkled all over the place, have a very small, but
>   measurable, footprint on kernel stress-test workloads (3 % for the
>   whole 2.6.30-rc1 LTTng tree).
>
> - "Immediate values" help lessening this impact significantly (3 % -> 2.5 %).
>
> - Static jump patching would diminish the impact even more, but would require
>   gcc modifications to be acceptable. I did some prototypes using instruction
>   pattern matching in the past, but that approach was judged too complex.
>
> - I strongly recommend adding per-subsystem config-out options for heavy
>   users like kmemtrace or pvops. Compiling out the kmemtrace instrumentation
>   brings the performance impact down from a 2.5 % to a 1.9 % slowdown.
>
> - Putting the tracepoint out-of-line is a no-go, as it slows down *both* the
>   dormant (3 % -> 4.7 %) and the active (+20% to tracer overhead) tracepoints
>   compared to inline tracepoints.
>   

That's an interestingly counter-intuitive result.  Do you have any 
theories as to how this might happen?  The only mechanism I can think of is 
that, because the inline code sections are smaller, gcc is less inclined 
to put the if (unlikely()) code out of line, so the amount of hot-path 
code is higher.  But still, 1.7% is a massive increase in overhead, 
especially compared to the relative differences of the other changes.
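
Just so we're talking about the same shape of code, here is roughly what
I have in mind (a simplified sketch only; the names are made up, this is
not the actual __DO_TRACE or tracepoint macros):

/* Simplified sketch: made-up names, not the real tracepoint code. */
#define unlikely(x)	__builtin_expect(!!(x), 0)

extern int trace_foo_enabled;		/* hypothetical "tracepoint active" flag */
extern void __do_trace_foo(int arg);	/* hypothetical slow path, out of line */

/* Inline check: the dormant case is a single test-and-branch at the
 * call site, and unlikely() lets gcc push the call out of the hot path. */
static inline void trace_foo(int arg)
{
	if (unlikely(trace_foo_enabled))
		__do_trace_foo(arg);
}

/* Out-of-line check: the test lives inside the callee, so even the
 * dormant case pays for a call/return, argument setup and clobbered
 * caller-saved registers at every call site. */
extern void trace_foo_out_of_line(int arg);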

> Tracepoints all compiled-out :
>
> run 1 :                2091.50
> run 2 (after reboot) : 2089.50 (baseline)
> run 3 (after reboot) : 2083.61
>
> Dormant tracepoints :
>
> inline, no immediate value optimization
>
> run 1 :                1990.63
> run 2 (after reboot) : 2025.38 (3 %)
> run 3 (after reboot) : 2028.81
>
> out-of-line, no immediate value optimization
>
> run 1 :                1990.66
> run 2 (after reboot) : 1990.19 (4.7 %)
> run 3 (after reboot) : 1977.79
>
> inline, immediate value optimization
>
> run 1 :                2035.99 (2.5 %)
> run 2 (after reboot) : 2036.11
> run 3 (after reboot) : 2035.75
>
> inline, immediate value optimization, configuring out kmemtrace tracepoints
>
> run 1 :                2048.08 (1.9 %)
> run 2 (after reboot) : 2055.53
> run 3 (after reboot) : 2046.49
>   

So what are you doing here?  Are you doing 3 runs, then comparing the 
median measurement in each case?

The trouble is that your run-to-run variations are at least as large as 
the difference you're trying to detect.  For example, in run 1 of 
"inline, no immediate value optimization" you got 1990.6 MB/s throughput, 
and then runs 2 & 3 both went up to ~2025.  Why?  That's a huge jump.

The "out-of-line, no immediate value optimization" runs 1&2 has the same 
throughput as run 1 of the previous test, 1990MB/s, while run 3 is a bit 
worse.  OK, so perhaps its slower.  But why are runs 1&2 more or less 
identical to inline/run1?

What would happen if you did 10 iterations of these tests?  
There just seems to be too much run-to-run variation for 3 runs to be 
statistically meaningful.
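
For what it's worth, just putting a number on the spread would help;
something as trivial as this (a throwaway sketch, with the numbers
copied from the runs you quoted above):

#include <math.h>
#include <stdio.h>

/* Mean and sample standard deviation of a handful of tbench runs. */
static void stats(const char *name, const double *x, int n)
{
	double sum = 0.0, var = 0.0, mean;
	int i;

	for (i = 0; i < n; i++)
		sum += x[i];
	mean = sum / n;
	for (i = 0; i < n; i++)
		var += (x[i] - mean) * (x[i] - mean);
	var /= n - 1;			/* sample variance, n-1 denominator */

	printf("%-30s mean %8.2f  stddev %6.2f MB/s\n", name, mean, sqrt(var));
}

int main(void)
{
	double compiled_out[]  = { 2091.50, 2089.50, 2083.61 };
	double inline_no_imv[] = { 1990.63, 2025.38, 2028.81 };

	stats("tracepoints compiled out", compiled_out, 3);
	stats("inline, no immediate values", inline_no_imv, 3);
	return 0;
}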

I'm not picking on you personally; I had exactly the same 
problems when trying to benchmark the overhead of pvops.  The 
reboot/rerun variations were at least as large as the effects I was trying 
to measure, and I'm left feeling suspicious of all the results.

I think there's something fundamentally off about this kind of 
kernel benchmark methodology.  The results are not stable and are not, 
I think, reliable.  Unfortunately I don't have enough of a background 
in statistics to really analyze what's going on here, or to say how we should 
change the test/measurement methodology to get results that we can 
really stand by.
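
One direction I can imagine (an assumption on my part, not something
I've validated) is to do N runs of each configuration and only trust
differences whose Welch t statistic is comfortably large; something
like:

#include <math.h>

/* Welch t statistic for two sets of runs; with ~10 runs per side,
 * |t| well above ~2 suggests the difference is more than run-to-run
 * noise.  Illustrative only, not a rigorous significance test. */
double welch_t(const double *a, int na, const double *b, int nb)
{
	double ma = 0.0, mb = 0.0, va = 0.0, vb = 0.0;
	int i;

	for (i = 0; i < na; i++) ma += a[i];
	for (i = 0; i < nb; i++) mb += b[i];
	ma /= na;
	mb /= nb;
	for (i = 0; i < na; i++) va += (a[i] - ma) * (a[i] - ma);
	for (i = 0; i < nb; i++) vb += (b[i] - mb) * (b[i] - mb);
	va /= na - 1;			/* sample variances */
	vb /= nb - 1;

	return (ma - mb) / sqrt(va / na + vb / nb);
}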

I don't even have a good explanation for why there are such large 
boot-to-boot variations in the first place.  The usual explanation is "cache 
effects", but what is actually changing here?  The kernel image is 
identical, loaded into the same physical pages each time, and mapped 
at the same virtual addresses.  So the I- and D-caches and the TLB should see 
exactly the same access patterns for the kernel code itself.  The 
dynamically allocated memory is going to vary, and have different cache 
interactions, but is that enough to explain these kinds of variations?  
If so, we're going to need to do a lot more iterations to see any signal 
from our actual changes over the noise that "cache effects" are throwing 
our way...

    J