Message-ID: <20090305142254.GC27962@elte.hu>
Date:	Thu, 5 Mar 2009 15:22:54 +0100
From:	Ingo Molnar <mingo@...e.hu>
To:	Frederic Weisbecker <fweisbec@...il.com>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] tracing/function-graph-tracer: use the more
	lightweight local clock


* Frederic Weisbecker <fweisbec@...il.com> wrote:

> On Thu, Mar 05, 2009 at 12:56:52PM +0100, Ingo Molnar wrote:
> > 
> > * Frederic Weisbecker <fweisbec@...il.com> wrote:
> > 
> > > On Thu, Mar 05, 2009 at 11:56:52AM +0100, Ingo Molnar wrote:
> > > > 
> > > > * Frederic Weisbecker <fweisbec@...il.com> wrote:
> > > > 
> > > > > On Thu, Mar 05, 2009 at 08:30:13AM +0100, Peter Zijlstra wrote:
> > > > > > On Thu, 2009-03-05 at 02:19 +0100, Frederic Weisbecker wrote:
> > > > > > 
> > > > > > > It takes 1 ms to execute while tracing.
> > > > > > > Considering my timer frequency is 250 Hz, that means 1/4 of the CPU
> > > > > > > is spent in the timer interrupt while tracing.
> > > > > > > 
> > > > > > > For now the hang is fixed, but not the awful latency. And I'm just too
> > > > > > > frightened to test it at 1000 Hz.
> > > > > > > 
> > > > > > > But I plan to add a kind of watchdog to check how much time we spend
> > > > > > > inside an interrupt while graph tracing.
> > > > > > > By checking this time against the current HZ value, I could decide to
> > > > > > > abort the tracing for all irqs.
> > > > > > 
> > > > > > That would basically render the thing useless :-(
> > > > > 
> > > > > 
> > > > > It would only be for slow machines :-)
> > > > > I'm talking about something that happened on a Pentium II.
> > > > > 
> > > > >  
> > > > > > Is it specifically function_graph that is so expensive? If so, is that
> > > > > > because of the function exit hook?
> > > > > 
> > > > > 
> > > > > Yes, specifically the function_graph tracer; the function 
> > > > > tracer is not affected. The function graph tracer has more 
> > > > > than double the overhead of the function tracer.
> > > > > 
> > > > > The function tracer hooks directly into the function that 
> > > > > inserts the event; it's pretty straightforward.
> > > > > 
> > > > > The function graph does much more work: 
> > > > > 
> > > > > entry: basic checks, take the time, push the info onto the stack, insert
> > > > >        an event into the ring-buffer, hook the return.
> > > > > return: pop the info from the stack, insert an event into the ring-buffer,
> > > > >         jump back to the original caller.
> > > > > 
> > > > > It has a high cost... which makes me sad because I plan to 
> > > > > port it to ARM, and I fear the little ARM board I recently 
> > > > > purchased will not let me trace the interrupts without 
> > > > > hanging...
> > > > > :-(
> > > > > 
> > > > > I guess I should start thinking about some optimizations, 
> > > > > perhaps using perfcounters?
> > > > 
> > > > yeah. perfcounters and KernelTop might not work on a PII CPU 
> > > > out of the box though.
> > > > 
> > > > But hacking perfcounters and looking at perfstat/kerneltop 
> > > > output is a serious amount of fun, so if you are interested you 
> > > > could try to implement support for it. Do you have any box where 
> > > > perfcounters work? (that would be Core2 Intel boxes or pretty 
> > > > much any AMD box)
> > > > 
> > > > You could have a look at how oprofile works on your box - the 
> > > > code for PII CPUs should be in 
> > > > arch/x86/oprofile/op_model_ppro.c.
> > > > 
> > > > There's also hardcoded support for a single perfcounter in the 
> > > > nmi_watchdog=2 code, in arch/x86/kernel/cpu/perfctr-watchdog.c, 
> > > > for pretty much any x86 CPU that has a PMU.
> > > > 
> > > > Plus there's also the CPU documentation on Intel's site. It's 
> > > > quite well written and pretty well structured. The URL for the 
> > > > CPU's PMU ("Performance Monitoring") should be:
> > > > 
> > > >   http://download.intel.com/design/processor/manuals/253669.pdf
> > > > 
> > > > As a last resort ;-)
> > > > 
> > > > 	Ingo
> > > 
> > > Ah yes, that could be fun!
> > > So, from your description, it should work on my laptop I guess?
> > >
> > > -> Intel(R) Pentium(R) Dual  CPU  T2310  @ 1.46GHz
> > 
> > Yeah, should work fine there - so that should be a good 
> > reference point to start off. Let me know if you see any 
> > bugs/problems.
> > 
> > > Anyway, I will give it a try and see what I can do.
> > > Thanks for the pointers.
> > 
> > You are welcome.
> > 
> > 	Ingo
> 
> Oh, that's impressive:
> 
> $ ./perfstat /bin/echo 1
> 1
> 
>  Performance counter stats for '/bin/echo':
> 
>        5.681909  task clock ticks     (msecs)
> 
>               2  CPU migrations       (events)
>               9  context switches     (events)
>             422  pagefaults           (events)
>         4986950  CPU cycles           (events)
>         4870587  instructions         (events)
>           62881  cache references     (events)
>            4882  cache misses         (events)
> 
>  Wall-clock time elapsed:     9.046821 msecs
> 
> So I guess that for it to be useful for fine-grained profiling, 
> it's better to include the perfcounters syscalls inside the 
> application, to profile a single function for example?

Correct, it takes 4.8 million instructions to execute a simple 
/bin/echo on Linux ...

For your purpose of profiling the function tracer, kerneltop is a 
much more natural choice.

perfstat is best when used for performance/behavioral 
comparisons of a given workload. It gives flat numbers for the 
metrics it follows, which do not give you much of an idea about 
exactly where those extra instructions and the extra cost come 
from.

> I've read through a bit of the Intel documentation and the 
> perfcounters source code. It looks like the current implementations 
> for Intel and AMD share enough properties that they have been 
> factored into a single file.
> 
> The shared property is the use of an MSR, while the P6 family 
> uses the PMC.
> 
> Perhaps I can expand the struct pmc_x86_ops to indicate whether 
> we want to use rdmsr or rdpmc.
> 
> Hmm, it seems the implementation would be different enough to 
> deserve a new file for P6. We'll see...

Yeah, if there are not enough similarities then you can go for a 
separate perf_counters_p6.c file.

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
