Message-ID: <51890D71.9080600@hp.com>
Date:	Tue, 07 May 2013 10:19:29 -0400
From:	Waiman Long <Waiman.Long@...com>
To:	Ingo Molnar <mingo@...nel.org>
CC:	Arnaldo Carvalho de Melo <acme@...stprotocols.net>,
	Jiri Olsa <jolsa@...hat.com>,
	Stephane Eranian <eranian@...gle.com>,
	Namhyung Kim <namhyung@...nel.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Paul Mackerras <paulus@...ba.org>,
	Ingo Molnar <mingo@...hat.com>, linux-kernel@...r.kernel.org,
	"Chandramouleeswaran, Aswin" <aswin@...com>,
	"Norton, Scott J" <scott.norton@...com>,
	Frédéric Weisbecker <fweisbec@...il.com>
Subject: Re: [PATCH] perf: fix symbol processing bug and greatly improve performance

On 05/07/2013 03:01 AM, Ingo Molnar wrote:
> * Waiman Long <Waiman.Long@...com> wrote:
>
>> When "perf record" was used on a large machine with a lot of CPUs,
>> the perf post-processing time could take a lot of minutes and even
>> hours depending on how large the resulting perf.data file was.
>>
>> While running the AIM7 1500-user high_systime workload on an 80-core
>> x86-64 system with a 3.9 kernel, the workload itself took about
>> 2 minutes to run and the perf.data file had a size of 1108.746 MB.
>> However, the post-processing step took more than 10 minutes.
>>
>> With a gprof-profiled perf binary, the time spent by perf was as
>> follows:
>>
>>    %   cumulative   self              self     total
>>   time   seconds   seconds    calls   s/call   s/call  name
>>   96.90    822.10   822.10   192156     0.00     0.00  dsos__find
>>    0.81    828.96     6.86 172089958     0.00     0.00  rb_next
>>    0.41    832.44     3.48 48539289     0.00     0.00  rb_erase
>>
>> So 97% (822 seconds) of the time was spent in a single dsos__find()
>> function. After analyzing the call-graph data below:
>>
>> -----------------------------------------------
>>                  0.00  822.12  192156/192156      map__new [6]
>> [7]     96.9    0.00  822.12  192156         vdso__dso_findnew [7]
>>                822.10    0.00  192156/192156      dsos__find [8]
>>                  0.01    0.00  192156/192156      dsos__add [62]
>>                  0.01    0.00  192156/192366      dso__new [61]
>>                  0.00    0.00       1/45282525     memdup [31]
>>                  0.00    0.00  192156/192230      dso__set_long_name [91]
>> -----------------------------------------------
>>                822.10    0.00  192156/192156      vdso__dso_findnew [7]
>> [8]     96.9  822.10    0.00  192156         dsos__find [8]
>> -----------------------------------------------
>>
>> It was found that the vdso__dso_findnew() function failed to locate
>> VDSO__MAP_NAME ("[vdso]") in the dso list and had to insert a new
>> entry at the end 192156 times. The problem is that a dso entry has
>> two types of names - a short name and a long name. The initial
>> dso__new() call sets both the short and the long name to "[vdso]".
>> After that, vdso__dso_findnew() modifies the long name to something
>> like /tmp/perf-vdso.so-NoXkDj. The dsos__find() function compares
>> only the long name. As a result, the same vdso entry was duplicated
>> many times in the dso list. This bug increases memory consumption
>> and slows symbol processing to a crawl.
>>
>> To resolve this problem, the dsos__find() function interface was
>> modified to allow searching by either the long name or the short
>> name. vdso__dso_findnew() now searches only the short name, while
>> the other call sites search by the long name as before.
>>
>> With this change, the CPU time of perf was reduced from 848.38s to
>> 15.77s, and dsos__find() accounted for only 0.06% of the total time.
>>
>>    0.06     15.73     0.01   192151     0.00     0.00  dsos__find
> Very nice!
>
> I noticed that you used gprof to instrument perf itself on a call graph
> level.
>
> Does this method of profiling perf via perf:
>
>    perf record -g perf report
>    perf report
>
> ... produce similarly useful call-graph instrumentation for you?
>
> If not or not quite then could you describe the differences? We could use
> that to further improve perf call-graph profiling.

Thanks for the comment.

The slowdown that I was trying to fix was in the "perf record" part of
the profiling process, not the "perf report" part. I didn't try
perf-record on perf-record because the performance counters are a
limited resource and I didn't want resource conflicts to affect the
results. Since the slowdown was entirely in user space, I decided to
use gprof to do the profiling.

One thing I noticed about the gprof instrumentation versus the perf
call graph was that inlined functions were properly identified by
gprof, but not by perf. This makes sense, as the compiler inserts
additional instrumentation code and data into the binary that gprof
can use; that is not the case for perf.

I also looked into possible performance bottlenecks in the perf-report
part of the process. Most of the time was spent in sorting and merging
the call-chain data, and there didn't seem to be any low-hanging fruit
for performance tuning.
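
For reference, the dsos__find() interface change described in the patch
is roughly of the following shape. This is only a simplified sketch with
stand-in types and field names (a plain next pointer instead of perf's
list_head), not the exact perf code:

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct dso {
	struct dso *next;        /* simplified link; perf uses struct list_head */
	const char *short_name;  /* e.g. "[vdso]" */
	const char *long_name;   /* e.g. "/tmp/perf-vdso.so-NoXkDj" */
};

struct dso *dsos__find(struct dso *head, const char *name, bool cmp_short)
{
	struct dso *pos;

	for (pos = head; pos; pos = pos->next) {
		const char *n = cmp_short ? pos->short_name : pos->long_name;

		/*
		 * Comparing only the long name was what made the "[vdso]"
		 * lookup fail once the long name had been rewritten to the
		 * temporary file path.
		 */
		if (strcmp(n, name) == 0)
			return pos;
	}
	return NULL;	/* the caller then allocates and adds a new dso */
}

With an interface like this, vdso__dso_findnew() would search with
cmp_short == true for "[vdso]", so rewriting the long name no longer
defeats the lookup, while the existing call sites pass false and keep
the old long-name comparison.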

Please let me know what other tests or data you would like me to gather.

Best regards,
Longman
