linux-kernel - Re: callchain sampling bug in perf?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20100821024713.GD7959@nowhere>
Date:	Sat, 21 Aug 2010 04:47:14 +0200
From:	Frederic Weisbecker <fweisbec@...il.com>
To:	Christoph Hellwig <hch@...radead.org>
Cc:	Arnaldo Carvalho de Melo <acme@...radead.org>,
	linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...e.hu>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>
Subject: Re: callchain sampling bug in perf?

On Fri, Aug 20, 2010 at 05:16:45AM -0400, Christoph Hellwig wrote:
> On Thu, Aug 19, 2010 at 12:04:22PM -0300, Arnaldo Carvalho de Melo wrote:
> > 
> > [acme@...pio tmp]$ perf report
> > 
> > http://vger.kernel.org/~acme/perf/perf-report-tui-callchain-xlog_sync.png
> > 
> > So it seems to work (you tell me if the callchains make sense), and the problem
> 
> I've rebuilt perf with libnewt to reproduce it, but to get any of the
> callchain data I need to call perf report with a -g argument (which is
> rather expected from the documentation anyway)
> 
> I still see the same problems as with the TUI perf report with that.
> With the -g {mode},0.0 there is nothing to expand inside the GUI for
  ^^
  you meant "without", right?

> e.g. the pythong process, and with the 0.0 threshold I can only expand
> a few 0.<something> callchains, but I never see the 80% your screenshot
> shows.  What perf version are you running?



Right. So this is due to the roots of the callchains that have uninitialized
values for their children hits.

To summarize, you enter a percentage (or you keep the default 0.5%) and it gets
multiplied by a random number. The result is supposed to be the minimum hits
a branch should have to be displayed. But it often produced a huge number due
to the unintialization randomness. Hence some callchains were not displayed.

The following patch solves that particular problem, I'll push it to Ingo for .36
(and -stable):

diff --git a/tools/perf/util/callchain.h b/tools/perf/util/callchain.h
index 624a96c..d1349d0 100644
--- a/tools/perf/util/callchain.h
+++ b/tools/perf/util/callchain.h
@@ -51,6 +51,7 @@ static inline void callchain_init(struct callchain_node *node)
 	INIT_LIST_HEAD(&node->val);
 
 	node->parent = NULL;
+	node->children_hit = 0;
 	node->hit = 0;
 }
 



> Also the flat mode is rendered incorrectly, it just adds different call
> graphs inside a single process directly after each other instead of
> separating them in the rendering.


I'm not sure what you mean. The flat format is a pure dump of every callchains
that belong to a single process (or whatever kind of histogram source...).

What do you mean by separating them in the rendering?

OTOH I see another bug:

    61.70%        python  [kernel.kallsyms]  [k] xlog_sync
             0.03%
                xlog_sync
                xlog_write
                xlog_cil_push
                _xfs_log_force
                xfs_log_force
                xfs_commit_dummy_trans
                xfs_quiesce_data
                xfs_fs_sync_fs
                __sync_filesystem
                sync_one_sb
                iterate_supers
                sync_filesystems
                sys_sync
                system_call_fastpath
                0x334f0d72e7

             0.01%
                xlog_sync
                _xfs_log_force
                xfs_log_force
                xfs_commit_dummy_trans
                xfs_quiesce_data
                xfs_fs_sync_fs
                __sync_filesystem
                sync_one_sb
                iterate_supers
                sync_filesystems
                sys_sync
                system_call_fastpath
                0x334f0d72e7

I think that's what you said previously: they sometimes don't add up correctly.
The total of hit percentages from callchains should equal the total
percentage of the process they belong to.

I'm investigating to find the reason.

Thanks a lot of your reports!

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/