linux-kernel - Re: [PATCH 34/37] perf hists browser: Support flat callchains

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20151125210357.GA13940@danjae.kornet>
Date:	Thu, 26 Nov 2015 06:03:57 +0900
From:	Namhyung Kim <namhyung@...nel.org>
To:	Arnaldo Carvalho de Melo <arnaldo.melo@...il.com>
Cc:	Frederic Weisbecker <fweisbec@...il.com>,
	Ingo Molnar <mingo@...nel.org>, linux-kernel@...r.kernel.org,
	Andi Kleen <andi@...stfloor.org>,
	David Ahern <dsahern@...il.com>, Jiri Olsa <jolsa@...hat.com>,
	Kan Liang <kan.liang@...el.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>
Subject: Re: [PATCH 34/37] perf hists browser: Support flat callchains

Hi Arnaldo,

On Tue, Nov 24, 2015 at 11:10:25PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Tue, Nov 24, 2015 at 10:34:18PM -0300, Arnaldo Carvalho de Melo escreveu:
> > Em Wed, Nov 25, 2015 at 10:26:08AM +0900, Namhyung Kim escreveu:
> > > On Tue, Nov 24, 2015 at 12:45:51PM -0200, Arnaldo Carvalho de Melo wrote:
> > > > Em Tue, Nov 24, 2015 at 02:27:08PM +0900, Namhyung Kim escreveu:
> > > > > On Mon, Nov 23, 2015 at 04:16:48PM +0100, Frederic Weisbecker wrote:
> > > > > > On Thu, Nov 19, 2015 at 02:53:20PM -0300, Arnaldo Carvalho de Melo wrote:
> > > > > > > From: Namhyung Kim <namhyung@...nel.org>
> > > > > > [...]
> > > > > Thus I simply copied callchain lists in parents to leaf nodes.  Yes,
> > > > > it will consume some memory but can simplify the code.
> > > > 
> > > > I haven't done any measuring, but I'm noticing that 'perf top -g' is
> > > > showing more warnings about not being able to process events fast enough
> > > > and so ends up losing events, I tried with --max-stack 16 and it helped,
> > > > this is just a heads up.
> > > 
> > > OK, but it seems that it's not related to this patch since this patch
> > > only affects flat or folded callchain mode.
> > 
> > Well, doesn't this patch makes some of the involved data structures
> > larger, thus putting more pressure on the L1 cache, etc? It may well be
> > related, but we need to measure.
> > 
> > > > Perhaps my workstation workloads are gettning deeper callchains over
> > > > time, but perhaps this is the cost of processing callchains that is
> > > > increasing, I need to stop and try to quantify this.
> > > > 
> > > > We really need to look at reducing the overhead of processing
> > > > callchains.
> > > 
> > > Right, but with my multi-thread work, I realized that perf is getting
> > > heavier recently.  I guess it's mostly due to the atomic refcount
> > > work.  I need to get back to the multi-thread work..
> > 
> > We really need to measure this ;-)
> 
> So, something strange, if I use:
> 
> [acme@zoo linux]$ cat ~/bin/allmod 
> rm -rf ../build/allmodconfig/ ; mkdir ../build/allmodconfig/ ; make O=../build/allmodconfig/ allmodconfig ; make -j32 O=../build/allmodconfig
> [acme@zoo linux]$
> 
> To generate background load, I don't see that much this:
> 
> +    8.55%     8.49%  libc-2┌─Warning!──────────────────────────────────────────────┐ ▒
> +    7.08%     6.98%  perf  │Events are being lost, check IO/CPU overload!          │ ▒
> +    6.84%     0.04%  perf  │                                                       │ ▒
> +    6.01%     0.09%  perf  │You may want to run 'perf' using a RT scheduler policy:│ ▒
> +    5.26%     0.13%  [kerne│                                                       │t▒
> +    4.96%     1.50%  perf  │ perf top -r 80                                        │ ▒
> +    4.76%     3.58%  perf  │                                                       │ ▒
> +    4.69%     0.05%  perf  │Or reduce the sampling frequency.                      │ ▒
> 
> Its with a low loadavg:
> 
> [acme@zoo linux]$ cat /proc/loadavg 
> 0.75 0.79 0.64 3/549 21259
> 
> That it pops up :-\
> 
> If I take a snapshot with 'P'
> 
> # head -40 perf.hist.0
> +   21.43%    21.09%  libc-2.20.so          [.] _int_malloc
> +   19.49%     0.00%  perf                  [.] cmd_top
> +   19.46%     0.02%  perf                  [.] perf_top__mmap_read_idx
> +   19.03%     0.06%  perf                  [.] hist_entry_iter__add
> +   16.46%     1.85%  perf                  [.] iter_add_next_cumulative_entry
> +   12.09%    11.98%  libc-2.20.so          [.] free
> +   10.68%    10.61%  libc-2.20.so          [.] __libc_calloc
> +    9.61%     0.09%  perf                  [.] hists__decay_entries
> +    8.92%     8.85%  libc-2.20.so          [.] malloc_consolidate
> +    8.17%     6.33%  perf                  [.] free_callchain_node
> +    7.94%     0.09%  perf                  [.] hist_entry__delete
> +    6.22%     0.03%  perf                  [.] callchain_append
> +    6.20%     6.11%  perf                  [.] append_chain_children
> +    5.44%     1.50%  perf                  [.] __hists__add_entry
> +    4.34%     0.14%  [kernel]              [k] entry_SYSCALL_64_fastpath
> +    3.69%     3.67%  perf                  [.] sort__dso_cmp
> +    3.12%     0.20%  perf                  [.] hists__output_resort
> +    2.88%     0.00%  [unknown]             [.] 0x6d86258d4c544155
> +    2.88%     0.00%  libc-2.20.so          [.] __libc_start_main
> +    2.88%     0.00%  perf                  [.] main
> +    2.88%     0.00%  perf                  [.] run_builtin
> +    2.66%     0.00%  libpthread-2.20.so    [.] start_thread
> +    2.66%     0.00%  perf                  [.] display_thread_tui
> +    2.66%     0.00%  perf                  [.] perf_evlist__tui_browse_hists
> +    2.66%     0.00%  perf                  [.] perf_evsel__hists_browse
> +    2.49%     0.07%  [kernel]              [k] sys_futex
> +    2.42%     0.06%  [kernel]              [k] do_futex
>      2.24%     0.00%  perf                  [.] perf_top__sort_new_samples
> +    1.92%     0.51%  perf                  [.] hists__collapse_resort
> +    1.87%     1.86%  perf                  [.] hpp__sort_overhead_acc
> +    1.69%     0.09%  libc-2.20.so          [.] __lll_unlock_wake_private
> +    1.45%     1.44%  perf                  [.] hpp__nop_cmp
>      1.45%     1.43%  perf                  [.] rb_erase
> +    1.44%     1.42%  perf                  [.] __sort__hpp_cmp
>      1.31%     0.16%  libc-2.20.so          [.] __lll_lock_wait_private
>      1.18%     0.19%  [kernel]              [k] futex_wake
> +    1.13%     1.13%  perf                  [.] sort__sym_cmp
>      1.11%     0.02%  [kernel]              [k] schedule
>      1.09%     0.22%  [kernel]              [k] __schedule
>      0.99%     0.08%  [kernel]              [k] futex_wait
> 
> So its quite a lot of mallocs, i.e. just do a 'perf top -g' and wait a
> bit, malloc goes on bubbling up to the top, about the same time it
> starts showing that popup screen telling that we're losing events.
> 
> If I use --no-children, to see if there is a difference, using either
> --call-graph caller or callee, it doesn't get more than about 1%.
> 
> Ok, now I tried with "perf top --call-graph caller" i.e. with
> --children, and looked at the _int_malloc callchains I get really long,
> bogus callchains, see below. That explains why I don't lose events when
> I use --max-stack.
> 
> I'll have to stop now, and I put the full perf.hist.1 at
> http://vger.kernel.org/~acme/perf/perf.hist.1.xz
> 
> - Arnaldo
> 
> [root@zoo ~]# head -60 perf.hist.1
> -   17.92%    17.10%  libc-2.20.so                        [.] _int_malloc
>    + 112.80% _int_malloc
>      11.14% 0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068
>         0x41bf5118
>         0x41bf5068

Hmm.. we should cut off the loop in the broken callchains.  And it'd
be good to apply --hide-unresolved to callchains as well.  I'll take a
look at it.

Thanks,
Namhyung
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/