Message-ID: <20110218192433.GA5658@nowhere>
Date:	Fri, 18 Feb 2011 20:24:35 +0100
From:	Frederic Weisbecker <fweisbec@...il.com>
To:	David Ahern <daahern@...co.com>
Cc:	Ingo Molnar <mingo@...e.hu>, linux-perf-users@...r.kernel.org,
	linux-kernel@...r.kernel.org, peterz@...radead.org,
	acme@...stprotocols.net, paulus@...ba.org,
	Thomas Gleixner <tglx@...utronix.de>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Arnaldo Carvalho de Melo <acme@...hat.com>
Subject: Re: [PATCH 3/3] perf events: add timehist option to record and
 report

On Fri, Feb 18, 2011 at 11:07:32AM -0700, David Ahern wrote:
> 
> 
> On 02/18/11 10:59, Frederic Weisbecker wrote:
> > On Fri, Feb 18, 2011 at 07:28:20AM -0700, David Ahern wrote:
> >>
> >>
> >> On 02/18/11 00:06, Ingo Molnar wrote:
> >>>
> >>> * David Ahern <daahern@...co.com> wrote:
> >>>
> >>>> +--timehist::
> >>>> +        Generate time history output. This shows each sample with a wall-clock
> >>>> +        timestamp and address-to-symbol conversions. The samples are output in
> >>>> +        the same order they appear in the perf.data file. The --timehist option
> >>>> +        must also be used with 'perf record' to get wall-clock timestamps.
> >>>
> >>> Displaying samples this way is very useful, but how is this different from trace 
> >>> output - i.e. 'perf script' output?
> >>>
> >>> There's a 'live logging mode' feature:
> >>>
> >>>     perf record ./myworkload | perf inject -v -b | perf script -i -
> >>>
> >>> ( There's also some work going on to have a separate 'perf trace' utility that will
> >>>   output a linear trace of whatever events are in perf.data. )
> >>>
> >>> So unless i'm missing something it would be more useful to extend 'perf script' (or 
> >>> the upcoming 'perf trace') to list regular samples in a perf.data (i.e. not rely on 
> >>> -R raw trace data alone), and of course allow a straightforward utilization of real 
> >>> timestamps, when available.
> >>
> >> To date I have not seen any option to show samples with wall-clock
> >> timestamps. In a sense it pretty-prints the -R format. This allows perf
> >> samples to be compared to other log files / sources of data to resolve
> >> an issue.
> >>
> >> For example:
> >>
> >>  Time-of-Day   Kernel Timestamp cpu    Event  Command   PID  IP     Symbol Name (DSO Name)
> >>
> >> 22:40:13.831567     858005051634   1  context-switches  kworker/1:0 8  ffffffff8103251e perf_event_task_sched_out ([kernel.kallsyms])
> >>
> >> 22:40:13.831683     858005168139   1  context-switches  sshd 897  ffffffff8103251e perf_event_task_sched_out ([kernel.kallsyms])
> >>
> >> With the stack callchain option it dumps a line for each address in the
> >> stack (columns abbreviated here for email):
> >>
> >> 07:24:37.527728  1  sshd 819  ffffffff8103251e perf_event_task_sched_out ([kernel.kallsyms])
> >> 07:24:37.527728  1  sshd 819  ffffffff81377231 schedule ([kernel.kallsyms])
> >> 07:24:37.527728  1  sshd 819  ffffffff81377f3a schedule_hrtimeout_range_clock ([kernel.kallsyms])
> >> 07:24:37.527728  1  sshd 819  ffffffff81378016 schedule_hrtimeout_range ([kernel.kallsyms])
> >> 07:24:37.527728  1  sshd 819  ffffffff81103e15 poll_schedule_timeout ([kernel.kallsyms])
> >> 07:24:37.527728  1  sshd 819  ffffffff8110463c do_select ([kernel.kallsyms])
> >> 07:24:37.527728  1  sshd 819  ffffffff811047f4 core_sys_select ([kernel.kallsyms])
> >> 07:24:37.527728  1  sshd 819  ffffffff81104929 sys_select ([kernel.kallsyms])
> >> 07:24:37.527728  1  sshd 819  ffffffff81002bc2 system_call ([kernel.kallsyms])
> >> 07:24:37.527728  1  sshd 819      7f56d4814e93 __GI_select (/lib64/libc-2.12.90.so)
> >> 07:24:37.527728  1  sshd 819            40560b run_builtin (/tmp/build-perf/perf)
> >> 07:24:37.527728  1  sshd 819            40580d handle_internal_command (/tmp/build-perf/perf)
> >> 07:24:37.527728  1  sshd 819            405959 run_argv (/tmp/build-perf/perf)
> >> 07:24:37.527728  1  sshd 819            405b63 main (/tmp/build-perf/perf)
> >> 07:24:37.527728  1  sshd 819      7fa49de9fe7d __libc_start_main (/lib64/libc-2.12.90.so)
> >>
> >>
> >> This format has allowed us to do an in-depth analysis of scheduling
> >> (record -e cs -c 1): we can see when a process was running, how long
> >> it ran, why it was kicked off the processor, how long it was off the
> >> processor, what ran in between schedulings, etc. If a log file shows a
> >> process missing a heartbeat, you can look at the perf data in that time
> >> range and see whether it had a chance to run, and if not, where it was blocked.
> > 
> > Oh, in this case I strongly suggest you use the tracepoint events.
> > 
> > Look at the output of "perf list" and find the events whose names
> > start with "sched:"
> > 
> >   sched:sched_kthread_stop                   [Tracepoint event]
> >   sched:sched_kthread_stop_ret               [Tracepoint event]
> >   sched:sched_wait_task                      [Tracepoint event]
> >   sched:sched_wakeup                         [Tracepoint event]
> >   sched:sched_wakeup_new                     [Tracepoint event]
> >   sched:sched_switch                         [Tracepoint event]
> >   sched:sched_migrate_task                   [Tracepoint event]
> >   sched:sched_process_free                   [Tracepoint event]
> >   sched:sched_process_exit                   [Tracepoint event]
> >   sched:sched_process_wait                   [Tracepoint event]
> >   sched:sched_process_fork                   [Tracepoint event]
> >   sched:sched_stat_wait                      [Tracepoint event]
> >   sched:sched_stat_sleep                     [Tracepoint event]
> >   sched:sched_stat_iowait                    [Tracepoint event]
> >   sched:sched_stat_runtime                   [Tracepoint event]
> > 
> > 
> > You have the sched:sched_switch event and many others.
> > 
> > Just try:
> > 
> > perf record -a -e sched:*
> > perf script
> > 
> >             perf-4128  [000] 19242.870025: sched_stat_runtime: comm=perf pid=4128 runtime=7430405 [ns] vruntime=3530192223488 
> >             perf-4128  [000] 19242.870042: sched_stat_runtime: comm=perf pid=4128 runtime=23142 [ns] vruntime=3530192246630 [n
> >             perf-4128  [000] 19242.870045: sched_stat_sleep: comm=kondemand/0 pid=59 delay=9979163 [ns]
> >             perf-4128  [000] 19242.870048: sched_wakeup: comm=kondemand/0 pid=59 prio=120 success=1 target_cpu=000
> >             perf-4128  [000] 19242.870063: sched_stat_runtime: comm=perf pid=4128 runtime=21581 [ns] vruntime=3530192268211 [n
> >             perf-4128  [000] 19242.870066: sched_stat_wait: comm=kondemand/0 pid=59 delay=21581 [ns]
> >             perf-4128  [000] 19242.870069: sched_switch: prev_comm=perf prev_pid=4128 prev_prio=120 prev_state=R ==> next_comm
> >      kondemand/0-59    [000] 19242.870091: sched_stat_runtime: comm=kondemand/0 pid=59 runtime=27362 [ns] vruntime=35301862739
> >      kondemand/0-59    [000] 19242.870094: sched_stat_wait: comm=perf pid=4128 delay=27362 [ns]
> >      kondemand/0-59    [000] 19242.870095: sched_switch: prev_comm=kondemand/0 prev_pid=59 prev_prio=120 prev_state=S ==> next
> > 
> > And you can run your own script on these events:
> > 
> > $ sudo ./perf script -g python
> > generated Python script: perf-script.py
> > 
> > Edit perf-script.py and then run it:
> > 
> > $ perf script -s ./perf-script.py
> > 
> > That also works for perl.
> > 
> > The timestamps will be CPU time rather than wall-clock time, but at least
> > that seems to be partly what you're after?
> 
> Yes, I have looked at that output before -- essentially ftrace output.
> 
> We want not only the context-switch events but also the stack trace at the
> switch. For example, with the stack trace you can see preemption -- when
> a process got booted by another and where the booted process was at the
> time. You can see not only which system call caused an app to block
> (e.g., an ioctl) but the full code path leading to the block.

You can recognize preemption at the context switch tracepoint: if the state
of the scheduled-out task is R (TASK_RUNNING), it didn't go to sleep but got
preempted, either at an explicit preemption point like cond_resched() or at
a more implicit one: spin_unlock(), preempt_enable(), an interrupt, etc.
(It may also have been woken up while it was about to sleep, but that
doesn't make much difference.)
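
For instance, in the scripting framework, something along these lines would
flag every preemption. This is an untested sketch: the handler name and
argument list follow what "perf script -g python" generates for
sched:sched_switch, and may differ between perf versions.

def sched__sched_switch(event_name, context, common_cpu,
                common_secs, common_nsecs, common_pid, common_comm,
                prev_comm, prev_pid, prev_prio, prev_state,
                next_comm, next_pid, next_prio):
        # prev_state == 0 is TASK_RUNNING: the task was switched out
        # while still runnable, i.e. preempted rather than sleeping.
        if prev_state == 0:
                print "%u.%09u cpu%u: %s[%d] preempted by %s[%d]" % \
                        (common_secs, common_nsecs, common_cpu,
                         prev_comm, prev_pid, next_comm, next_pid)

Save it as, say, preempt.py (the name is just an example) and run:

        $ perf record -a -e sched:sched_switch
        $ perf script -s ./preempt.py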

If you want to know when a process is booted by another, you can use the
fork tracepoint, or sched:sched_wakeup_new, etc...
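
For example, with the tracepoints from the "perf list" output above, so you
see both the fork and the first wakeup of the new task:

        perf record -a -e sched:sched_process_fork -e sched:sched_wakeup_new
        perf script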

And you can use syscall tracepoints to get the syscalls you want.
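
For example (raw_syscalls:* catches every syscall; per-syscall events such
as syscalls:sys_enter_ioctl are also available if the kernel was built with
syscall tracepoints):

        perf record -p <pid> -e raw_syscalls:sys_enter -e raw_syscalls:sys_exit
        perf script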

I don't see much point in using stack traces for this. But if you
do need them, then rather add that support to perf script, in our
scripting framework.

Because what you've done is basically add tracing support to
perf report, but we already have perf script for that. It only handles
tracepoint events, but those are the events people are interested in,
because they mark logical events in the kernel. I guess people are not
interested in cpu-cycles overflow events and the like, as they don't
reflect a state change in the kernel.

Well, yeah, I can understand the case for the software events, which do
carry meaningful events from the kernel. But supporting those software
events was a mistake in perf. You should rather use the tracepoint
events instead.
 
> That data, along with the gettimeofday timestamp, has allowed us to
> resolve performance issues such as a system call taking longer than
> expected during a specific sequence of events, or a process getting
> preempted and not scheduled again for N seconds, etc.

That's about the same here. If you really need this, the support should be
added to perf script, operating on tracepoint events.
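
Something in this direction could be, say, a perf script handler that tracks
when each pid goes off the cpu and reports when it comes back too late. An
untested sketch, with the same generated handler signature as above and a
purely illustrative one-second threshold:

off_since = {}          # pid -> time (s) when the task was switched out
THRESHOLD = 1.0         # report anything off the cpu longer than this

def sched__sched_switch(event_name, context, common_cpu,
                common_secs, common_nsecs, common_pid, common_comm,
                prev_comm, prev_pid, prev_prio, prev_state,
                next_comm, next_pid, next_prio):
        now = common_secs + common_nsecs * 1e-9
        # Remember when the outgoing task left the cpu.
        off_since[prev_pid] = now
        # If we saw the incoming task leave earlier, report the gap.
        if next_pid in off_since:
                delta = now - off_since.pop(next_pid)
                if delta > THRESHOLD:
                        print "%s[%d] was off the cpu for %.6f s" % \
                                (next_comm, next_pid, delta)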