lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <1338797382-287275-1-git-send-email-avagin@openvz.org>
Date:	Mon,  4 Jun 2012 12:09:37 +0400
From:	Andrew Vagin <avagin@...nvz.org>
To:	unlisted-recipients:; (no To-header on input)
Cc:	Arun Sharma <asharma@...com>, Oleg Strikov <OSTRIKOV@...dia.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Ingo Molnar <mingo@...hat.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	linux-kernel@...r.kernel.org
Subject: [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4)

Do you want to know where your code waits locks for a long time?
Yes! It's for you. This feature helps you to find bottlenecks.

It's not artificial task. Once one of my colleague was investigating a
scalability problem. He pressed sysrq-t some times and tried to merge
call-chains by hand. But perf can do that better.

Here is fourth attempt to add this functionality in the kernel.

Previous attempts:
* [PATCH 0/4] trace: add ability to collect call chains of non-current task.

* [PATCH rh6 0/7] Teach perf tool to collect D-state statistics
This version doesn't work for non-root users.

* "sched/tracing: Add a new tracepoint for sleeptime" by Arun Sharma
It has been committed, but then reverted due to performance reason.

This series resolves a problem with events filtering and non-root users.
Before I tried to use sched_stat* events, but it occurs in a context of
another task. For this case logic of perf_event_context* doesn't work.

This series adds one more event sched:sched_switch_finish, it's sent
when a task starts executing on a cpu. When a task goes out from cpu,
sched:sched_switch is sent.

The time difference between sched_switch and sched_switch_finish is a time
when a task wasn't being executed on a cpu.

For profiling sleep time, events of both types should be collected and
filtered by a defined state (interruptible of uninterruptible sleep).

Then events should be post-processed with help "perf inject". It saves the time
of sched_switch event for each process and when it meets stat_switch_finish,
it sets time difference in sample->period and saves this new sample
in the new perf.data.

./perf record -e sched:sched_switch -g --filter 'prev_state == 1' \
	-e sched:sched_switch_finish --filter 'state == 1' -P  ~/foo

./perf inject -i ~/perf.data -o perf.data.new -s

A test program:
#include<unistd.h>
#include<time.h>
#include<sys/select.h>

int main()
{
          struct timespec ts1;
          struct timeval tv1;
          int i;
        long s;


          for (i = 0; i<  2; i++) {
                  ts1.tv_sec = 0;
                  ts1.tv_nsec = 10000000;
                  nanosleep(&ts1, NULL);

                  tv1.tv_sec = 0;
                  tv1.tv_usec = 40000;
                  select(0, NULL, NULL, NULL,&tv1);
                if (i % 100 == 0) {
                        s = time(NULL);
                        while (time(NULL) - s < 4);
                }
          }
        return 1;
}

Profiling results:
# Overhead  Command      Shared Object                  Symbol
# ........  .......  .................  ......................
#
   100.00%      foo  [kernel.kallsyms]  [k] finish_task_switch
                |
                --- finish_task_switch
                    __schedule
                    schedule
                   |          
                   |--79.85%-- schedule_hrtimeout_range_clock
                   |          schedule_hrtimeout_range
                   |          poll_schedule_timeout
                   |          do_select
                   |          core_sys_select
                   |          sys_select
                   |          system_call
                   |          __select
                   |          __libc_start_main
                   |          
                    --20.15%-- do_nanosleep
                              hrtimer_nanosleep
                              sys_nanosleep
                              system_call
                              __GI___libc_nanosleep
                              __libc_start_main

Andrew Vagin (5):
  event: add an event sched_switch_finish
  sched: send the event sched_switch before perf_event_task_sched_out
  sched: save a previous state on task_struct
  perf: teach "perf inject" to work with files
  perf: teach perf inject to merge sched_switch* events

 include/linux/sched.h        |    3 +
 include/trace/events/sched.h |   29 ++++++++++
 kernel/sched/core.c          |    3 +-
 tools/perf/builtin-inject.c  |  119 +++++++++++++++++++++++++++++++++++++++---
 4 files changed, 146 insertions(+), 8 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ