Message-ID: <1410522334.16936.42.camel@hornet>
Date: Fri, 12 Sep 2014 12:45:34 +0100
From: Pawel Moll <pawel.moll@....com>
To: Richard Cochran <richardcochran@...il.com>
Cc: Steven Rostedt <rostedt@...dmis.org>, Ingo Molnar <mingo@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>, Oleg Nesterov <oleg@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>, Mel Gorman <mgorman@...e.de>,
	Andy Lutomirski <luto@...capital.net>, John Stultz <john.stultz@...aro.org>,
	Stephen Boyd <sboyd@...eaurora.org>, Baruch Siach <baruch@...s.co.il>,
	Thomas Gleixner <tglx@...utronix.de>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [RFC] sched_clock: Track monotonic raw clock

On Tue, 2014-07-22 at 20:39 +0100, Richard Cochran wrote:
> So the only reasonable way, IMHO, is to simply provide the needed
> information in the traces, and then add some user space code to find
> the relationship between the perf clock and the system clock. The
> great advantage of this approach is that you don't have to create a
> perfect servo from day one. The time stamps can be re-analyzed at any
> time after the fact.

Ok, so I had yet another (already lost count ;-) look into this...

I came to the conclusion that the simplest way to do it is to have a
periodically repeated event marked with both a "kernel timestamp" and a
"user timestamp" - and a scheduler switch is such a periodic event.
Moreover, it seems quite natural: a user process is scheduled in (take
a snapshot of the sched clock and of the raw monotonic clock at this
moment; call them t_s_in and t_m_in), the process "generates" the "user
timestamp" with clock_gettime(CLOCK_MONOTONIC_RAW, &t_m_x), and the
process is scheduled out (again, take the sched and monotonic
timestamps, t_s_out and t_m_out). Then, after extracting all the t_?_?
values from the trace stream, the translation (we know the raw
monotonic value t_m_x and are looking for its sched clock equivalent
t_s_x) simply comes down to:

t_s_x = t_s_in + ((t_m_x - t_m_in) * (t_s_out - t_s_in)) / (t_m_out - t_m_in)

I quickly put together an experiment, adding a customised variant of
the sched_switch trace event (note: I'm *not* proposing it to be
merged):

8<-------------
From ad772fca06da4774e6dd3a116ee2424a0dd3d80f Mon Sep 17 00:00:00 2001
From: Pawel Moll <pawel.moll@....com>
Date: Fri, 22 Aug 2014 15:23:13 +0100
Subject: [PATCH] trace: sched_switch event variant with raw monotonic
 timestamp

This patch adds a variant of the sched_switch trace event, adding a
field for a raw monotonic timestamp.
Signed-off-by: Pawel Moll <pawel.moll@....com>
---
 include/trace/events/sched.h | 48 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c          |  1 +
 2 files changed, 49 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 0a68d5a..1429958 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -7,6 +7,7 @@
 #include <linux/sched.h>
 #include <linux/tracepoint.h>
 #include <linux/binfmts.h>
+#include <linux/time.h>
 
 /*
  * Tracepoint for calling kthread_stop, performed to end a kthread:
@@ -150,6 +151,53 @@ TRACE_EVENT(sched_switch,
 );
 
 /*
+ * Tracepoint for task switches, including a raw monotonic clock
+ * timestamp, to perform user- to kernel-space time correlation
+ */
+TRACE_EVENT(sched_switch_raw_monotonic,
+
+	TP_PROTO(struct task_struct *prev,
+		 struct task_struct *next),
+
+	TP_ARGS(prev, next),
+
+	TP_STRUCT__entry(
+		__field(	u64,	raw_monotonic			)
+		__array(	char,	prev_comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	prev_pid			)
+		__field(	int,	prev_prio			)
+		__field(	long,	prev_state			)
+		__array(	char,	next_comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	next_pid			)
+		__field(	int,	next_prio			)
+	),
+
+	TP_fast_assign(
+		struct timespec now;
+		getrawmonotonic(&now);
+		__entry->raw_monotonic = timespec_to_ns(&now);
+		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
+		__entry->prev_pid	= prev->pid;
+		__entry->prev_prio	= prev->prio;
+		__entry->prev_state	= __trace_sched_switch_state(prev);
+		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
+		__entry->next_pid	= next->pid;
+		__entry->next_prio	= next->prio;
+	),
+
+	TP_printk("raw_monotonic=%Lu prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
+		__entry->raw_monotonic, __entry->prev_comm,
+		__entry->prev_pid, __entry->prev_prio,
+		__entry->prev_state & (TASK_STATE_MAX-1) ?
+		  __print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",
+				{ 1, "S" }, { 2, "D" }, { 4, "T" }, { 8, "t" },
+				{ 16, "Z" }, { 32, "X" }, { 64, "x" },
+				{ 128, "K" }, { 256, "W" }, { 512, "P" }) : "R",
+		__entry->prev_state & TASK_STATE_MAX ? "+" : "",
+		__entry->next_comm, __entry->next_pid, __entry->next_prio)
+);
+
+/*
  * Tracepoint for a task being migrated:
  */
 TRACE_EVENT(sched_migrate_task,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ec1a286..0cc7a92 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2179,6 +2179,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
 	trace_sched_switch(prev, next);
+	trace_sched_switch_raw_monotonic(prev, next);
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
 	fire_sched_out_preempt_notifiers(prev, next);
-- 
1.9.1
8<-------------

I then ran the experiment, opening a perf session with this event and
executing the following test:

	clock_gettime(CLOCK_MONOTONIC_RAW, &t_m_1);
	write(marker, "test", 4);
	clock_gettime(CLOCK_MONOTONIC_RAW, &t_m_2);

where "marker" is a file descriptor for
"/sys/kernel/debug/tracing/trace_marker". In other words, I generated a
unique event which will be timestamped by sched_clock. Then I averaged
t_m_1 and t_m_2 to get t_m_x, translated it into t_s_x with the formula
above and compared the result with the timestamp of the marker, grepped
from "/sys/kernel/debug/tracing/trace" (as the marker is injected
directly into the ftrace buffer, there is no way of getting it into the
perf data stream).
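The translation formula above is plain linear interpolation, so the
user-space side of the post-processing is only a few lines of C. A
minimal sketch (the function name is mine, all values are nanoseconds,
and the unsigned __int128 intermediate - a GCC/Clang extension - is
there only to keep the 64-bit product from overflowing):

	#include <stdint.h>

	/*
	 * Translate a CLOCK_MONOTONIC_RAW timestamp t_m_x, taken while
	 * the process was scheduled in, into its sched clock equivalent
	 * t_s_x, interpolating between the sched-in (t_m_in/t_s_in) and
	 * sched-out (t_m_out/t_s_out) snapshots from the trace stream.
	 */
	static uint64_t raw_to_sched_clock(uint64_t t_m_x,
			uint64_t t_m_in, uint64_t t_s_in,
			uint64_t t_m_out, uint64_t t_s_out)
	{
		/* 128-bit product avoids overflow of nanosecond values */
		unsigned __int128 delta = (unsigned __int128)(t_m_x - t_m_in)
				* (t_s_out - t_s_in);

		return t_s_in + (uint64_t)(delta / (t_m_out - t_m_in));
	}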
On average I was getting results accurate to or below 1 microsecond
(that's the resolution of a timestamp generated by the standard ftrace
output). So the approach works in general. However:

1. I guess adding a raw monotonic field to the "vanilla" sched_switch
event won't be praised... and a clone like the one above doesn't make
much sense either, I think.

2. Even if this was possible (or we had a clone like the above), only
root can request trace events. This is acceptable for system-wide
profiling tools (which will run as root by definition), but can be a
problem for JIT engines, which usually run as unprivileged processes.

3. The accuracy, although not bad and sufficient for a lot of use cases
(e.g. latencies when accessing hwmon sensors are big and unpredictable
by nature, so a microsecond here and there doesn't matter at all), may
not be good enough for JIT engines - there must be a way of precisely
marking the moment between new code (and its debug information) being
generated and its first instruction being executed. "trace_marker"
could be used for this, but as mentioned above it is "invisible" to
perf, and even if it were visible, it would be root-only - see point 2.

With all this in mind, I thought it should be possible to generalise
the "monotonic timestamping" at the perf interface level and to
duplicate the marker idea there... A short RFC series to follow.

Pawel
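As an illustration of point 3 above, here is a minimal sketch of the
marker-based timestamping as a JIT engine might do it. The
jit_emit_code()/jit_execute_code() stubs are hypothetical placeholders,
the trace_marker path assumes debugfs is mounted at /sys/kernel/debug,
and - per point 2 - the write requires root:

	#include <fcntl.h>
	#include <stdio.h>
	#include <time.h>
	#include <unistd.h>

	/* Hypothetical stand-ins for a JIT engine's real entry points. */
	static void jit_emit_code(void)    { /* generate code + debug info */ }
	static void jit_execute_code(void) { /* jump to the generated code */ }

	int main(void)
	{
		struct timespec now;
		int marker = open("/sys/kernel/debug/tracing/trace_marker",
				O_WRONLY);

		if (marker < 0) {
			perror("trace_marker");
			return 1;
		}

		jit_emit_code();

		/*
		 * Mark the moment between code generation and first
		 * execution with a raw monotonic timestamp, so it can
		 * later be correlated with sched_clock-stamped events.
		 */
		clock_gettime(CLOCK_MONOTONIC_RAW, &now);
		dprintf(marker, "jit: code ready at %lld.%09ld\n",
				(long long)now.tv_sec, now.tv_nsec);

		jit_execute_code();

		close(marker);
		return 0;
	}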