linux-kernel - Re: [RFC] sched_clock: Track monotonic raw clock

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <1410522334.16936.42.camel@hornet>
Date:	Fri, 12 Sep 2014 12:45:34 +0100
From:	Pawel Moll <pawel.moll@....com>
To:	Richard Cochran <richardcochran@...il.com>
Cc:	Steven Rostedt <rostedt@...dmis.org>,
	Ingo Molnar <mingo@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Oleg Nesterov <oleg@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Mel Gorman <mgorman@...e.de>,
	Andy Lutomirski <luto@...capital.net>,
	John Stultz <john.stultz@...aro.org>,
	Stephen Boyd <sboyd@...eaurora.org>,
	Baruch Siach <baruch@...s.co.il>,
	Thomas Gleixner <tglx@...utronix.de>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [RFC] sched_clock: Track monotonic raw clock

On Tue, 2014-07-22 at 20:39 +0100, Richard Cochran wrote:
> So the only reasonable way, IMHO, is to simply provide the needed
> information in the traces, and then add some user space code to find
> the relationship between the perf clock and the system clock. The
> great advantage of this approach is that you don't have to create a
> perfect servo from day one. The time stamps can be re-analyzed at any
> time after the fact.

Ok, so I had yet another (already lost count ;-) look into this...

I came up with a conclusion that the simplest way it can be done is to
have a periodically repeated events marked with both "kernel timestamp"
and "user timestamp". And a scheduler switch is such a periodic event.
Moreover, it seems quite natural: a user process is scheduled in (take a
snapshot of sched clock and monotonic raw at this moment, let's call
them t_s_in and t_m_in), the process "generates" the "user timestamp" by
clock_gettime(CLOCK_MONOTONIC_RAW, t_m_x), the process is scheduled out
(again, take sched and monotonic timestamps then, t_s_out and t_m_out).

The, after extracting all the t_?_? from the trace stream, a translation
(we've know the monotonic raw value t_m_x, we're looking for its sched
clock equivalent t_s_x) should simply get down to:

t_s_x = t_s_in + ((t_m_x - t_m_1) * (t_s_out - t_s_in)) / (t_m_out - t_m_in)

I quickly made an experiment, by adding a customised variant of
sched_switch trace event (note: I'm *not* proposing it to be merged)

8<-------------
>From ad772fca06da4774e6dd3a116ee2424a0dd3d80f Mon Sep 17 00:00:00 2001
From: Pawel Moll <pawel.moll@....com>
Date: Fri, 22 Aug 2014 15:23:13 +0100
Subject: [PATCH] trace: ched_switch event variant with raw monotonic timestamp

This patch adds a variant of sched_switch trace event,
adding a field for a monotonic raw timestamp.

Signed-off-by: Pawel Moll <pawel.moll@....com>
---
 include/trace/events/sched.h | 48 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c          |  1 +
 2 files changed, 49 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 0a68d5a..1429958 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -7,6 +7,7 @@
 #include <linux/sched.h>
 #include <linux/tracepoint.h>
 #include <linux/binfmts.h>
+#include <linux/time.h>
 
 /*
  * Tracepoint for calling kthread_stop, performed to end a kthread:
@@ -150,6 +151,53 @@ TRACE_EVENT(sched_switch,
 );
 
 /*
+ * Tracepoint for task switches, including raw monotonic raw clock
+ * timestamp, to perform user- to kernelspace time correlation
+ */
+TRACE_EVENT(sched_switch_raw_monotonic,
+
+	TP_PROTO(struct task_struct *prev,
+		 struct task_struct *next),
+
+	TP_ARGS(prev, next),
+
+	TP_STRUCT__entry(
+		__field(	u64,	raw_monotonic			)
+		__array(	char,	prev_comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	prev_pid			)
+		__field(	int,	prev_prio			)
+		__field(	long,	prev_state			)
+		__array(	char,	next_comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	next_pid			)
+		__field(	int,	next_prio			)
+	),
+
+	TP_fast_assign(
+		struct timespec now;
+		getrawmonotonic(&now);
+		__entry->raw_monotonic = timespec_to_ns(&now);
+		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
+		__entry->prev_pid	= prev->pid;
+		__entry->prev_prio	= prev->prio;
+		__entry->prev_state	= __trace_sched_switch_state(prev);
+		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
+		__entry->next_pid	= next->pid;
+		__entry->next_prio	= next->prio;
+	),
+
+	TP_printk("raw_monotonic=%Lu prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
+		__entry->raw_monotonic, __entry->prev_comm,
+		__entry->prev_pid, __entry->prev_prio,
+		__entry->prev_state & (TASK_STATE_MAX-1) ?
+		  __print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",
+				{ 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
+				{ 16, "Z" }, { 32, "X" }, { 64, "x" },
+				{ 128, "K" }, { 256, "W" }, { 512, "P" }) : "R",
+		__entry->prev_state & TASK_STATE_MAX ? "+" : "",
+		__entry->next_comm, __entry->next_pid, __entry->next_prio)
+);
+
+/*
  * Tracepoint for a task being migrated:
  */
 TRACE_EVENT(sched_migrate_task,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ec1a286..0cc7a92 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2179,6 +2179,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
 	trace_sched_switch(prev, next);
+	trace_sched_switch_raw_monotonic(prev, next);
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
 	fire_sched_out_preempt_notifiers(prev, next);
-- 
1.9.1
8<-------------

And than did some experiment, opening a perf session with this event and
executing the following test:

        clock_gettime(CLOCK_MONOTONIC_RAW, &t_m_1);
        write(marker, "test", 4);
        clock_gettime(CLOCK_MONOTONIC_RAW, &t_m_2);

where "marker" is a descriptor for
"/sys/kernel/debug/tracing/trace_marker". In other words, I generated a
unique event which will be timestamped by sched_clock. Then I averaged
t_m_1 and t_m_2 to get t_m_x, translated it into t_s_x with the formula
above and compared it with a timestamp for the marked, grepped from the
"/sys/kernel/debug/tracing/trace" (as it is directly injected into the
ftrace buffer, there is no way of getting it into perf data stream). On
average I was getting results accurate to or below 1 microsecond (that's
the resolution of a timestamp generated by standard ftrace output).

So, the approach works in general, however:

1. I guess adding monotonic raw field to the "vanilla" sched_switch
won't be praised... A clone like the above one doesn't make much sense
either, I think.

2. Even if this was possible (or we had a clone like the above), only
root can request trace events. This is acceptable for system-wide
profiling tools (which will run as root by definition), but can be a
problem for JIT engines, which usually run as unprivileged processes.

3. The accuracy, although not bad and sufficient in a lot of use cases
(eg. latencies when accessing hwmon sensors are big and unpredictable by
their nature, so a microsecond here and there don't matter at all), may
not be good enough for JIT engines - there must be a way of precisely
marking a moment between a new code (and debug information) being
generated and its first instruction executed. "trace_marker" could be
used for this, but as mentioned above, it's "invisible" to perf, and
even if it was, it would be root-only, see p.2.

With all this in mind, I though it would be possible to generalise the
"monotonic timestamping" at the perf interface level and duplicating the
marker idea there... A short RFC series to follow.

Pawel

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/