Message-ID: <50A17A9A.1060400@linaro.org>
Date: Mon, 12 Nov 2012 14:39:22 -0800
From: John Stultz <john.stultz@...aro.org>
To: Stephane Eranian <eranian@...gle.com>
CC: Peter Zijlstra <peterz@...radead.org>,
LKML <linux-kernel@...r.kernel.org>,
"mingo@...e.hu" <mingo@...e.hu>, Paul Mackerras <paulus@...ba.org>,
Anton Blanchard <anton@...ba.org>,
Will Deacon <will.deacon@....com>,
"ak@...ux.intel.com" <ak@...ux.intel.com>,
Pekka Enberg <penberg@...il.com>,
Steven Rostedt <rostedt@...dmis.org>,
Robert Richter <robert.richter@....com>,
tglx <tglx@...utronix.de>
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples
with kernel samples
On 11/12/2012 12:54 PM, Stephane Eranian wrote:
> On Mon, Nov 12, 2012 at 7:53 PM, John Stultz <john.stultz@...aro.org> wrote:
>> On 11/11/2012 12:32 PM, Stephane Eranian wrote:
>>> On Sat, Nov 10, 2012 at 3:04 AM, John Stultz <john.stultz@...aro.org>
>>> wrote:
>>>> Also I worry that it will be abused in the same way that direct TSC
>>>> access is, where the seemingly better performance (compared to the
>>>> more careful/correct CLOCK_MONOTONIC) would cause developers to
>>>> write fragile userland code that will break when moved from one
>>>> machine to the next.
>>>>
>>> The only goal for this new time source is correlating user-level
>>> samples with kernel-level samples, i.e., application-level events
>>> with a PMU counter overflow, for instance. Anybody trying anything
>>> else would be on their own.
>>>
>>> clock_gettime(CLOCK_PERF): guaranteed to return the same time source
>>> as that used by the perf_event subsystem to timestamp samples when
>>> PERF_SAMPLE_TIME is requested in attr->sample_type.
>>
>> I'm not familiar enough with perf's interfaces, but if you are going
>> to bind this clockid so tightly to perf, could you maybe export a perf
>> timestamp from one of perf's interfaces rather than using the more
>> generic clock_gettime() interface?
>>
> Yeah, I considered that as well. But it is more complicated. The only syscall
> we could extend for perf_events is ioctl(). But that one requires that an
> event be created so we obtain a file descriptor for the ioctl() call.
> So we'd have to pretend to program a dummy event just for the purpose
> of obtaining a timestamp.
> We could do that but that's not so nice. But more amenable to the
Sorry, you trailed off. Did you want to finish that thought? (I do
that all the time. :)
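That said, if I'm following the dummy-event idea, I'd imagine something
roughly like the sketch below. To be clear, the PERF_EVENT_IOC_GET_TIME
ioctl is made up here purely to illustrate the shape of the interface;
no such ioctl exists today:

#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical ioctl, for illustration only. */
#define PERF_EVENT_IOC_GET_TIME _IOR('$', 99, uint64_t)

int main(void)
{
	struct perf_event_attr attr;
	uint64_t now;
	int fd;

	/* Dummy event, created purely to get a file descriptor. */
	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_CPU_CLOCK;
	attr.disabled = 1;	/* never actually counts anything */

	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0)
		return 1;

	if (ioctl(fd, PERF_EVENT_IOC_GET_TIME, &now) == 0)
		printf("perf timestamp: %llu ns\n", (unsigned long long)now);
	close(fd);
	return 0;
}

Which I think illustrates your point: you end up creating kernel state
just to read a clock.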
> Keep in mind that the clock_gettime() would be used by programs which are not
> self-monitoring but may be monitored externally by a tool such as perf. We just
> need them to emit their events with a timestamp that can be
> correlated offline
> with those of perf_events.
Again, forgive me for not really knowing much about perf here, but could
you have perf log an event when clock_gettime() was called, possibly
recording the returned value, so you could correlate that data yourself?
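As a variant that works today (though via ftrace's trace_marker rather
than perf proper, and it needs debugfs mounted): the app could stamp its
own events with CLOCK_MONOTONIC and write a marker into the trace
stream, so the record shows up interleaved with the kernel's timestamped
events for offline matching. A rough sketch:

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	struct timespec ts;
	char buf[128];
	int fd, len;

	fd = open("/sys/kernel/debug/tracing/trace_marker", O_WRONLY);
	if (fd < 0)
		return 1;

	/* Tag an application event with the CLOCK_MONOTONIC value the
	 * app itself saw, so the two time domains can be matched up
	 * offline. */
	clock_gettime(CLOCK_MONOTONIC, &ts);
	len = snprintf(buf, sizeof(buf), "app_event monotonic=%ld.%09ld\n",
		       (long)ts.tv_sec, ts.tv_nsec);
	if (len > 0)
		write(fd, buf, len);
	close(fd);
	return 0;
}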
>>>> I'd probably rather have perf output timestamps to userland using
>>>> sane clocks (CLOCK_MONOTONIC), rather than trying to introduce a new
>>>> time domain to userland. But I probably could be convinced I'm wrong.
>>>>
>>> Can you get CLOCK_MONOTONIC efficiently and in ALL circumstances,
>>> without grabbing any locks? It would need to run from NMI context.
>> No, of course not; that's why we have sched_clock. But I'm suggesting
>> we consider changing what perf exports (maybe via
>> interpolation/translation) to be CLOCK_MONOTONIC-ish.
>>
> Explain to me the key difference between monotonic and what
> sched_clock() is returning today. Does this have to do with the global
> monotonic vs. the cpu-wide monotonic?
So CLOCK_MONOTONIC is the number of NTP-corrected (for accuracy) seconds
+ nsecs that the machine has been up (which doesn't include time spent
in suspend). It's promised to be globally monotonic across cpus.
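For reference, the suspend distinction is visible from userland:
CLOCK_BOOTTIME is the same clock except it does include suspend time.
Assuming your libc headers are new enough to have CLOCK_BOOTTIME, a
quick comparison looks like:

#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec mono, boot;

	clock_gettime(CLOCK_MONOTONIC, &mono);
	clock_gettime(CLOCK_BOOTTIME, &boot);	/* includes suspend */

	/* On a machine that has suspended, BOOTTIME runs ahead of
	 * MONOTONIC by the total time spent suspended. */
	printf("MONOTONIC: %ld.%09ld\n", (long)mono.tv_sec, mono.tv_nsec);
	printf("BOOTTIME:  %ld.%09ld\n", (long)boot.tv_sec, boot.tv_nsec);
	return 0;
}

and the two will diverge across a suspend/resume cycle.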
In my understanding, sched_clock's definition has changed over time. It
used to be a fast but possibly inaccurate count of nanoseconds since
boot; with suspend and other events it could reset or overflow, and its
users (then only the scheduler) were expected to deal with that. It also
wasn't guaranteed to be consistent across cpus, so it was limited to
calculating approximate time intervals on a single cpu.
However, with cfs (and Peter or Ingo could probably hop in and clarify
further) I believe it started to require some cross-cpu consistency, and
reset events would cause problems with the scheduler, so additional
layers have been added to try to enforce these additional requirements.
I suspect they aren't that far off, except that calibration frequency
errors go uncorrected with sched_clock. But I was thinking you could get
periodic timestamps in perf that correlate CLOCK_MONOTONIC with
sched_clock, and then allow the kernel to interpolate the sched_clock
times out to something pretty close to CLOCK_MONOTONIC. That way perf
wouldn't leak the sched_clock time domain to userland.
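As a rough sketch of that translation (the struct and function names
here are made up for illustration): perf would periodically emit paired
(sched_clock, CLOCK_MONOTONIC) samples, and raw sched_clock timestamps
would be converted by linear interpolation between the surrounding pair:

#include <stdint.h>

/* One periodic correlation sample, as perf might emit it. */
struct clock_anchor {
	uint64_t sched_ns;	/* sched_clock() at the sample */
	uint64_t mono_ns;	/* CLOCK_MONOTONIC at the same instant */
};

/*
 * Translate a raw sched_clock timestamp into an approximate
 * CLOCK_MONOTONIC value by linear interpolation between two anchors
 * (a taken before t, b taken after). Ignores 64-bit multiply
 * overflow for brevity.
 */
static uint64_t sched_to_mono(uint64_t t, const struct clock_anchor *a,
			      const struct clock_anchor *b)
{
	uint64_t dsched = b->sched_ns - a->sched_ns;
	uint64_t dmono = b->mono_ns - a->mono_ns;

	if (dsched == 0)
		return a->mono_ns;
	return a->mono_ns + (t - a->sched_ns) * dmono / dsched;
}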
Again, sorry for being a pain here. CLOCK_PERF would be an easy
solution, but I just want to make sure it's really the best one long
term.
thanks
-john
--