[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4d924f705245c797a19d3a73eb0c1ba0@kernel.org>
Date: Thu, 30 Apr 2020 16:29:23 +0100
From: Marc Zyngier <maz@...nel.org>
To: Will Deacon <will@...nel.org>
Cc: Leo Yan <leo.yan@...aro.org>, Mark Rutland <mark.rutland@....com>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...hat.com>,
Namhyung Kim <namhyung@...nel.org>,
Catalin Marinas <catalin.marinas@....com>,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
Mathieu Poirier <mathieu.poirier@...aro.org>,
Mike Leach <mike.leach@...aro.org>,
Al Grant <Al.Grant@....com>, James Clark <James.Clark@....com>,
tglx@...utronix.de
Subject: Re: [PATCH] arm64: perf_event: Fix time_offset for arch timer
On 2020-04-30 15:58, Will Deacon wrote:
> Hi Leo,
>
> [+Maz and tglx in case I'm barking up the wrong tree]
>
> On Fri, Mar 20, 2020 at 05:35:45PM +0800, Leo Yan wrote:
>> Between the system powering on and kernel's sched clock registration,
>> the arch timer usually has been enabled at the early time and its
>> counter is incremented during the period of the booting up. Thus the
>> arch timer's counter is not completely accounted into the sched clock,
>> and has a delta between the arch timer's counter and sched clock.
>> This
>> delta value should be stored into userpg->time_offset, which later can
>> be retrieved by Perf tool in the user space for sample timestamp
>> calculation.
>>
>> Now userpg->time_offset is assigned to the negative sched clock with
>> '-now', this value cannot reflect the delta between arch timer's
>> counter
>> and sched clock, so Perf cannot use it to calculate the sample time.
>>
>> To fix this issue, this patch calculate the delta between the arch
>> timer's and sched clock and assign the delta to userpg->time_offset.
>> The detailed steps are firstly to convert counter to nanoseconds 'ns',
>> then the offset is calculated as 'now' minus 'ns'.
>>
>> |<------------------- 'ns' ---------------------->|
>> |<-------- 'now' -------->|
>> |<---- time_offset ---->|
>> |-----------------------|-------------------------|
>> ^ ^ ^
>> Power on system sched clock registration Perf starts
>
> FWIW, I'm /really/ struggling to understand the problem here.
>
> If I've grokked it correctly (big 'if'), then you can't just factor in
> what you call "time_offset" in the diagram above, because there isn't
> a guarantee that the counter is zero-initialised at the start.
Even if it was, we have no idea of *when* that was. Think kexec, for a
start. Or spending some variable in firmware because of $REASON.
>
>> Signed-off-by: Leo Yan <leo.yan@...aro.org>
>> ---
>> arch/arm64/kernel/perf_event.c | 19 ++++++++++++++++++-
>> 1 file changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/kernel/perf_event.c
>> b/arch/arm64/kernel/perf_event.c
>> index e40b65645c86..226d25d77072 100644
>> --- a/arch/arm64/kernel/perf_event.c
>> +++ b/arch/arm64/kernel/perf_event.c
>> @@ -1143,6 +1143,7 @@ void arch_perf_update_userpage(struct perf_event
>> *event,
>> {
>> u32 freq;
>> u32 shift;
>> + u64 count, ns, quot, rem;
>>
>> /*
>> * Internal timekeeping for enabled/running/stopped times
>> @@ -1164,5 +1165,21 @@ void arch_perf_update_userpage(struct
>> perf_event *event,
>> userpg->time_mult >>= 1;
>> }
>> userpg->time_shift = (u16)shift;
>> - userpg->time_offset = -now;
>> +
>> + /*
>> + * Since arch timer is enabled ealier than sched clock registration,
>> + * compuate the delta (in nanosecond unit) between the arch timer
>> + * counter and sched clock, assign the delta to time_offset and
>> + * perf tool can use it for timestamp calculation.
>> + *
>> + * The formula for conversion arch timer cycle to ns is:
>> + * quot = (cyc >> time_shift);
>> + * rem = cyc & ((1 << time_shift) - 1);
>> + * ns = quot * time_mult + ((rem * time_mult) >> time_shift);
>> + */
>> + count = arch_timer_read_counter();
>> + quot = count >> shift;
>> + rem = count & ((1 << shift) - 1);
>> + ns = quot * userpg->time_mult + ((rem * userpg->time_mult) >>
>> shift);
>> + userpg->time_offset = now - ns;
>
> Hmm, reading the counter and calculating the delta feels horribly
> approximate to me. It would be much better if we could get hold of the
> initial epoch cycles from the point at which sched_clock was
> initialised
> using the counter. This represents the true cycle delta between the
> counter
> and what sched_clock uses for 0 ns.
I think this is a sensible solution if you want an epoch that starts at
0 with
sched_clock being initialized. The other question is whether it is
possible to
use a different timestamping source for perf that wouldn't need to be
offset.
> Unfortunately, I can't see a straightforward way to grab that
> information.
> It looks like x86 pulls this directly from the TSC driver.
I wonder if we could/should make __sched_clock_offset available even
when
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK isn't defined. It feels like it would
help with this particular can or worm...
M.
--
Jazz is not dead. It just smells funny...
Powered by blists - more mailing lists