[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20120206132546.GA30854@quad>
Date: Mon, 6 Feb 2012 14:25:46 +0100
From: Stephane Eranian <eranian@...gle.com>
To: linux-kernel@...r.kernel.org
Cc: acme@...hat.com, peterz@...radead.org, mingo@...e.hu,
robert.richter@....com, eric.dumazet@...il.com
Subject: [BUG] perf: perf sched warning possibly due to clock granularity on
AMD
Hi,
I am running 3.3.30-rc2 on an AMD Bulldozer system in 64-bit mode.
I was testing perf sched and I ran into an issue. That issue seems
to exist only on AMD and not on Intel systems. It is not PMU related
because I am doing tracing.
I am running a simple ping pong test to stress the context switch
code. Two processes exchanging a byte through a pipe (program provided
below).
Then, I capture a trace using perf sched, and I run sched lat, but
on certain runs, I get:
$ perf sched rec pong 2
$ perf sched lat
Warning: Timestamp below last timeslice flush
Warning: Timestamp below last timeslice flush
Warning: Timestamp below last timeslice flush
Warning: Timestamp below last timeslice flush
Warning: Timestamp below last timeslice flush
Warning: Timestamp below last timeslice flush
Warning: Timestamp below last timeslice flush
Warning: Timestamp below last timeslice flush
Warning: TimesFound 4934 unknown events!
Is this an older tool processing a perf.data file generated by a more recent tool?
I did some investigation in the kernel to figure out what
could cause this problem. I found out that with this measurement, it is
possible to get multiple consecutive samples with the SAME timestamp.
That is possible on AMD because, apparently, local_clock() ends up calling,
a sched_clock() function that is only based on jiffies. That means, the
granularity is a tick (1ms in my case). On Intel systems, at least on mines,
local_clock() ends up calling a routine which uses TSC and thus has a much
greater precision. This is all decided by the sched_clock_stable variable. It
is set at init time for Intel and not used on AMD. Yet, recent AMD systems do
have support for NONSTOP_TSC. It is not clear to me why that would not be enough
(unless it is a cross-socket synchronization issue).
One thing is clear, is that if I enable sched_clock_stable on my AMD system, then
the perf sched warning disappears.
I believe there is really a bug in perf. I am sure there are systems out there
which cannot provide a very fine grain timestamp, thus it is possible to get samples
with identical timestamps. I don't quite understand the sample re-ordering code in
perf but it seems sensitive to timer granularity. I think it should not.
In summary, two issues:
- Why is sched_clock_stable not set or even tested on recent AMD systems?
- perf should not rely on fine granularity timestamps in the re-ordering code
Any comments?
#include <sys/types.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>
#include <errno.h>
#include <unistd.h>
#include <string.h>
#include <signal.h>
#include <syscall.h>
#include <err.h>
#define MAX_CPUS 64
#define NR_CPU_BITS (MAX_CPUS>>3)
int
pin_cpu(pid_t pid, unsigned int cpu)
{
uint64_t my_mask[NR_CPU_BITS];
if (cpu >= MAX_CPUS)
errx(1, "this program supports only up to %d CPUs", MAX_CPUS);
my_mask[cpu>>6] = 1ULL << (cpu&63);
return syscall(__NR_sched_setaffinity, pid, sizeof(my_mask), &my_mask);
}
static volatile int quit;
void sig_handler(int n)
{
quit = 1;
}
static void
do_child(int fr, int fw)
{
char c;
ssize_t ret;
for(;;) {
ret = read(fr, &c, 1);
if (ret < 0)
break;
ret = write(fw, "c", 1);
if (ret < 0)
break;
}
printf("child exited\n");
exit(0);
}
int
main(int argc, char **argv)
{
int ret;
int pr[2], pw[2];
int which_cpu;
pid_t pid;
uint64_t nctx = 0;
ssize_t nbytes;
char c = '0';
int delay;
delay = argc > 1 ? atoi(argv[1]) : 10;
srandom(getpid());
which_cpu = random() % sysconf(_SC_NPROCESSORS_ONLN);
ret = pipe(pr);
if (ret)
err(1, "cannot create read pipe");
ret = pipe(pw);
if (ret)
err(1, "cannot create write pipe");
ret = pin_cpu(getpid(), which_cpu);
if (ret)
err(1, "cannot pin to CPU%d", which_cpu);
printf("Both processes pinned to CPU%d, running for %ds\n", which_cpu, delay);
switch(pid=fork()) {
case -1:
err(1, "cannot create child");
case 0:
close(pr[1]); close(pw[0]);
do_child(pr[0], pw[1]);
exit(1);
}
close(pr[0]);
close(pw[1]);
signal(SIGALRM, sig_handler);
alarm(delay);
while(!quit) {
nbytes = write(pr[1], "c", 1);
nbytes = read(pw[0], &c, 1);
nctx++;
}
close(pr[1]);
close(pw[0]);
return 0;
}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists