[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <55820AEC.6020204@linux.vnet.ibm.com>
Date: Thu, 18 Jun 2015 05:33:56 +0530
From: Hemant Kumar <hemant@...ux.vnet.ibm.com>
To: Arnaldo Carvalho de Melo <acme@...nel.org>
CC: maddy@...ux.vnet.ibm.com, srikar@...ux.vnet.ibm.com,
peterz@...radead.org, linux-kernel@...r.kernel.org,
kvm-ppc@...r.kernel.org, paulus@...ba.org, jolsa@...nel.org,
dsahern@...il.com, namhyung@...nel.org, sukadev@...ux.vnet.ibm.com,
linuxppc-dev@...ts.ozlabs.org, mingo@...nel.org
Subject: Re: [RFC PATCH] perf/kvm: Guest Symbol Resolution for powerpc
Hi Arnaldo,
On 06/16/2015 09:08 PM, Arnaldo Carvalho de Melo wrote:
> Em Tue, Jun 16, 2015 at 08:20:53AM +0530, Hemant Kumar escreveu:
>> "perf kvm {record|report}" is used to record and report the performance
>> profile of any workload on a guest. From the host, we can collect
>> guest kernel statistics which is useful in finding out any contentions
>> in guest kernel symbols for a certain workload.
>>
>> This feature is not available on powerpc because "perf" relies on the
>> "cycles" event (a PMU event) to profile the guest. However, for powerpc,
>> this can't be used from the host because the PMUs are controlled by the
>> guest rather than the host.
>>
>> Due to this problem, we need a different approach to profile the
>> workload in the guest. There exists a tracepoint "kvm_hv:kvm_guest_exit"
>> in powerpc which is hit whenever any of the threads exit the guest
>> context. The guest instruction pointer dumped along with this
>> tracepoint data in the field "pc", can be used as guest instruction
>> pointer while postprocessing the trace data to map this IP to symbol
>> from guest.kallsyms.
>>
>> However, to have some kind of periodicity, we can't use all the kvm
>> exits, rather exits which are bound to happen in certain intervals.
>> HV_DECREMENTER Interrupt forces the threads to exit after an interval
>> of 10 ms.
>>
>> This patch makes use of the "kvm_guest_exit" tracepoint and checks the
>> exit reason for any kvm exit. If it is HV_DECREMENTER, then the
>> instruction pointer dumped along with this tracepoint is retrieved and
>> mapped with the guest kallsyms.
>>
>> This patch is a prototype asking for suggestions/comments as to whether
>> the approach is right or is there any way better than this (like using
>> a different event to profile for, etc) to profile the guest from the
>> host.
>>
>> Thank You.
>>
>> Signed-off-by: Hemant Kumar <hemant@...ux.vnet.ibm.com>
>> ---
>> tools/perf/arch/powerpc/Makefile | 1 +
>> tools/perf/arch/powerpc/util/parse-tp.c | 55 +++++++++++++++++++++++++++++++++
>> tools/perf/builtin-report.c | 9 ++++++
>> tools/perf/util/event.c | 7 ++++-
>> tools/perf/util/evsel.c | 7 +++++
>> tools/perf/util/evsel.h | 4 +++
>> tools/perf/util/session.c | 7 +++--
>> 7 files changed, 86 insertions(+), 4 deletions(-)
>> create mode 100644 tools/perf/arch/powerpc/util/parse-tp.c
>>
>> diff --git a/tools/perf/arch/powerpc/Makefile b/tools/perf/arch/powerpc/Makefile
>> index 6f7782b..992a0d5 100644
>> --- a/tools/perf/arch/powerpc/Makefile
>> +++ b/tools/perf/arch/powerpc/Makefile
>> @@ -4,3 +4,4 @@ LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/dwarf-regs.o
>> LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/skip-callchain-idx.o
>> endif
>> LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/header.o
>> +LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/parse-tp.o
>> diff --git a/tools/perf/arch/powerpc/util/parse-tp.c b/tools/perf/arch/powerpc/util/parse-tp.c
>> new file mode 100644
>> index 0000000..4c6e49c
>> --- /dev/null
>> +++ b/tools/perf/arch/powerpc/util/parse-tp.c
>> @@ -0,0 +1,55 @@
>> +#include "../../util/evsel.h"
>> +#include "../../util/trace-event.h"
>> +#include "../../util/session.h"
>> +
>> +#define KVMPPC_EXIT "kvm_hv:kvm_guest_exit"
>> +#define HV_DECREMENTER 2432
>> +#define HV_BIT 3
>> +#define PR_BIT 49
>> +#define PPC_MAX 63
>> +
>> +/*
>> + * Get the instruction pointer from the tracepoint data
>> + */
>> +u64 arch__get_ip(struct perf_evsel *evsel, struct perf_sample *data)
>> +{
>> + u64 tp_ip = data->ip;
>> + int trap;
>> +
>> + if (!strcmp(KVMPPC_EXIT, evsel->name)) {
> Can't you cache this somewhere? I.e. something like
>
> static int kvmppc_exit = -1;
>
> if (evsel->attr.type != PERF_TRACEPOINT)
> goto out;
>
> if (unlikely(kvmppc_exit == -1)) {
> if (strcmp(KVMPPC_EXIT, evsel->name)))
> goto out;
>
> kvmppc_exit = evsel->attr.config;
> } else (if kvmppc_exit != evsel->attr.config)
> goto out;
Will try this.
>
>> + trap = raw_field_value(evsel->tp_format, "trap", data->raw_data);
>> +
>> + if (trap == HV_DECREMENTER)
>> + tp_ip = raw_field_value(evsel->tp_format, "pc",
>> + data->raw_data);
> out:
>
>> + return tp_ip;
>> +}
>
> Also we have:
>
> u64 perf_evsel__intval(struct perf_evsel *evsel,
> struct perf_sample *sample, const char *name);
>
> So:
>
> trap = perf_evsel__intval(evsel, sample, "trap");
>
> And:
>
> tp_ip = perf_evsel__intval(evsel, sample, "pc");
>
> Makes it a bit shorter and allows for optimizations in how to find that
> field by name made at the evsel code.
Thanks, missed perf_evsel__intval, will use this in the next iteration.
> - Arnaldo
>
>> +
>> +/*
>> + * Get the HV and PR bits and accordingly, determine the cpumode
>> + */
>> +u8 arch__get_cpumode(union perf_event *event, struct perf_evsel *evsel,
>> + struct perf_sample *data)
>> +{
>> + unsigned long hv, pr, msr;
>> + u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
>> +
>> + if (strcmp(KVMPPC_EXIT, evsel->name))
>> + goto ret;
>> +
>> + if (data->raw_data)
>> + msr = raw_field_value(evsel->tp_format, "msr", data->raw_data);
>> + else
>> + goto ret;
>> +
>> + hv = msr & ((long unsigned)1 << (PPC_MAX - HV_BIT));
>> + pr = msr & ((long unsigned)1 << (PPC_MAX - PR_BIT));
>> +
>> + if (!hv && pr)
>> + cpumode = PERF_RECORD_MISC_GUEST_USER;
>> + else
>> + cpumode = PERF_RECORD_MISC_GUEST_KERNEL;
>> +ret:
>> + return cpumode;
>> +}
>> diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
>> index 072ae8a..e3fe5d0 100644
>> --- a/tools/perf/builtin-report.c
>> +++ b/tools/perf/builtin-report.c
>> @@ -141,6 +141,13 @@ out:
>> return err;
>> }
>>
>> +u8 __weak arch__get_cpumode(union perf_event *event,
>> + __maybe_unused struct perf_evsel *evsel,
>> + __maybe_unused struct perf_sample *sample)
>> +{
>> + return event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
>> +}
>> +
>> static int process_sample_event(struct perf_tool *tool,
>> union perf_event *event,
>> struct perf_sample *sample,
>> @@ -155,6 +162,8 @@ static int process_sample_event(struct perf_tool *tool,
>> };
>> int ret;
>>
>> + al.cpumode = arch__get_cpumode(event, evsel, sample);
>> +
> Why do it here? Other tools will need this as well, no? I.e. this should
I may be wrong, but since, we are profiling a guest, I assume that other
tools won't
be needing this except "perf kvm report", right?
> be done at perf_event__preprocess_sample(), that receives &al as a
> return location.
>
> Humm, but evsel is not passed to perf_event__preprocess_sample, argh,
> but yeah, at all perf_event__preprocess_sample() callsites the evsel is
> already resolved, so we just need to add it to
> perf_event__preprocess_sample() sig, now that we have a need for it.
Yeah, we can add evsel to the perf_event__preprocess_sample() params.
> I.e. this is like a userland fix for older kernels, right? Because
> other tools will have to do this patching too?
Yes, it should be a userland fix for older kernels (not older than 3.19,
though
because the tracepoint we are looking for is available from that version).
>> if (perf_event__preprocess_sample(event, machine, &al, sample) < 0) {
>> pr_debug("problem processing %d event, skipping it.\n",
>> event->header.type);
>> diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
>> index 6c6d044..693e37c 100644
>> --- a/tools/perf/util/event.c
>> +++ b/tools/perf/util/event.c
>> @@ -824,9 +824,14 @@ int perf_event__preprocess_sample(const union perf_event *event,
>> struct addr_location *al,
>> struct perf_sample *sample)
>> {
>> - u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
>> struct thread *thread = machine__findnew_thread(machine, sample->pid,
>> sample->tid);
>> + u8 cpumode;
>> +
>> + if (al->cpumode != PERF_RECORD_MISC_CPUMODE_UNKNOWN)
>> + cpumode = al->cpumode;
>> + else
>> + cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
> I.e. here, where we should try to avoid looking at anything in 'al'.
Yes.
>
[SNIP]
Thanks a lot for the review Arnaldo.
Will try the suggestions in the next iteration.
--
Thanks,
Hemant Kumar
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists