Message-ID: <37D7C6CF3E00A74B8858931C1DB2F07753711263@SHSMSX103.ccr.corp.intel.com>
Date: Fri, 23 Jun 2017 17:21:15 +0000
From: "Liang, Kan" <kan.liang@...el.com>
To: "peterz@...radead.org" <peterz@...radead.org>,
"mingo@...hat.com" <mingo@...hat.com>,
"eranian@...gle.com" <eranian@...gle.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
CC: "alexander.shishkin@...ux.intel.com"
<alexander.shishkin@...ux.intel.com>,
"acme@...hat.com" <acme@...hat.com>,
"jolsa@...hat.com" <jolsa@...hat.com>,
"torvalds@...ux-foundation.org" <torvalds@...ux-foundation.org>,
"tglx@...utronix.de" <tglx@...utronix.de>,
"vincent.weaver@...ne.edu" <vincent.weaver@...ne.edu>,
"ak@...ux.intel.com" <ak@...ux.intel.com>,
"Don Zickus (dzickus@...hat.com)" <dzickus@...hat.com>
Subject: RE: [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP
counter
Hi all,
Any comments on the series?
Some users (on both client and server platforms) have reported spurious
NMI watchdog timeouts caused by this issue.
There is now a proposed workaround from tglx, which we are testing.
However, this patch series is believed to be the real fix for the issue,
so it would be better to get the series merged into mainline.
Here is the workaround fix, if you are interested.
https://patchwork.kernel.org/patch/9803033/
https://patchwork.kernel.org/patch/9805903/
Thanks,
Kan
>
> From: Kan Liang <Kan.liang@...el.com>
>
> The dominant motivation is to make it possible to switch the NMI
> watchdog from cycles to ref_cycles on x86. The new NMI watchdog would:
> - Free the widely used, precious cycles fixed counter. For example, the
>   topdown code needs the cycles fixed counter for group counting.
>   Otherwise, it has to fall back to not using groups, which can cause
>   measurement inaccuracy due to multiplexing errors.
> - Avoid the NMI watchdog expiring too fast. Cycles can tick faster than
>   the measured CPU frequency due to Turbo mode. Although we can enlarge
>   the cycles period to work around the fast expiry, the period is still
>   limited by the Turbo ratio and the workaround may eventually fail.
>
> CPU ref_cycles can only be counted on fixed counter 2. It uses a
> pseudo-encoding which GP counters do not recognize.
>
> BUS_CYCLES (0x013c) is another event which is not affected by core
> frequency changes, and it has a constant ratio to CPU ref_cycles.
> BUS_CYCLES can therefore be used as an alternative event for ref_cycles
> on a GP counter.
> A hook is implemented in x86_schedule_events. If fixed counter 2 is
> occupied and a GP counter is assigned, BUS_CYCLES is used to replace
> ref_cycles. A new flag, PERF_X86_EVENT_REF_CYCLES_REP, is introduced in
> hw_perf_event to indicate the replacement.
> To make the switch transparent to the perf tool, counting and sampling
> are also handled specially (a worked example follows this list):
> - For counting, the value read from the counter is multiplied by the
>   constant ratio.
> - For sampling with a fixed period, the BUS_CYCLES period is the
>   ref_cycles period divided by the constant ratio.
> - For sampling with a fixed frequency, the adaptive frequency algorithm
>   works it out from the ref_cycles event's last_period and event count,
>   but the reload value has to be the remaining BUS_CYCLES period.
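>
> For illustration only, assuming a hypothetical constant ratio of 24, the
> scaling works out roughly as in this stand-alone user-space sketch (not
> kernel code, not part of the patch):
>
> #include <stdio.h>
> #include <stdint.h>
>
> #define REF_CYCLES_FACTOR 24ULL  /* hypothetical; the real factor is model specific */
>
> int main(void)
> {
>         uint64_t ref_period = 2400000;                        /* requested ref_cycles period */
>         uint64_t bus_period = ref_period / REF_CYCLES_FACTOR; /* programmed on the GP counter */
>         uint64_t bus_delta  = 100000;                         /* raw BUS_CYCLES delta read back */
>         uint64_t ref_delta  = bus_delta * REF_CYCLES_FACTOR;  /* reported as ref_cycles */
>
>         printf("program period %llu, report delta %llu\n",
>                (unsigned long long)bus_period,
>                (unsigned long long)ref_delta);
>         return 0;
> }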
>
> User-visible RDPMC issue:
> It is impossible to transparently handle the case where ref-cycles is
> read with the RDPMC instruction in user space.
> ref_cycles_factor_for_rdpmc is therefore exposed for RDPMC users.
> The few users who read ref-cycles via RDPMC have to correct the result
> themselves by multiplying it by ref_cycles_factor_for_rdpmc when a GP
> counter is used (a user-space sketch follows).
> The correction only applies to ref-cycles; it does not impact the
> results of other events.
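>
> A minimal user-space sketch of that manual correction could look as
> follows. It is only an illustration: the sysfs path and the way the
> counter index is obtained (normally from the perf_event mmap page) are
> assumptions, and RDPMC must be permitted on the CPU for it to run.
>
> #include <stdio.h>
> #include <stdint.h>
> #include <x86intrin.h>
>
> static uint64_t read_factor(void)
> {
>         /* assumed location of the new attribute */
>         FILE *f = fopen("/sys/devices/cpu/ref_cycles_factor_for_rdpmc", "r");
>         unsigned int factor = 1;
>
>         if (f) {
>                 if (fscanf(f, "%u", &factor) != 1)
>                         factor = 1;
>                 fclose(f);
>         }
>         return factor;
> }
>
> int main(void)
> {
>         int idx = 0;                        /* GP counter index, normally userpg->index - 1 */
>         uint64_t raw = _rdpmc(idx);         /* raw BUS_CYCLES count from the GP counter */
>         uint64_t ref = raw * read_factor(); /* corrected ref-cycles value */
>
>         printf("raw %llu -> ref-cycles %llu\n",
>                (unsigned long long)raw, (unsigned long long)ref);
>         return 0;
> }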
>
> The constant ratio is model specific.
> For models from Nehalem up to, but not including, Skylake, the ratio is
> read from MSR_PLATFORM_INFO.
> For Skylake and later, it can be obtained from CPUID.15H (see the
> user-space example below).
> For Knights Landing, Goldmont and later, the ratio is always 1.
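>
> As a user-space illustration of the CPUID.15H source (mirroring
> skl_get_ref_cycles_factor() in the patch below), the ratio can be
> checked like this; treat it as a sketch, not part of the patch:
>
> #include <stdio.h>
> #include <cpuid.h>
>
> int main(void)
> {
>         unsigned int eax, ebx, ecx, edx;
>
>         /* CPUID.15H: EAX = denominator, EBX = numerator of the ratio */
>         if (!__get_cpuid(0x15, &eax, &ebx, &ecx, &edx) || !eax || !ebx) {
>                 puts("CPUID.15H not usable on this CPU");
>                 return 1;
>         }
>         printf("ref_cycles factor = %u\n", ebx / eax);
>         return 0;
> }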
>
> The old Silvermont/Airmont, Core2 and Atom machines are not covered by
> the patch. The behavior on those machines will not change.
>
> Signed-off-by: Kan Liang <Kan.liang@...el.com>
> ---
>
> Changes since V1:
> - Retry the whole scheduling pass with the 0x0300 replacement event if a
>   0x0300 event fails to schedule.
> - Correct the calculation of event->count, period_left, left and offset
>   in the different cases.
> - Expose the factor to user space.
> - Modify the changelog.
> - There is no transparent way to handle ref-cycles replacement events
>   for RDPMC. RDPMC users have to multiply the result by the factor if
>   the ref-cycles replacement event is scheduled on a GP counter.
>   Other events are not impacted.
>   Few RDPMC users use ref-cycles, so the impact should be limited.
>   I will also send patches to other user tools, such as pmu-tools and
>   PAPI, to minimize the impact.
>
>
> arch/x86/events/core.c       |  92 ++++++++++++++++++++++++++++++++++--
> arch/x86/events/intel/core.c | 110 ++++++++++++++++++++++++++++++++++++++++++-
> arch/x86/events/perf_event.h |   5 ++
> 3 files changed, 203 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index e6f5e4b..18f8d37 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -70,7 +70,7 @@ u64 x86_perf_event_update(struct perf_event *event)
> int shift = 64 - x86_pmu.cntval_bits;
> u64 prev_raw_count, new_raw_count;
> int idx = hwc->idx;
> - u64 delta;
> + u64 delta, adjust_delta;
>
> if (idx == INTEL_PMC_IDX_FIXED_BTS)
> return 0;
> @@ -101,8 +101,47 @@ u64 x86_perf_event_update(struct perf_event *event)
> delta = (new_raw_count << shift) - (prev_raw_count << shift);
> delta >>= shift;
>
> - local64_add(delta, &event->count);
> - local64_sub(delta, &hwc->period_left);
> + /*
> + * Correct the count number if applying ref_cycles replacement.
> + * There is a constant ratio between ref_cycles (event A) and
> + * ref_cycles replacement (event B). The delta result is from event B.
> + * To get accurate event A count, the real delta result must be
> + * multiplied with the constant ratio.
> + *
> + * It is handled differently for sampling and counting.
> + * - Fixed period Sampling: The period is already adjusted in
> + * scheduling for event B. The period_left is the remaining period
> + * for event B. Don't need to modify period_left.
> + * It's enough to only adjust event->count here.
> + *
> + * - Fixed freq Sampling: The adaptive frequency algorithm needs
> + * the last_period and event counts for event A to estimate the next
> + * period for event A. So the period_left is the remaining period
> + * for event A. It needs to multiply the result with the ratio.
> + * However, the period_left is also used to reload the event counter
> + * for event B in x86_perf_event_set_period. It has to be adjusted to
> + * event B's remaining period there.
> + *
> + * - Counting: It's enough to only adjust event->count for perf stat.
> + *
> + * - RDPMC: User can also read the counter directly by rdpmc/mmap.
> + * Users have to handle the adjustment themselves.
> + * For kernel, it only needs to guarantee that the offset is correct.
> + * In x86's arch_perf_update_userpage, the offset will be corrected
> + * if event B is used.
> + */
> + adjust_delta = delta;
> + if (hwc->flags & PERF_X86_EVENT_REF_CYCLES_REP) {
> + adjust_delta = delta * x86_pmu.ref_cycles_factor;
> +
> + if (is_sampling_event(event) && event->attr.freq)
> + local64_sub(adjust_delta, &hwc->period_left);
> + else
> + local64_sub(delta, &hwc->period_left);
> + } else
> + local64_sub(delta, &hwc->period_left);
> +
> + local64_add(adjust_delta, &event->count);
>
> return new_raw_count;
> }
> @@ -865,6 +904,7 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
> if (x86_pmu.start_scheduling)
> x86_pmu.start_scheduling(cpuc);
>
> +retry:
> for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
> cpuc->event_constraint[i] = NULL;
> c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
> @@ -952,6 +992,25 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
> */
> if (x86_pmu.put_event_constraints)
> x86_pmu.put_event_constraints(cpuc, e);
> +
> + /*
> + * 0x0300 is the pseudo-encoding for REF_CPU_CYCLES.
> + * It can only be scheduled on fixed counter 2.
> + * If fixed counter 2 is occupied, the event fails
> + * to schedule.
> + *
> + * In that case, an alternative event, which can be
> + * counted on GP counters, is used to replace the
> + * pseudo-encoding REF_CPU_CYCLES event.
> + *
> + * Here we reset the event and retry the whole
> + * scheduling if there is an unscheduled 0x0300 event.
> + */
> + if (unsched && x86_pmu.ref_cycles_rep &&
> + ((e->hw.config & X86_RAW_EVENT_MASK) == 0x0300)) {
> + x86_pmu.ref_cycles_rep(e);
> + goto retry;
> + }
> }
> }
>
> @@ -1136,6 +1195,14 @@ int x86_perf_event_set_period(struct perf_event *event)
> hwc->last_period = period;
> ret = 1;
> }
> +
> + /*
> + * Adjust left for ref_cycles replacement.
> + * Please refer to the comments in x86_perf_event_update for details.
> + */
> + if ((hwc->flags & PERF_X86_EVENT_REF_CYCLES_REP) &&
> + event->attr.freq)
> + left /= (u64)x86_pmu.ref_cycles_factor;
> /*
> * Quirk: certain CPUs dont like it if just 1 hw_event is left:
> */
> @@ -1823,6 +1890,14 @@ static int __init init_hw_perf_events(void)
> x86_pmu_attr_group.attrs = tmp;
> }
>
> + if (x86_pmu.factor_attrs) {
> + struct attribute **tmp;
> +
> + tmp = merge_attr(x86_pmu_attr_group.attrs, x86_pmu.factor_attrs);
> + if (!WARN_ON(!tmp))
> + x86_pmu_attr_group.attrs = tmp;
> + }
> +
> pr_info("... version: %d\n", x86_pmu.version);
> pr_info("... bit width: %d\n", x86_pmu.cntval_bits);
> pr_info("... generic registers: %d\n", x86_pmu.num_counters);
> @@ -2300,6 +2375,17 @@ void arch_perf_update_userpage(struct perf_event *event,
> }
>
> cyc2ns_read_end(data);
> +
> + /*
> + * offset should be corrected if ref cycles replacement is used.
> + * Please refer to the comments in x86_perf_event_update for details.
> + */
> + if (event->hw.flags & PERF_X86_EVENT_REF_CYCLES_REP) {
> + userpg->offset = __perf_event_count(event);
> + userpg->offset /= (u64)x86_pmu.ref_cycles_factor;
> + if (userpg->index)
> + userpg->offset -= local64_read(&event->hw.prev_count);
> + }
> }
>
> void
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index da9047e..4853795 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -3063,6 +3063,55 @@ static unsigned bdw_limit_period(struct perf_event *event, unsigned left)
> return left;
> }
>
> +static __init unsigned int glm_get_ref_cycles_factor(void)
> +{
> + return 1;
> +}
> +
> +static __init unsigned int nhm_get_ref_cycles_factor(void)
> +{
> + u64 platform_info;
> +
> + rdmsrl(MSR_PLATFORM_INFO, platform_info);
> + return (platform_info >> 8) & 0xff;
> +}
> +
> +static __init unsigned int skl_get_ref_cycles_factor(void)
> +{
> + unsigned int cpuid21_eax, cpuid21_ebx, cpuid21_ecx, unused;
> +
> + cpuid(21, &cpuid21_eax, &cpuid21_ebx, &cpuid21_ecx, &unused);
> + if (!cpuid21_eax || !cpuid21_ebx)
> + return 0;
> +
> + return cpuid21_ebx / cpuid21_eax;
> +}
> +
> +/*
> + * BUS_CYCLES (0x013c) is another event which is not affected by core
> + * frequency changes. It has a constant ratio with the CPU ref_cycles.
> + * BUS_CYCLES could be used as an alternative event for ref_cycles on
> + * GP counter.
> + */
> +void nhm_ref_cycles_rep(struct perf_event *event)
> +{
> + struct hw_perf_event *hwc = &event->hw;
> +
> + hwc->config = (hwc->config & ~X86_RAW_EVENT_MASK) |
> + intel_perfmon_event_map[PERF_COUNT_HW_BUS_CYCLES];
> + hwc->flags |= PERF_X86_EVENT_REF_CYCLES_REP;
> +
> + /* adjust the sample period for ref_cycles replacement event */
> + if (is_sampling_event(event) && hwc->sample_period != 1) {
> +
> + hwc->sample_period /= (u64)x86_pmu.ref_cycles_factor;
> + if (hwc->sample_period < 2)
> + hwc->sample_period = 2;
> + hwc->last_period = hwc->sample_period;
> + local64_set(&hwc->period_left, hwc->sample_period);
> + }
> +}
> +
> PMU_FORMAT_ATTR(event, "config:0-7" );
> PMU_FORMAT_ATTR(umask, "config:8-15" );
> PMU_FORMAT_ATTR(edge, "config:18" );
> @@ -3656,6 +3705,21 @@ static struct attribute *intel_pmu_attrs[] = {
> NULL,
> };
>
> +
> +static ssize_t ref_cycles_factor_for_rdpmc_show(struct device *cdev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + return sprintf(buf, "%u\n", x86_pmu.ref_cycles_factor);
> +}
> +
> +DEVICE_ATTR_RO(ref_cycles_factor_for_rdpmc);
> +
> +static struct attribute *intel_ref_cycles_factor_attrs[] = {
> + &dev_attr_ref_cycles_factor_for_rdpmc.attr,
> + NULL,
> +};
> +
> __init int intel_pmu_init(void)
> {
> union cpuid10_edx edx;
> @@ -3775,7 +3839,11 @@ __init int intel_pmu_init(void)
>
> intel_pmu_pebs_data_source_nhm();
> x86_add_quirk(intel_nehalem_quirk);
> -
> + x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
> + if (x86_pmu.ref_cycles_factor) {
> + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
> + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
> + }
> pr_cont("Nehalem events, ");
> break;
>
> @@ -3835,6 +3903,11 @@ __init int intel_pmu_init(void)
> x86_pmu.lbr_pt_coexist = true;
> x86_pmu.flags |= PMU_FL_HAS_RSP_1;
> x86_pmu.cpu_events = glm_events_attrs;
> + x86_pmu.ref_cycles_factor = glm_get_ref_cycles_factor();
> + if (x86_pmu.ref_cycles_factor) {
> + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
> + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
> + }
> pr_cont("Goldmont events, ");
> break;
>
> @@ -3864,6 +3937,11 @@ __init int intel_pmu_init(void)
>
> X86_CONFIG(.event=0xb1, .umask=0x3f, .inv=1, .cmask=1);
>
> intel_pmu_pebs_data_source_nhm();
> + x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
> + if (x86_pmu.ref_cycles_factor) {
> + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
> + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
> + }
> pr_cont("Westmere events, ");
> break;
>
> @@ -3899,6 +3977,11 @@ __init int intel_pmu_init(void)
> /* UOPS_DISPATCHED.THREAD,c=1,i=1 to count stall cycles*/
> intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] =
> X86_CONFIG(.event=0xb1, .umask=0x01, .inv=1, .cmask=1);
> + x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
> + if (x86_pmu.ref_cycles_factor) {
> + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
> + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
> + }
>
> pr_cont("SandyBridge events, ");
> break;
> @@ -3933,6 +4016,11 @@ __init int intel_pmu_init(void)
> /* UOPS_ISSUED.ANY,c=1,i=1 to count stall cycles */
> intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] =
> X86_CONFIG(.event=0x0e, .umask=0x01, .inv=1, .cmask=1);
> + x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
> + if (x86_pmu.ref_cycles_factor) {
> + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
> + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
> + }
>
> pr_cont("IvyBridge events, ");
> break;
> @@ -3962,6 +4050,11 @@ __init int intel_pmu_init(void)
> x86_pmu.get_event_constraints = hsw_get_event_constraints;
> x86_pmu.cpu_events = hsw_events_attrs;
> x86_pmu.lbr_double_abort = true;
> + x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
> + if (x86_pmu.ref_cycles_factor) {
> + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
> + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
> + }
> pr_cont("Haswell events, ");
> break;
>
> @@ -3998,6 +4091,11 @@ __init int intel_pmu_init(void)
> x86_pmu.get_event_constraints = hsw_get_event_constraints;
> x86_pmu.cpu_events = hsw_events_attrs;
> x86_pmu.limit_period = bdw_limit_period;
> + x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
> + if (x86_pmu.ref_cycles_factor) {
> + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
> + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
> + }
> pr_cont("Broadwell events, ");
> break;
>
> @@ -4016,6 +4114,11 @@ __init int intel_pmu_init(void)
> /* all extra regs are per-cpu when HT is on */
> x86_pmu.flags |= PMU_FL_HAS_RSP_1;
> x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
> + x86_pmu.ref_cycles_factor = glm_get_ref_cycles_factor();
> + if (x86_pmu.ref_cycles_factor) {
> + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
> + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
> + }
>
> pr_cont("Knights Landing/Mill events, ");
> break;
> @@ -4051,6 +4154,11 @@ __init int intel_pmu_init(void)
> skl_format_attr);
> WARN_ON(!x86_pmu.format_attrs);
> x86_pmu.cpu_events = hsw_events_attrs;
> + x86_pmu.ref_cycles_factor = skl_get_ref_cycles_factor();
> + if (x86_pmu.ref_cycles_factor) {
> + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
> + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
> + }
> pr_cont("Skylake events, ");
> break;
>
> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> index 53728ee..3470178 100644
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -68,6 +68,7 @@ struct event_constraint {
> #define PERF_X86_EVENT_EXCL_ACCT 0x0200 /* accounted EXCL event */
> #define PERF_X86_EVENT_AUTO_RELOAD 0x0400 /* use PEBS auto-reload */
> #define PERF_X86_EVENT_FREERUNNING 0x0800 /* use freerunning PEBS */
> +#define PERF_X86_EVENT_REF_CYCLES_REP 0x1000 /* use ref_cycles replacement */
>
>
> struct amd_nb {
> @@ -550,6 +551,8 @@ struct x86_pmu {
> int perfctr_second_write;
> bool late_ack;
> unsigned (*limit_period)(struct perf_event *event, unsigned l);
> + unsigned int ref_cycles_factor;
> + void (*ref_cycles_rep)(struct perf_event *event);
>
> /*
> * sysfs attrs
> @@ -565,6 +568,8 @@ struct x86_pmu {
> unsigned long attr_freeze_on_smi;
> struct attribute **attrs;
>
> + unsigned int attr_ref_cycles_factor;
> + struct attribute **factor_attrs;
> /*
> * CPU Hotplug hooks
> */
> --
> 2.7.4