linux-kernel - Re: [GIT PULL] perf changes for v3.12

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMsRxfLEO15kKrbmtKKXuW-JTtCCgiuXS6wFs9kiLmG1wge24A@mail.gmail.com>
Date:	Tue, 10 Sep 2013 13:28:44 +0200
From:	Stephane Eranian <eranian@...glemail.com>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Ingo Molnar <mingo@...nel.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Arnaldo Carvalho de Melo <acme@...radead.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Andi Kleen <andi@...stfloor.org>
Subject: Re: [GIT PULL] perf changes for v3.12

Hi,


And what was the perf record command line for this crash?


On Mon, Sep 9, 2013 at 12:05 PM, Peter Zijlstra <peterz@...radead.org> wrote:
> On Sat, Sep 07, 2013 at 07:17:28PM -0700, Linus Torvalds wrote:
>> This is new for me, but I suspect it is more related to the new
>> Haswell CPU I have than necessarily the 3.12 perf pull request.
>>
>> Regardless, nothing bad happened, but my dmesg has this in it:
>>
>>    Unexpected number of pebs records 10
>>
>> when I was profiling some git workloads. Full trace appended.
>
>>  ------------[ cut here ]------------
>>  WARNING: CPU: 1 PID: 3270 at
>> arch/x86/kernel/cpu/perf_event_intel_ds.c:1003
>> intel_pmu_drain_pebs_hsw+0x91/0xa0()
>>  Unexpected number of pebs records 10
>>  Modules linked in: fuse ip6t_REJECT nf_conntrack_ipv4
>> nf_conntrack_ipv6 nf_defrag_ipv4 nf_defrag_ipv6 xt_conntrack
>> ip6table_filter nf_conntrack ip6_tables snd_hda_codec_realtek
>> snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep snd_seq
>> snd_seq_device snd_pcm e1000e serio_raw snd_page_alloc snd_timer ptp
>> lpc_ich snd i2c_i801 pps_core mfd_core soundcore uinput nfsd
>> auth_rpcgss nfs_acl lockd sunrpc binfmt_misc hid_logitech_dj i915
>> i2c_algo_bit drm_kms_helper drm i2c_core video
>>  CPU: 1 PID: 3270 Comm: git Not tainted 3.11.0-07749-ge7d33bb5ea82-dirty #4
>>  Hardware name:                  /DH87RL, BIOS
>> RLH8710H.86A.0317.2013.0426.1724 04/26/2013
>>   0000000000000009 ffff88041fa46ca0 ffffffff815fc637 ffff88041fa46ce8
>>   ffff88041fa46cd8 ffffffff81051e78 ffff88040f42f000 ffff88040f42f780
>>   ffff88041fa46ef8 ffffffff81c15860 0000000000000002 ffff88041fa46d38
>>  Call Trace:
>>   <NMI>  [<ffffffff815fc637>] dump_stack+0x45/0x56
>>   [<ffffffff81051e78>] warn_slowpath_common+0x78/0xa0
>>   [<ffffffff81051ee7>] warn_slowpath_fmt+0x47/0x50
>>   [<ffffffff8101b051>] intel_pmu_drain_pebs_hsw+0x91/0xa0
>>   [<ffffffff8101c5d0>] intel_pmu_handle_irq+0x210/0x390
>>   [<ffffffff81604deb>] perf_event_nmi_handler+0x2b/0x50
>>   [<ffffffff81604670>] nmi_handle.isra.3+0x80/0x180
>>   [<ffffffff81604840>] do_nmi+0xd0/0x310
>>   [<ffffffff81603d37>] end_repeat_nmi+0x1e/0x2e
>>   <<EOE>>  [<ffffffff810167df>] perf_events_lapic_init+0x2f/0x40
>>   [<ffffffff81016a50>] x86_pmu_enable+0x260/0x310
>>   [<ffffffff81111d87>] perf_pmu_enable+0x27/0x30
>>   [<ffffffff81112140>] perf_event_context_sched_in+0x80/0xc0
>>   [<ffffffff811127eb>] __perf_event_task_sched_in+0x16b/0x180
>>   [<ffffffff8107c300>] finish_task_switch+0x70/0xa0
>>   [<ffffffff81600f48>] __schedule+0x368/0x7c0
>>   [<ffffffff816013c4>] schedule+0x24/0x70
>>   [<ffffffff816038a4>] retint_careful+0xd/0x21
>>  ---[ end trace 52dad7a9d8d96b5f ]---
>
> So I'm not entirely sure how this can happen. We do _not_ set
> ds->pebs_event_reset[] so once a PEBS assist happens we do not
> automatically re-engage. We wait for the PMI to happen, consume the data
> and reprogram the hardware.
>
> We allow for up to x86_pmu.max_pebs_events (4 on hsw afaik) of PEBS
> entries to be in the buffer so each PEBS capable counter can trigger the
> assist once while we 'wait' for the PMI to hit.
>
> There's a little 'funny' with the hsw PEBS code in that it doesn't
> unconditionally clear the buffer in intel_pmu_drain_pebs_hsw() but that
> would only make a difference if n actually was < 0 -- again something
> that shouldn't ever happen.
>
> That said, there's two peculiar things about your report. First and
> foremost HSW; I'm not at all sure how it deals with a transaction being
> undone if a PEBS assist had happened during it. Does it reset the
> hardware but not remove the data? If so, this could cause excess
> entries.
>
> Stephane, Andi, it this at all possible?, section 18.10.5.1 (June 2013)
> seems to suggest not; the way I read it PEBS assist will cause an abort
> (and the PMI certainly would) and we'll process the entire thing outside
> of transaction context.
>
> That said, 18.10.5.1 mentions that entries from inside a transaction
> will only have 0xb0 and 0xb8 aka perf_record::real_ip and
> perf_record::tsx_tuning set, I don't think __intel_pmu_drain_pebs_nhm()
> will do the right thing for them. In particular, when no
> perf_record::status bit is set we'll drop the event on the ground.
>
> Dropping them means we do not do intel_pmu_save_and_restart() to
> reprogram the hardware and the counter would get 'stuck'. Then again,
> without a status bit we have no idea what event to re-program so we're
> pretty much up shit creek without no stinkin' paddle.
>
> Another possibility is that the meaning of pebs_record::status has
> changed -- it was 'broken' and Peggy agreed and said she would look into
> it, not sure this has been 'fixed' but I couldn't find anything on it.
> That said, I think the code always assumed it was 'sane' so it having
> been fixed shouldn't matter.
>
> Secondly you seem to have triggered the nested NMI replay path. I'm not
> immediately sure how that could have messed up the PEBS state, but it
> sure is a complication.
>
> I think we can do something like the below as a pure code reduction --
> completely untested:
>
> ---
>  arch/x86/kernel/cpu/perf_event_intel_ds.c | 109 +++++++++---------------------
>  1 file changed, 32 insertions(+), 77 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> index 63438aa..bfda50e 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> @@ -178,20 +178,14 @@ struct pebs_record_nhm {
>         u64 status, dla, dse, lat;
>  };
>
> -/*
> - * Same as pebs_record_nhm, with two additional fields.
> - */
>  struct pebs_record_hsw {
> -       struct pebs_record_nhm nhm;
> -       /*
> -        * Real IP of the event. In the Intel documentation this
> -        * is called eventingrip.
> -        */
> -       u64 real_ip;
> -       /*
> -        * TSX tuning information field: abort cycles and abort flags.
> -        */
> -       u64 tsx_tuning;
> +       u64 flags, ip;
> +       u64 ax, bx, cx, dx;
> +       u64 si, di, bp, sp;
> +       u64 r8,  r9,  r10, r11;
> +       u64 r12, r13, r14, r15;
> +       u64 status, dla, dse, lat;
> +       u64 real_ip, tsx_tuning;
>  };
>
>  void init_debug_store_on_cpu(int cpu)
> @@ -789,12 +783,11 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
>                                    struct pt_regs *iregs, void *__pebs)
>  {
>         /*
> -        * We cast to pebs_record_nhm to get the load latency data
> -        * if extra_reg MSR_PEBS_LD_LAT_THRESHOLD used
> +        * We cast to the biggest pebs_record but are careful not to
> +        * unconditionally access the 'extra' entries.
>          */
>         struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> -       struct pebs_record_nhm *pebs = __pebs;
> -       struct pebs_record_hsw *pebs_hsw = __pebs;
> +       struct pebs_record_hsw *pebs = __pebs;
>         struct perf_sample_data data;
>         struct pt_regs regs;
>         u64 sample_type;
> @@ -853,7 +846,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
>         regs.sp = pebs->sp;
>
>         if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format >= 2) {
> -               regs.ip = pebs_hsw->real_ip;
> +               regs.ip = pebs->real_ip;
>                 regs.flags |= PERF_EFLAGS_EXACT;
>         } else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(&regs))
>                 regs.flags |= PERF_EFLAGS_EXACT;
> @@ -912,17 +905,34 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
>         __intel_pmu_pebs_event(event, iregs, at);
>  }
>
> -static void __intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, void *at,
> -                                       void *top)
> +static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
>  {
>         struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
>         struct debug_store *ds = cpuc->ds;
>         struct perf_event *event = NULL;
> +       void *at, *top;
>         u64 status = 0;
> -       int bit;
> +       int bit, n;
> +
> +       if (!x86_pmu.pebs_active)
> +               return;
> +
> +       at  = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
> +       top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
>
>         ds->pebs_index = ds->pebs_buffer_base;
>
> +       n = (top - at) / x86_pmu.pebs_record_size;
> +       if (n <= 0)
> +               return;
> +
> +       /*
> +        * Should not happen, we program the threshold at 1 and do not
> +        * set a reset value.
> +        */
> +       WARN_ONCE(n > x86_pmu.max_pebs_events,
> +                 "Unexpected number of pebs records %d\n", n);
> +
>         for (; at < top; at += x86_pmu.pebs_record_size) {
>                 struct pebs_record_nhm *p = at;
>
> @@ -950,61 +960,6 @@ static void __intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, void *at,
>         }
>  }
>
> -static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
> -{
> -       struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> -       struct debug_store *ds = cpuc->ds;
> -       struct pebs_record_nhm *at, *top;
> -       int n;
> -
> -       if (!x86_pmu.pebs_active)
> -               return;
> -
> -       at  = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
> -       top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
> -
> -       ds->pebs_index = ds->pebs_buffer_base;
> -
> -       n = top - at;
> -       if (n <= 0)
> -               return;
> -
> -       /*
> -        * Should not happen, we program the threshold at 1 and do not
> -        * set a reset value.
> -        */
> -       WARN_ONCE(n > x86_pmu.max_pebs_events,
> -                 "Unexpected number of pebs records %d\n", n);
> -
> -       return __intel_pmu_drain_pebs_nhm(iregs, at, top);
> -}
> -
> -static void intel_pmu_drain_pebs_hsw(struct pt_regs *iregs)
> -{
> -       struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> -       struct debug_store *ds = cpuc->ds;
> -       struct pebs_record_hsw *at, *top;
> -       int n;
> -
> -       if (!x86_pmu.pebs_active)
> -               return;
> -
> -       at  = (struct pebs_record_hsw *)(unsigned long)ds->pebs_buffer_base;
> -       top = (struct pebs_record_hsw *)(unsigned long)ds->pebs_index;
> -
> -       n = top - at;
> -       if (n <= 0)
> -               return;
> -       /*
> -        * Should not happen, we program the threshold at 1 and do not
> -        * set a reset value.
> -        */
> -       WARN_ONCE(n > x86_pmu.max_pebs_events,
> -                 "Unexpected number of pebs records %d\n", n);
> -
> -       return __intel_pmu_drain_pebs_nhm(iregs, at, top);
> -}
> -
>  /*
>   * BTS, PEBS probe and setup
>   */
> @@ -1039,7 +994,7 @@ void intel_ds_init(void)
>                 case 2:
>                         pr_cont("PEBS fmt2%c, ", pebs_type);
>                         x86_pmu.pebs_record_size = sizeof(struct pebs_record_hsw);
> -                       x86_pmu.drain_pebs = intel_pmu_drain_pebs_hsw;
> +                       x86_pmu.drain_pebs = intel_pmu_drain_pebs_nhm;
>                         break;
>
>                 default:
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/