linux-kernel - Re: [GIT PULL] perf fixes

Message-ID: <CABPqkBR2HeDmuTVg5RA=5D0xXJbXxqSxMDf7MW0iCGnurQb7jw@mail.gmail.com>
Date:	Thu, 14 Mar 2013 23:09:59 +0100
From:	Stephane Eranian <eranian@...gle.com>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Ingo Molnar <mingo@...nel.org>,
	Arnaldo Carvalho de Melo <acme@...radead.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Thomas Gleixner <tglx@...utronix.de>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [GIT PULL] perf fixes

Hi,


On Thu, Mar 14, 2013 at 10:06 PM, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> On Thu, Mar 14, 2013 at 1:32 PM, Linus Torvalds
> <torvalds@...ux-foundation.org> wrote:
> >
> > And to make things interesting, I seem to be able to only reproduce
> > this *after* a suspend cycle. That may be just happenstance, since it
> > seemed to be hard to replicate and most of the time it has happened
> > under X with no messages visible at all, but that *seems* to be the
> > pattern.
> >
> > And the one time I got it to happen on the text console, things
> > scrolled off (watchdog warnings due to lockups), but I did get a NULL
> > pointer dereference in intel_pmu_enable_all().
> >
> > I'll try to reproduce it and get a picture,
>
> Theory more or less confirmed.
>
> It does need a suspend/resume cycle, and I have a picture. The oops
> happens immediately when trying to do any perf work after the first
> suspend, before suspending I seem to be able to reliably use perf. It
> could still be just random flakiness, but I don't think so.
>
It could be related to suspend/resume. But were you running perf across
that suspend/resume cycle?

I still don't see how a wrmsrl could corrupt cpuc, though.

>
> The NULL pointer dereference is at intel_pmu_enable_all+0x4d/0xa0 for
> me, which seems to be the load of the
>
>     if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask))
>
> thing. It says
>
>    BUG: unable to handle NULL pointer dereference at 0000000000000028
>
> But that error makes no sense. The code at that EIP is
>
>   48 8b 83 00 02 00 00 mov    0x200(%rbx),%rax     <-- trapping instruction
>
> and the value printed out for %rbx is 0xffff80014f20b8e0, so it should
> *not* be a NULL pointer dereference (and "cpuc" was also used just
> before the wrmsrl).
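As a rough sanity check of the mismatch described above, the sketch below
decodes the displacement from the quoted instruction bytes and adds it to the
reported %rbx. The helper names (mov_disp32, effective_address) are made up
for illustration and are not kernel functions, and the register value is the
one read off the picture, so it may not be exact.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: extract the 32-bit displacement from a 7-byte
 * "REX.W 8b /r" load whose ModRM byte uses mod=10 (base register plus
 * disp32, no SIB byte). It only handles this one encoding.
 */
static int32_t mov_disp32(const uint8_t *insn, size_t len)
{
    assert(len == 7);
    assert(insn[0] == 0x48 && insn[1] == 0x8b); /* REX.W + MOV r64, r/m64 */
    assert((insn[2] >> 6) == 2);                /* mod = 10: disp32 follows */
    assert((insn[2] & 7) != 4);                 /* rm = 100 would mean a SIB byte */

    /* The displacement is stored little-endian after the ModRM byte. */
    uint32_t d = (uint32_t)insn[3]
               | (uint32_t)insn[4] << 8
               | (uint32_t)insn[5] << 16
               | (uint32_t)insn[6] << 24;
    return (int32_t)d;
}

/* Hypothetical helper: base register plus sign-extended displacement. */
static uint64_t effective_address(uint64_t base, int32_t disp)
{
    return base + (uint64_t)(int64_t)disp;
}
```

Feeding it the bytes 48 8b 83 00 02 00 00 gives a displacement of 0x200, and
effective_address(0xffff80014f20b8e0, 0x200) gives 0xffff80014f20bae0, which
is nowhere near the reported fault address of 0x28.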


>
> So I suspect that the "wrmsrl" that was just before that instruction
> does something odd, and the PMU is in some odd state, so that the NULL
> pointer dereference actually has something to do with *that*, rather
> than the instruction itself.
>
> The callchain looks normal. It's
>
>   finish_task_switch ->
>     __perf_event_task_sched_in ->
>       perf_event_context_sched_in ->
>         perf_pmu_enable ->
>           x86_pmu_enable ->
>             intel_pmu_enable_all()
>
> The immediately preceding wrmsrl was done with rax=0xf, rdx=0x7,
> rcx=0x38f according to the register dump (but the picture isn't great,
> so the numbers aren't 100% reliable).
>
MSR 0x38f (in rcx) is GLOBAL_CTRL, so that is valid. And edx:eax =
0x70000000f is a valid value too for the counter bitmask (4 generic
counters + 3 fixed counters).
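To spell out that arithmetic, here is a minimal sketch of how the register
dump maps onto the MSR write. WRMSR writes EDX:EAX to the MSR selected by
ECX, and the GLOBAL_CTRL layout assumed here is generic-counter enable bits
starting at bit 0 and fixed-counter enable bits starting at bit 32. The
helper names are hypothetical, not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_CORE_PERF_GLOBAL_CTRL 0x38f /* MSR index that was in rcx */
#define FIXED_CTR_SHIFT           32    /* assumed: fixed-counter bits start at bit 32 */

/* Hypothetical helper: the 64-bit value WRMSR writes is EDX:EAX. */
static uint64_t wrmsr_value(uint32_t eax, uint32_t edx)
{
    return ((uint64_t)edx << 32) | eax;
}

/*
 * Hypothetical helper: expected enable mask when all nr_generic
 * general-purpose counters and all nr_fixed fixed counters are enabled.
 */
static uint64_t global_ctrl_mask(unsigned nr_generic, unsigned nr_fixed)
{
    assert(nr_generic < 32 && nr_fixed < 32);

    uint64_t generic = (1ULL << nr_generic) - 1;
    uint64_t fixed = (1ULL << nr_fixed) - 1;
    return generic | (fixed << FIXED_CTR_SHIFT);
}
```

With rax=0xf and rdx=0x7 from the dump, wrmsr_value(0xf, 0x7) is
0x70000000f, which matches global_ctrl_mask(4, 3).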

Let's see if we can reproduce the problem on the same Chromebook you
have. I don't have one myself.

> Does this give any clues?
>
>              Linus