Message-ID: <alpine.DEB.2.20.1608200035540.26733@macbook-air>
Date: Sat, 20 Aug 2016 00:44:54 -0400 (EDT)
From: Vince Weaver <vincent.weaver@...ne.edu>
To: Peter Zijlstra <peterz@...radead.org>
cc: Vince Weaver <vincent.weaver@...ne.edu>,
linux-kernel@...r.kernel.org, Borislav Petkov <bp@...e.de>,
Ingo Molnar <mingo@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Huang Rui <ray.huang@....com>
Subject: Re: perf: fuzzer crashes immediately on AMD system
On Fri, 19 Aug 2016, Peter Zijlstra wrote:
> On Thu, Aug 18, 2016 at 10:46:31AM -0400, Vince Weaver wrote:
> > On Thu, 18 Aug 2016, Vince Weaver wrote:
> >
> > > Tried the perf_fuzzer on my A10 fam15h/model13h system with 4.8-rc2 and it
> > > falls over more or less immediately.
> > >
> > > This maps to variable_test_bit()
> > > called by ctx = find_get_context(pmu, task, event);
> > > in kernel/events/core.c:9467
> > >
> > > It happens quickly enough I can probably track down the exact event that
> > > causes this, if needed.
> >
> > I have a one line reproducer:
> >
> > perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
>
> OK, cannot reproduce on my fam15h/model1h. I'll go dig through the
> various manuals to see if I can spot the fail.
>
> Huang could you either prod someone at AMD or do yourself, audit the AMD
> perf code for all the various new models?
This is bizarre; I can't make any sense of the crash.
To recap, the crash looks like this:
BUG: unable to handle kernel paging request at ffffffff85e67600
IP: [<ffffffff810e4cb1>] find_get_context.isra.75+0x28/0x20f
The code in question is:
if (!cpu_online(cpu))
which maps to
test_bit(cpumask_check(cpu), cpumask_bits((cpumask)));
which assembles to
ffffffff810e4ca9: 41 89 cc mov %ecx,%r12d
ffffffff810e4cac: 7f 1e jg ffffffff810e4ccc <find_get_context.isra.75+0x43>
ffffffff810e4cae: 44 89 e0 mov %r12d,%eax
* ffffffff810e4cb1: 48 0f a3 05 87 0f 7f bt %rax,0x7f0f87(%rip) # ffffffff818d5c40 <__cpu_online_mask>
ffffffff810e4cb8: 00
ffffffff810e4cb9: 0f 92 c0 setb %al
ffffffff810e4cbc: 84 c0 test %al,%al
As far as I can see, 0x7f0f87(%rip) resolves to ffffffff818d5c40
(__cpu_online_mask), so by itself it should never be the
ffffffff85e67600 address that causes the fault.
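For reference, the RIP-relative base can be checked by hand from the bytes
in the listing. This is only a minimal sketch: the helper name
rip_relative_target() is made up for illustration, and the 8-byte
instruction length and the displacement are read off the disassembly above
(48 0f a3 05 87 0f 7f 00).

```c
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: resolve a disp32(%rip) operand.
 * RIP-relative addressing is relative to the address of the NEXT
 * instruction, so the instruction length has to be added as well.
 */
uint64_t rip_relative_target(uint64_t insn_addr, unsigned insn_len,
                             int32_t disp32)
{
    /* Sign-extend the 32-bit displacement before adding it. */
    return insn_addr + insn_len + (uint64_t)(int64_t)disp32;
}
```

Calling rip_relative_target(0xffffffff810e4cb1ULL, 8, 0x007f0f87) gives
0xffffffff818d5c40, which agrees with the <__cpu_online_mask> annotation
in the disassembly, so the displacement itself looks sane.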
Oddly, though, RAX at the time of the fault (according to the oops message)
is RAX: 0000000022c8ce30, which seems nonsensical for a CPU number, though I
would not have expected it to cause an invalid memory address. Also oddly,
RDI matches RAX but RCX doesn't, even though from that assembly (ecx is
copied to r12d and then to eax) I would expect RCX to match as well.
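One thing that may be worth checking here: when bt takes its bit offset
from a register and the other operand is in memory, the offset is not
limited to 0..63. With a 64-bit operand size the CPU accesses the quadword
at base + 8 * floor(offset / 64), with the offset treated as signed. The
sketch below models only that address calculation; bt_effective_address()
is a made-up name for illustration, not a kernel function.

```c
#include <stdint.h>

/*
 * Hypothetical model of the quadword address touched by
 * "bt %reg, mem" with 64-bit operand size. The register is a signed
 * bit offset relative to the memory operand.
 */
uint64_t bt_effective_address(uint64_t base, int64_t bit_offset)
{
    /* Floor division by 64, written out so negative offsets round down. */
    int64_t qword_index = bit_offset / 64;
    if (bit_offset % 64 < 0)
        qword_index--;

    /* Each quadword is 8 bytes; unsigned arithmetic wraps as intended. */
    return base + (uint64_t)qword_index * 8;
}
```

With the base ffffffff818d5c40 (__cpu_online_mask) and the RAX value from
the oops, bt_effective_address(0xffffffff818d5c40ULL, 0x22c8ce30) comes out
as 0xffffffff85e67600, which is the faulting address. So it may be the bogus
CPU number in RAX, rather than the RIP-relative base, that produces the bad
address. Where that value comes from is a separate question.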
So very weird. I even wrote a kernel module and dumped the raw kernel
memory to make sure the instruction stream didn't get overwritten somehow,
but as far as I can tell the code in memory matches the disassembly.
Anyway, I am out of time to look at this for now.
Vince