linux-kernel - Re: perf: fuzzer crashes immediately on AMD system

Date:   Sat, 20 Aug 2016 00:44:54 -0400 (EDT)
From:   Vince Weaver <vincent.weaver@...ne.edu>
To:     Peter Zijlstra <peterz@...radead.org>
cc:     Vince Weaver <vincent.weaver@...ne.edu>,
        linux-kernel@...r.kernel.org, Borislav Petkov <bp@...e.de>,
        Ingo Molnar <mingo@...hat.com>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Huang Rui <ray.huang@....com>
Subject: Re: perf: fuzzer crashes immediately on AMD system

On Fri, 19 Aug 2016, Peter Zijlstra wrote:

> On Thu, Aug 18, 2016 at 10:46:31AM -0400, Vince Weaver wrote:
> > On Thu, 18 Aug 2016, Vince Weaver wrote:
> > 
> > > Tried the perf_fuzzer on my A10 fam15h/model13h system with 4.8-rc2 and it
> > > falls over more or less immediately.
> > > 
> > > This maps to variable_test_bit()
> > > 	called by ctx = find_get_context(pmu, task, event);
> > > 		in kernel/events/core.c:9467
> > > 
> > > It happens quickly enough I can probably track down the exact event that 
> > > causes this, if needed.
> > 
> > I have a one line reproducer:
> > 
> > 	perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
> 
> OK, cannot reproduce on my fam15h/model1h. I'll go dig through the
> various manuals to see if I can spot the fail.
> 
> Huang could you either prod someone at AMD or do yourself, audit the AMD
> perf code for all the various new models?

This is bizarre; I can't make any sense of the crash.

To recap, the crash looks like this:
	BUG: unable to handle kernel paging request at ffffffff85e67600
	IP: [<ffffffff810e4cb1>] find_get_context.isra.75+0x28/0x20f

The code in question is:

	if (!cpu_online(cpu))

	which maps to 
	test_bit(cpumask_check(cpu), cpumask_bits((cpumask)));

	which assembles to

	ffffffff810e4ca9:       41 89 cc                mov    %ecx,%r12d
	ffffffff810e4cac:       7f 1e                   jg     ffffffff810e4ccc <find_get_context.isra.75+0x43>
	ffffffff810e4cae:       44 89 e0                mov    %r12d,%eax
*	ffffffff810e4cb1:       48 0f a3 05 87 0f 7f    bt     %rax,0x7f0f87(%rip)        # ffffffff818d5c40 <__cpu_online_mask>
	ffffffff810e4cb8:       00 
	ffffffff810e4cb9:       0f 92 c0                setb   %al
	ffffffff810e4cbc:       84 c0                   test   %al,%al

There is no way that 0x7f0f87(%rip) should ever possibly be the 
ffffffff85e67600 value that causes the fault.

Oddly, though, RAX at the time of the fault (according to the oops message)
is RAX: 0000000022c8ce30, which seems nonsensical for a CPU number but
shouldn't cause an invalid memory address.  Also oddly, RDI matches
RAX but RCX doesn't, even though I think RCX should match given that assembly.
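
The assumption that a bogus RAX cannot produce a bad address is worth
checking, though. With a memory operand and a register bit offset, bt does
not stay within the quadword at the operand address: for a 64-bit operand
it accesses the quadword at base + 8 * floor(bit_offset / 64), with the
offset treated as signed. Below is a minimal user-space sketch of that
arithmetic, not kernel code. The function names are made up for
illustration, and the only inputs are the numbers from the disassembly and
the oops above.

```c
#include <assert.h>
#include <stdint.h>

/* A RIP-relative displacement is added to the address of the NEXT
 * instruction, not the current one. */
static uint64_t rip_relative_target(uint64_t next_insn_addr, int32_t disp)
{
    return next_insn_addr + (uint64_t)(int64_t)disp;
}

/* Address touched by a 64-bit "bt %reg, mem": the bit offset in the
 * register selects a quadword relative to the memory operand, so a large
 * offset reaches far beyond the operand itself. Floor division is used
 * because the offset is signed. */
static uint64_t bt64_effective_addr(uint64_t base, int64_t bit_offset)
{
    int64_t quad = bit_offset / 64;

    if (bit_offset % 64 < 0)
        quad -= 1;              /* round toward negative infinity */
    return base + (uint64_t)(quad * 8);
}

/* Hypothetical self-check using only values quoted in this mail. */
static void check_against_oops(void)
{
    /* bt is 8 bytes long at ...4cb1, so the next instruction is ...4cb9. */
    uint64_t mask = rip_relative_target(0xffffffff810e4cb9ULL, 0x7f0f87);

    assert(mask == 0xffffffff818d5c40ULL);      /* __cpu_online_mask */
    assert(bt64_effective_addr(mask, 0x22c8ce30) ==
           0xffffffff85e67600ULL);              /* faulting address */
}
```

With RAX = 0x22c8ce30 the computed address is ffffffff85e67600, the same
value reported in the paging request. So the displacement itself is fine,
and a garbage cpu value in RAX would be enough on its own to explain the
fault address.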

So very weird.  I even wrote a kernel module and dumped the raw kernel
memory to make sure the instruction stream didn't get overwritten somehow,
but as far as I can tell the code in memory matches the disassembly.

Anyway, I am out of time to look at this for now.

Vince