linux-kernel - Re: perf: fuzzer crashes immediately on AMD system

Message-ID: <alpine.DEB.2.20.1608222057360.29115@macbook-air>
Date:   Mon, 22 Aug 2016 21:02:39 -0400 (EDT)
From:   Vince Weaver <vincent.weaver@...ne.edu>
To:     Huang Rui <ray.huang@....com>
cc:     Peter Zijlstra <peterz@...radead.org>,
        Vince Weaver <vincent.weaver@...ne.edu>,
        linux-kernel@...r.kernel.org, Borislav Petkov <bp@...e.de>,
        Ingo Molnar <mingo@...hat.com>,
        Arnaldo Carvalho de Melo <acme@...nel.org>
Subject: Re: perf: fuzzer crashes immediately on AMD system

On Mon, 22 Aug 2016, Huang Rui wrote:

> Hi Peter, Vince
> 
> On Fri, Aug 19, 2016 at 12:01:30PM +0200, Peter Zijlstra wrote:
> > On Thu, Aug 18, 2016 at 10:46:31AM -0400, Vince Weaver wrote:
> > > On Thu, 18 Aug 2016, Vince Weaver wrote:
> > > 
> > > > Tried the perf_fuzzer on my A10 fam15h/model13h system with 4.8-rc2 and it
> > > > falls over more or less immediately.
> > > > 
> > > > This maps to variable_test_bit()
> > > > 	called by ctx = find_get_context(pmu, task, event);
> > > > 		in kernel/events/core.c:9467
> > > > 
> > > > It happens quickly enough I can probably track down the exact event that 
> > > > causes this, if needed.
> > > 
> > > I have a one line reproducer:
> > > 
> > > 	perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
> > 
> > OK, cannot reproduce on my fam15h/model1h. I'll go dig through the
> > various manuals to see if I can spot the fail.
> > 
> > Huang could you either prod someone at AMD or do yourself, audit the AMD
> > perf code for all the various new models?
> 
> Actually, there might be some NBPMC event changes between model 0h-fh and
> model 10h-1fh. Below are the documents of these two processors:
> 
> http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf
> http://support.amd.com/TechDocs/42300_15h_Mod_10h-1Fh_BKDG.pdf
> 
> In section 3.16, it describes usage of NB Performance Counter Events.

I don't think it's the hardware that's causing the problem.

I've wasted a lot more time on it, and finally figured out how the "bt" 
instruction works, so the assembly more or less makes sense.
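
For anyone following along, here is a rough userspace model of what a variable-index bit test does. This is only a sketch, not the kernel's variable_test_bit(); the names bit_word() and test_bit_model() are made up for illustration. The point is that the bit index selects which word gets read, so a huge index reaches far past the start of the bitmap.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Index of the word that holds bit nr, relative to the bitmap base. */
static size_t bit_word(unsigned long nr)
{
	return nr / BITS_PER_LONG;
}

/*
 * Userspace model of a variable-index bit test. The caller must
 * guarantee that nr lies inside the bitmap; nothing here checks it,
 * much like "bt" with a memory operand, which addresses memory
 * relative to the base according to the bit offset.
 */
static int test_bit_model(unsigned long nr, const unsigned long *addr)
{
	return (addr[bit_word(nr)] >> (nr % BITS_PER_LONG)) & 1UL;
}
```

With a sane index the read stays inside the bitmap. With an index that is really a kernel address, bit_word() lands far outside it, and the read is likely to fault.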

The problem is that the per-cpu amd_uncore struct is being overwritten with 
kernel memory addresses.

This makes uncore[0]->cpu a large number (it's often, but not always, the 
per-cpu address of uncore[1]->cpu). That value then ends up as the bit 
index in variable_test_bit(), which leads to the GPF.
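
A range check before the mask lookup would at least turn the GPF into a detectable error while debugging. The following is a minimal userspace sketch of that idea, not a proposed kernel patch; MODEL_NR_CPUS and checked_cpu_test() are hypothetical names, and the CPU count is an arbitrary assumption.

```c
#include <limits.h>

#define MODEL_NR_CPUS 4UL	/* hypothetical CPU count for this sketch */
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Return 1 or 0 for an in-range cpu, or -1 if cpu is out of range.
 * mask must cover at least MODEL_NR_CPUS bits.
 */
static int checked_cpu_test(unsigned long cpu, const unsigned long *mask)
{
	if (cpu >= MODEL_NR_CPUS)
		return -1;	/* e.g. a pointer value stored in the cpu field */
	return (mask[cpu / BITS_PER_LONG] >> (cpu % BITS_PER_LONG)) & 1UL;
}
```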

I can't figure out what piece of code is overwriting things though.

To complicate things further, I think the 
	amd_uncore_find_online_sibling()
function is broken.  The code could really use more comments.  As far as 
I can tell it is designed so that all siblings share a single amd_uncore 
structure, but in practice this doesn't seem to work because of the way 
the list iterator works.
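
To make the intended design concrete, here is a simplified userspace model of "all siblings share one structure". It is a sketch of the behaviour as I understand it, not the actual amd_uncore code; struct uncore_model, slots[] and model_cpu_online() are hypothetical names.

```c
#include <stddef.h>

#define MODEL_MAX_CPUS 4	/* hypothetical CPU count for this sketch */

struct uncore_model {
	int id;		/* which group of siblings this belongs to */
	int refcnt;	/* number of CPUs sharing this structure */
};

/* Per-CPU slots; NULL means that CPU has not come online yet. */
static struct uncore_model *slots[MODEL_MAX_CPUS];

/*
 * Bring a CPU online: reuse an already-online sibling's structure if
 * one with the same id exists, otherwise keep the CPU's own structure.
 */
static struct uncore_model *model_cpu_online(int cpu, struct uncore_model *mine)
{
	struct uncore_model *shared = mine;
	int other;

	for (other = 0; other < MODEL_MAX_CPUS; other++) {
		struct uncore_model *that = slots[other];

		if (!that || that == mine)
			continue;
		if (that->id == mine->id) {
			shared = that;
			break;
		}
	}

	shared->refcnt++;
	slots[cpu] = shared;
	return shared;
}
```

In this model the second CPU with a given id ends up pointing at the first CPU's structure and its own copy goes unused. If the real code behaved this way, every sibling would see the same cpu field.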

Vince