Message-ID: <alpine.DEB.2.20.1608222057360.29115@macbook-air>
Date: Mon, 22 Aug 2016 21:02:39 -0400 (EDT)
From: Vince Weaver <vincent.weaver@...ne.edu>
To: Huang Rui <ray.huang@....com>
cc: Peter Zijlstra <peterz@...radead.org>,
Vince Weaver <vincent.weaver@...ne.edu>,
linux-kernel@...r.kernel.org, Borislav Petkov <bp@...e.de>,
Ingo Molnar <mingo@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>
Subject: Re: perf: fuzzer crashes immediately on AMD system

On Mon, 22 Aug 2016, Huang Rui wrote:
> Hi Peter, Vince
>
> On Fri, Aug 19, 2016 at 12:01:30PM +0200, Peter Zijlstra wrote:
> > On Thu, Aug 18, 2016 at 10:46:31AM -0400, Vince Weaver wrote:
> > > On Thu, 18 Aug 2016, Vince Weaver wrote:
> > >
> > > > Tried the perf_fuzzer on my A10 fam15h/model13h system with 4.8-rc2 and it
> > > > falls over more or less immediately.
> > > >
> > > > This maps to variable_test_bit()
> > > > called by ctx = find_get_context(pmu, task, event);
> > > > in kernel/events/core.c:9467
> > > >
> > > > It happens quickly enough I can probably track down the exact event that
> > > > causes this, if needed.
> > >
> > > I have a one line reproducer:
> > >
> > > perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
> >
> > OK, cannot reproduce on my fam15h/model1h. I'll go dig through the
> > various manuals to see if I can spot the fail.
> >
> > Huang could you either prod someone at AMD or do yourself, audit the AMD
> > perf code for all the various new models?
>
> Actually, there might be some NBPMC event changes between model 0h-fh and
> model 10h-1fh. Below are the documents of these two processors:
>
> http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf
> http://support.amd.com/TechDocs/42300_15h_Mod_10h-1Fh_BKDG.pdf
>
> In section 3.16, it describes usage of NB Performance Counter Events.

I don't think it's the hardware that's causing the problem.

I've wasted a lot more time on it, and finally figured out how the "bt"
instruction works, so the assembly more or less makes sense.

The problem is that the per-cpu amd_uncore struct is being overwritten with
kernel memory addresses.
This makes uncore[0]->cpu a large number (it's often, but not always, the
per-cpu address of uncore[1]->cpu), which leads to the GPF.
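
To illustrate why a garbage CPU number is fatal in a bit test, here is a
minimal userspace sketch. It is not the kernel code: DEMO_NR_CPUS,
bit_word_offset() and checked_test_bit() are hypothetical names, and the mask
size is an arbitrary assumption. An unchecked bit test picks the word at
index nr / bits-per-word, so a huge nr selects memory far beyond the end of
the mask.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical mask size for the demo; not the kernel's NR_CPUS. */
#define DEMO_NR_CPUS    64
#define BITS_PER_WORD   (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_MASK_WORDS ((DEMO_NR_CPUS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Index of the word that an unchecked bit test would read for bit nr. */
static size_t bit_word_offset(unsigned long nr)
{
	return nr / BITS_PER_WORD;
}

/*
 * Bounds-checked bit test: returns 1 or 0 for a valid bit number and
 * -1 when nr lies outside the mask, instead of reading out of range.
 */
static int checked_test_bit(unsigned long nr, const unsigned long *mask)
{
	assert(mask != NULL);

	if (nr >= DEMO_NR_CPUS)
		return -1;

	return (int)((mask[bit_word_offset(nr)] >> (nr % BITS_PER_WORD)) & 1UL);
}
```

With a sane CPU number the word index stays inside the mask; with a value
that is really a pointer, bit_word_offset() lands far past DEMO_MASK_WORDS,
which is the kind of access that would fault without the range check.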

I can't figure out what piece of code is overwriting things, though.

And to make things complicated, I think the
amd_uncore_find_online_sibling()
function is broken. The code could really use more comments, but I
think it is designed so that all siblings share a single amd_uncore
structure. In practice it looks like this doesn't work because of the way
the list iterator works.

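
For reference, the sharing scheme I believe is intended can be sketched in
plain C as below. This is an assumption about the design, not a copy of the
kernel function: struct demo_uncore, demo_find_online_sibling() and the
slots[] array (standing in for the per-cpu pointers) are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-in for the shared per-node structure. */
struct demo_uncore {
	int id;     /* node this structure describes */
	int cpu;    /* CPU that owns the shared counters */
	int refcnt; /* number of CPUs pointing at this structure */
};

/*
 * Walk the slots of the CPUs already brought up. If one of them holds a
 * different structure with the same id, share that one; otherwise keep
 * our own. Either way, the structure that is returned gains a reference.
 */
static struct demo_uncore *
demo_find_online_sibling(struct demo_uncore *this,
			 struct demo_uncore *slots[], int nslots)
{
	for (int i = 0; i < nslots; i++) {
		struct demo_uncore *that = slots[i];

		if (that == NULL || that == this)
			continue;

		if (that->id == this->id) {
			this = that; /* reuse the sibling's structure */
			break;
		}
	}

	this->refcnt++;
	return this;
}
```

In this sketch the caller stores the returned pointer in its own slot, so
two CPUs with the same id end up pointing at one structure with a refcnt
of 2.
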
Vince