Message-ID: <20190402132200.GA23501@uranus>
Date: Tue, 2 Apr 2019 16:22:00 +0300
From: Cyrill Gorcunov <gorcunov@...il.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: "Lendacky, Thomas" <Thomas.Lendacky@....com>,
"x86@...nel.org" <x86@...nel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Namhyung Kim <namhyung@...nel.org>,
Thomas Gleixner <tglx@...utronix.de>,
Jiri Olsa <jolsa@...hat.com>, Vince Weaver <vince@...ter.net>,
Stephane Eranian <eranian@...gle.com>
Subject: Re: [RFC PATCH v3 0/3] x86/perf/amd: AMD PMC counters and NMI latency
On Tue, Apr 02, 2019 at 03:03:02PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 01, 2019 at 09:46:33PM +0000, Lendacky, Thomas wrote:
> > This patch series addresses issues with increased NMI latency in newer
> > AMD processors that can result in unknown NMI messages when PMC counters
> > are active.
> >
> > The following fixes are included in this series:
> >
> > - Resolve a race condition when disabling an overflowed PMC counter,
> > specifically when updating the PMC counter with a new value.
> > - Resolve handling of active PMC counter overflows in the perf NMI
> > handler and when to report that the NMI is not related to a PMC.
> > - Remove earlier workaround for spurious NMIs by re-ordering the
> > PMC stop sequence to disable the PMC first and then remove the PMC
> > bit from the active_mask bitmap. As part of disabling the PMC, the
> > code will wait for an overflow to be reset.
> >
> > The last patch re-works the order of when the PMC is removed from the
> > active_mask. There was a comment from a long time ago about having
> > to clear the bit in active_mask before disabling the counter because
> > the perf NMI handler could re-enable the PMC again. Looking at the
> > handler today, I don't see that as possible, hence the reordering. The
> > question will be whether the Intel PMC support will now have issues.
> > There is still support for using x86_pmu_handle_irq() in the Intel
> > core.c file. Did Intel have any issues with spurious NMIs in the past?
> > Peter Z, any thoughts on this?
>
> I can't remember :/ I suppose we'll see if anything pops up after these
> here patches. At least then we get a chance to properly document things.
>
> > Also, I couldn't completely get rid of the "running" bit because it
> > is used by arch/x86/events/intel/p4.c. An old commit comment that
> > seems to indicate the p4 code suffered the spurious interrupts:
> > 03e22198d237 ("perf, x86: Handle in flight NMIs on P4 platform").
> > So maybe that partially answers my previous question...
>
> Yeah, the P4 code is magic, and I don't have any such machines left, nor
> do I think does Cyrill who wrote much of that.
It was so long ago :) What I remember off the top of my head is that some of
the counters were broken at the hardware level, so I had to use only one
counter instead of the two present in the system. And there were spurious
NMIs too. I think we can move this "running" bit to a per-cpu variable
declared inside the p4 code only, and so get rid of it in cpu_hw_events?
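As a rough illustration of the per-cpu idea above, here is a userspace model
only, not kernel code and not a tested patch. Every name in it (p4_cpu_state,
p4_mark_running, p4_test_and_clear_running, the MODEL_* sizes) is
hypothetical, and a plain array indexed by CPU number stands in for a real
per-cpu variable:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sizes, chosen only for this model. */
#define MODEL_NR_CPUS     4
#define MODEL_NR_COUNTERS 18

/*
 * Stand-in for state that would live in the p4 code only, rather than
 * in the shared cpu_hw_events structure. One bit per counter.
 */
struct p4_cpu_state {
	uint64_t running;
};

/* A plain array models "one instance per CPU" (assumption of this sketch). */
static struct p4_cpu_state p4_cpu_state[MODEL_NR_CPUS];

static struct p4_cpu_state *p4_state_of(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	return &p4_cpu_state[cpu];
}

/* Record that counter idx was started on this CPU. */
static void p4_mark_running(int cpu, int idx)
{
	assert(idx >= 0 && idx < MODEL_NR_COUNTERS);
	p4_state_of(cpu)->running |= UINT64_C(1) << idx;
}

/*
 * Return whether counter idx was marked running on this CPU and clear
 * the mark, so a late (in-flight) NMI is accounted for at most once.
 */
static bool p4_test_and_clear_running(int cpu, int idx)
{
	struct p4_cpu_state *st = p4_state_of(cpu);
	uint64_t bit;
	bool was_running;

	assert(idx >= 0 && idx < MODEL_NR_COUNTERS);
	bit = UINT64_C(1) << idx;
	was_running = (st->running & bit) != 0;
	st->running &= ~bit;
	return was_running;
}
```

The point of the model is only that the bookkeeping is private to p4 and
kept per CPU: marking a counter on one CPU is not visible from another, and
the test-and-clear helper reports a mark once and then forgets it.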
> I have vague memories of the P4 thing crashing with Vince's perf_fuzzer,
> but maybe I'm wrong.
No, you're correct. p4 crashed many times before we managed to make it
more or less stable. The main problem, though, is that a working p4 box
is really hard to find.
> Ideally we'd find a willing victim to maintain that thing, or possibly
> just delete it, dunno if anybody still cares.
As for me, I would rather mark this p4pmu code as deprecated until there
is a *real* need to support it.
>
> Anyway, I like these patches, but I cannot apply since you send them
> base64 encoded and my script chokes on that.