Message-ID: <20161227231049.GT26852@two.firstfloor.org>
Date: Tue, 27 Dec 2016 15:10:49 -0800
From: Andi Kleen <andi@...stfloor.org>
To: David Carrillo-Cisneros <davidcc@...gle.com>
Cc: Andi Kleen <andi@...stfloor.org>,
Shivappa Vikas <vikas.shivappa@...el.com>,
Peter Zijlstra <peterz@...radead.org>,
Vikas Shivappa <vikas.shivappa@...ux.intel.com>,
linux-kernel <linux-kernel@...r.kernel.org>,
x86 <x86@...nel.org>, Thomas Gleixner <tglx@...utronix.de>,
"Shankar, Ravi V" <ravi.v.shankar@...el.com>,
"Luck, Tony" <tony.luck@...el.com>,
Fenghua Yu <fenghua.yu@...el.com>,
Stephane Eranian <eranian@...gle.com>, hpa@...or.com
Subject: Re: [PATCH 01/14] x86/cqm: Intel Resource Monitoring Documentation
On Tue, Dec 27, 2016 at 01:33:46PM -0800, David Carrillo-Cisneros wrote:
> When using one intel_cmt/llc_occupancy/ cgroup perf_event on one CPU, the
> avg time to do __perf_event_task_sched_out + __perf_event_task_sched_in is
> ~1170 ns.
>
> Most of the time is spent in the cgroup ctx switch (~1120 ns).
>
> When using continuous monitoring in the CQM driver, the avg time to find
> the RMID to write during the PQR context switch is ~16 ns.
>
> Note that this excludes the MSR write. It's only the overhead of finding
> the RMID to write in PQR_ASSOC. Both paths call the same routine to find
> the RMID, so there is about 1100 ns of overhead in perf_cgroup_switch. By
> inspection I assume most of it comes from iterating over the pmu list.
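For reference, the loop in question in kernel/events/core.c looks roughly
like this (simplified sketch from memory, not the exact upstream code; the
per-pmu ctx locking and the actual cgroup sched_out/sched_in work are
elided into a comment):

  static void perf_cgroup_switch(struct task_struct *task, int mode)
  {
          struct perf_cpu_context *cpuctx;
          struct pmu *pmu;
          unsigned long flags;

          local_irq_save(flags);
          rcu_read_lock();

          /* Walks every registered pmu, whether or not it has cgroup events. */
          list_for_each_entry_rcu(pmu, &pmus, entry) {
                  cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);

                  /*
                   * Per pmu: lock the ctx, disable the pmu, sched out /
                   * sched in the cgroup context, enable the pmu, unlock.
                   */
          }

          rcu_read_unlock();
          local_irq_restore(flags);
  }

With all the uncore pmus registered, that walk alone is a lot of pointer
chasing on every context switch, which would fit the ~1100 ns you measured.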
Do Kan's pmu list patches help?
https://patchwork.kernel.org/patch/9420035/
>
> > Or is there some overhead other than the MSR write
> > you're concerned about?
>
> No, that problem is solved with the PQR software cache introduced in the series.
So it's already fixed?
What is the cost with your cache?
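Just to make sure we mean the same thing: I assume the software cache keeps
the last value written to MSR_IA32_PQR_ASSOC per cpu and skips the wrmsr
when the RMID/CLOSID did not change. A minimal sketch of that idea (the
names are illustrative, not necessarily what the series uses):

  struct intel_pqr_state {
          u32 rmid;       /* last RMID written to the MSR */
          u32 closid;     /* last CLOSID written to the MSR */
  };

  static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);

  static void pqr_update_assoc(u32 rmid, u32 closid)
  {
          struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);

          /* The wrmsr is the expensive part; skip it when nothing changed. */
          if (state->rmid == rmid && state->closid == closid)
                  return;

          state->rmid = rmid;
          state->closid = closid;
          wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
  }

If so, the steady-state cost is just the compare, and the wrmsr only shows
up when the RMID actually changes.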
>
>
> > Perhaps some optimization could be done in the code to make it faster,
> > then the new interface wouldn't be needed.
>
> There are some. One on my list is to create a list of pmus with at least
> one cgroup event and iterate over that in perf_cgroup_switch, instead of
> using the "pmus" list. The pmus list has grown a lot recently with the
> addition of all the uncore pmus.
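Something along these lines, I assume (hypothetical sketch; the list and
member names are made up for illustration):

  /* Pmus that currently have at least one cgroup event. */
  static LIST_HEAD(cgrp_pmus);

  static void perf_cgroup_switch(struct task_struct *task, int mode)
  {
          struct pmu *pmu;

          rcu_read_lock();
          /* Only visit pmus that can actually have cgroup events to switch. */
          list_for_each_entry_rcu(pmu, &cgrp_pmus, cgrp_entry) {
                  /* per-pmu cgroup sched_out/sched_in as today */
          }
          rcu_read_unlock();
  }

with pmus added to / removed from cgrp_pmus when their first cgroup event
is created and their last one is destroyed.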
Kan's patches above already do that, I believe.
>
> Despite this optimization, it's unlikely that the whole sched_out +
> sched_in path gets anywhere close to the ~15 ns of the non-perf_event
> approach.
It would be good to see how close we can get. I assume
there is more potential for optimizations and fast pathing.
-Andi