Message-ID: <20150521082448.GM15256@phenom.ffwll.local>
Date: Thu, 21 May 2015 10:24:48 +0200
From: Daniel Vetter <daniel@...ll.ch>
To: Robert Bragg <robert@...bynine.org>
Cc: Peter Zijlstra <peterz@...radead.org>,
dri-devel@...ts.freedesktop.org, David Airlie <airlied@...ux.ie>,
linux-api@...r.kernel.org, intel-gfx@...ts.freedesktop.org,
linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
Paul Mackerras <paulus@...ba.org>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Daniel Vetter <daniel.vetter@...el.com>
Subject: Re: [Intel-gfx] [RFC PATCH 00/11] drm/i915: Expose OA metrics via
perf PMU
On Thu, May 21, 2015 at 12:17:48AM +0100, Robert Bragg wrote:
> On Tue, May 19, 2015 at 3:53 PM, Peter Zijlstra <peterz@...radead.org> wrote:
> > On Fri, May 15, 2015 at 02:07:29AM +0100, Robert Bragg wrote:
> >> On Fri, May 8, 2015 at 5:24 PM, Peter Zijlstra <peterz@...radead.org> wrote:
> >> > On Thu, May 07, 2015 at 03:15:43PM +0100, Robert Bragg wrote:
> >> >
> >> >> I've changed the uapi for configuring the i915_oa specific attributes
> >> >> when calling perf_event_open(2) whereby instead of cramming lots of
> >> >> bitfields into the perf_event_attr config members, I'm now
> >> >> daisy-chaining a drm_i915_oa_event_attr_t structure off of a single
> >> >> config member that's extensible and validated in the same way as the
> >> >> perf_event_attr struct. I've found this much nicer to work with while
> >> >> being neatly extensible too.
> >> >
> >> > This worries me a bit.. is there more background for this?
> >>
> >> Would it maybe be helpful to see the before and after? I had kept this
> >> uapi change in a separate patch for a while locally but in the end
> >> decided to squash it before sending out my updated series.
> >>
> >> Although I did find it a bit awkward with the bitfields, I was mainly
> >> concerned about the extensibility of packing logically separate
> >> attributes into the config members and had heard similar concerns from
> >> a few others who had been experimenting with my patches too.
> >>
> >> A few simple attributes I can think of a.t.m that we might want to add
> >> in the future are:
> >> - control of the OABUFFER size
> >> - a way to ask the kernel to collect reports at the beginning and end
> >> of batch buffers, in addition to periodic reports
> >> - alternative ways to uniquely identify a context to support tools
> >> profiling a single context not necessarily owned by the current
> >> process
> >>
> >> It could also be interesting to expose some counter configuration
> >> through these attributes too. E.g. on Broadwell+ we have 14 'Flexible
> >> EU' counters included in the OA unit reports, each with a 16bit
> >> configuration.
> >>
> >> In a more extreme case it might also be useful to allow userspace to
> >> specify a complete counter config, which (depending on the
> >> configuration) could be over 100 32bit values to select the counter
> >> signals + configure the corresponding combining logic.
> >>
> >> Since this pmu is in a device driver it also seemed reasonably
> >> appropriate to de-couple it slightly from the core perf_event_attr
> >> structure by allowing driver extensible attributes.
> >>
> >> I wonder if it might be less worrisome if the i915_oa_copy_attr() code
> >> were instead a re-usable utility perhaps maintained in events/core.c,
> >> so if other pmu drivers were to follow suit there would be less risk
> >> of a mistake being made here?
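For readers following the uapi discussion: the validation scheme Robert
describes is the perf_copy_attr() pattern. A rough sketch of that
pattern (the field names are made up here, this is not the actual
struct from the patches):

    struct drm_i915_oa_event_attr {
            __u32 size;     /* sizeof(*attr) as userspace knows it */
            __u32 format;   /* hypothetical: OA report format */
            __u64 period_exponent; /* hypothetical: sampling period */
            /* new fields only ever get appended here */
    };

    static int i915_oa_copy_attr(struct drm_i915_oa_event_attr __user *uattr,
                                 struct drm_i915_oa_event_attr *attr)
    {
            u32 size;

            memset(attr, 0, sizeof(*attr));

            if (get_user(size, &uattr->size))
                    return -EFAULT;
            /* I915_OA_ATTR_SIZE_VER0: size of the first struct version */
            if (size < I915_OA_ATTR_SIZE_VER0 || size > PAGE_SIZE)
                    return -EINVAL;

            /* Copy the part of the struct this kernel understands... */
            if (copy_from_user(attr, uattr, min_t(u32, size, sizeof(*attr))))
                    return -EFAULT;

            /* ...and reject attrs whose tail isn't zero, i.e. fields
             * that only a newer kernel would know how to honour. */
            if (size > sizeof(*attr)) {
                    unsigned char __user *addr = (void __user *)uattr + sizeof(*attr);
                    unsigned char __user *end = (void __user *)uattr + size;
                    unsigned char val;

                    for (; addr < end; addr++) {
                            if (get_user(val, addr))
                                    return -EFAULT;
                            if (val)
                                    return -E2BIG;
                    }
            }
            return 0;
    }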
> >
> >
> > So I had a peek at:
> >
> > https://01.org/sites/default/files/documentation/observability_performance_counters_haswell.pdf
> >
> > In an attempt to inform myself of how the hardware worked. But the
> > document is rather sparse (and does not include broadwell).
>
> Thanks. Sorry I can't easily fix this myself, but there is work
> ongoing to update this documentation. In the interim I can try and
> cover some of the key details here...
Yeah the docs are decidedly less than stellar and especially lack
examples for how to use the flexible counters properly :( A few additional
comments from me below.
> > So from what I can gather there's two ways to observe the counters,
> > through MMIO or through the ring-buffer, which in turn seems triggered by
> > a MI_REPORT_PERF_COUNT command.
>
> There are three ways; mmio, MI_REPORT_PERF_COUNT via the ring-buffer
> and periodic sampling where the HW writes into a circular 'oabuffer'.
>
> I think it's best to discount mmio as a debug feature since none of
> the counters are useful in isolation and mmio subverts the latching
> mechanism we get with the other two methods. We typically at least
> want to relate a counter to the number of clock cycles elapsed or the
> gpu time, or thread spawn counts.
>
> This pmu driver is primarily for exposing periodic metrics, while
> it's up to applications to choose to emit MI_REPORT_PERF_COUNT
> commands as part of their command streams so I think we can mostly
> ignore MI_REPORT_PERF_COUNT here too.
Yeah this is a crucial bit here - userspace inserts MI_REPORT_PERF_COUNT
into the gpu batchbuffer. And the gpu stores the samples at the
userspace-provided address in the per-process gpu address space. Which
means that
the kernel never ever sees these samples, nor can it get (efficiently at
least) at the target memory.
The mid-batchbuffer placement of perf samples is important since wrapping
just the entire batch is way too coarse. And timer-based sampling isn't
aligned to the different commands, which means you can't restrict perf
sampling to e.g. only a specific shader. Right now the gpu doesn't
provide any means to restrict sampling except emitting a full pipeline
flush and then wrapping the interesting draw commands with
MI_REPORT_PERF_COUNT.
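To make that concrete, the userspace side looks roughly like this
(modelled on what Mesa does for GL_INTEL_performance_query; the batch
helpers are stand-ins and the opcode should be checked against the PRM):

    /* 3-dword command; opcode per the HSW docs, length bias -2 */
    #define MI_REPORT_PERF_COUNT ((0x28 << 23) | (3 - 2))

    /* Snapshot the OA counters into 'dst' before and after one draw,
     * assuming the 256 byte report format. batch_emit() and
     * batch_emit_reloc() are hypothetical batchbuffer helpers. */
    static void emit_wrapped_draw(struct batch *batch, struct bo *dst)
    {
            emit_pipeline_flush(batch);     /* required, see above */

            batch_emit(batch, MI_REPORT_PERF_COUNT);
            batch_emit_reloc(batch, dst, 0);        /* 'begin' report */
            batch_emit(batch, 0xbeef0001);          /* report id */

            emit_draw(batch);               /* the work being measured */

            batch_emit(batch, MI_REPORT_PERF_COUNT);
            batch_emit_reloc(batch, dst, 256);      /* 'end' report */
            batch_emit(batch, 0xbeef0002);
    }

Counter deltas between the two reports then give per-draw metrics.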
> > [ Now the document talks about shortcomings of this scheme, where the
> > MI_REPORT_PERF_COUNT command can only be placed every other command, but
> > a single command can contain so much computation that this is not fine
> > grained enough -- leading to the suggestion of a timer/cycle based
> > reporting, but that is not actually mentioned afaict ]
>
> It's in there, though unfortunately the documentation isn't very clear
> currently. The 'Performance Event Counting' section seems to be the
> appropriate place to introduce the periodic sampling feature, but just
> reading it myself it really only talks about the limitations of
> reporting like you say. I'll see if I can prod to get this improved.
>
> If you see page 18 "Performance Statistics Registers":
>
> OACONTROL has a 'Timer Period' field and 'Timer Enable'
> OABUFFER points to a circular buffer for periodic reports
> OASTATUS1/2 contain the head/tail pointers
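So on the kernel side the periodic path boils down to draining that
ring, roughly like below. The mask names are invented and head/tail are
treated as plain buffer offsets for simplicity (the real registers hold
addresses; see the docs above for the actual bitfields):

    /* Forward new periodic OA reports to the perf ring. I915_READ()
     * and I915_WRITE() are the usual i915 mmio accessors. */
    u32 head = I915_READ(OASTATUS2) & OABUFFER_HEAD_MASK; /* illustrative */
    u32 tail = I915_READ(OASTATUS1) & OABUFFER_TAIL_MASK; /* illustrative */

    while (head != tail) {
            u8 *report = oa_buffer_vaddr + head; /* cpu map of OABUFFER */

            append_oa_sample(event, report, report_size); /* hypothetical */
            head = (head + report_size) % oa_buffer_size;
    }

    /* Tell the hardware how far we've consumed. */
    I915_WRITE(OASTATUS2, head);    /* simplified */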
>
>
> >
> > Now the MI_REPORT_PERF_COUNT can select a vector (Counter Select) of
> > which events it will write out.
> >
> > This covers the regular 'A' counters. Is this correct?
>
> There isn't a counter select in MI_REPORT_PERF_COUNT commands. I guess
> you either saw 'GTT select' which is for dealing with the different
> address spaces that the destination buffer can reside in, or in the
> overview you saw the mention of the 'Counter Select' field following
> the description of MI_REPORT_PERF_COUNT but just missed that it was
> referring to the OACONTROL register.
Clarification: For userspace-inserted MI_REPORT_PERF_COUNT the gpu forces the
access to the per-process address space, i.e. the GTT select bit isn't
obeyed.
> The counter configuration is shared by all three modes of reading the
> counters. (Though beware of confusing terms here, imho the
> above-mentioned 'Counter Select' would be better described as a 'Report
> Format' which is only relevant to MI_REPORT_PERF_COUNT and periodic
> sampling, not mmio reading)
>
> The A counters have fixed semantics on Haswell so there's nothing to
> configure with these. On Broadwell some of these do become
> configurable as "Flexible EU Counters" which is something for us to
> keep in mind for later.
>
> The semantics of the B counters are determined through the
> configuration of a MUX to select signals of interest and the
> configuration of some fixed-function ('boolean') logic that can affect
> how those signals are combined before feeding into one of the B
> counters. This configuration process is relatively slow, involving on
> the order of 100 register writes.
>
> After configuring the counter semantics, then for periodic and
> MI_REPORT_PERF_COUNT usage (which work in terms of reports), we have
> to choose the layout of those reports.
>
> OACONTROL has a report format field ('Counter Select') that gives us a
> limited way to trade off how many A or B counters are included, and
> the memory bandwidth consumed writing out reports. So it doesn't
> affect the configuration of counters per-se, it's just a way to ignore
> some of the currently configured counters by selecting smaller
> reports. Just looking at the OACONTROL documentation though I see
> something amiss, as this field is marked as pre-Haswell only, which
> it isn't. The possible formats/layouts for Haswell are documented on
> page 10.
The tricky bit to keep in mind is that userspace and the kernel need to
be somewhat in sync about which sampling format is selected. Otherwise
the addresses userspace writes into successive MI_REPORT_PERF_COUNT
commands won't advance by the correct report size (there's unfortunately
no nice autoincrement support for this).
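Concretely both sides have to agree on a stride, something like this
(format names invented, sizes per the Haswell docs' format table):

    /* Report size implied by the OACONTROL 'Counter Select' field. */
    static size_t oa_report_size(int oa_format)
    {
            switch (oa_format) {
            case OA_FORMAT_A13:       return 64;   /* 13 A counters */
            case OA_FORMAT_A29:       return 128;
            case OA_FORMAT_A45_B8_C8: return 256;  /* the full set */
            default:                  return 0;    /* unknown format */
            }
    }

Userspace then advances the destination address of each successive
MI_REPORT_PERF_COUNT by this stride by hand.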
> > Then there are the 'B' counters, which appear to be programmable through
> > the CEC MMIO registers.
>
> The fixed-function logic affecting B counters is configured via those
> CEC registers as well as some per-counter 'report trigger' and 'start
> trigger' registers not currently described. Overall though programming
> the B counters is a combination of the fixed-function logic and the
> MUX configuration that determines the signals that feed into this
> logic, before counting.
>
> >
> > These B events can also be part of the MI_REPORT_PERF_COUNT vector.
> >
> > Right?
>
> Right, part of MI_REPORT_PERF_COUNT and periodic reports.
Imo that's a big reason why the kernel should just forward the raw
samples to userspace. If we try to decode them in the kernel for the
timer-based ring, then userspace needs a dupe of that logic for the
MI_REPORT_PERF samples it collects.
If otoh we just forward the raw samples to userspace the same userspace
code can do the decoding for both the timer-based samples (collected
through perf) and the in-batch samples.
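I.e. with raw forwarding a tool ends up with a single decode path,
sketched here with hypothetical names and an illustrative header size:

    /* Both the PERF_SAMPLE_RAW payload from the perf mmap ring and a
     * buffer written by in-batch MI_REPORT_PERF_COUNT contain the same
     * raw report layout, so one routine serves both. */
    static void read_counters(const uint32_t *report, int n_counters,
                              uint64_t *out)
    {
            const uint32_t *c = report + 4; /* skip report header dwords */
            int i;

            for (i = 0; i < n_counters; i++)
                    out[i] = c[i];          /* widen raw 32 bit counters */
    }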
> > So for me the 'natural' way to represent this in perf would be through
> > event groups. Create a perf event for every single event -- yes this is
> > 53 events.
>
> So when I was first looking at this work I had considered the
> possibility of separate events, and these are some of the things that
> in the end made me forward the hardware's raw report data via a single
> event instead...
>
> There are 100s of possible B counters depending on the MUX
> configuration + fixed-function logic in addition to the A counters. A
> choice would need to be made about whether to expose events for the
> configurable counters that aren't inherently associated with any
> semantics, or instead to define events for counters with specific
> semantics (with 100s of possible counters to define). The latter would
> seem more complex for userspace and the kernel if they both now have
> to understand the constraints on what counters can be used together. I
> guess with either approach we would also need to have some form of
> dedicated group leader event accepting attributes for configuring the
> state that affects the group as a whole, such as the counter
> configuration (3D vs GPGPU vs media etc). I'm not sure where we would
> handle the context-id + drm file descriptor attributes for initiating
> single context profiling but I guess we'd need to authenticate each
> individual event open. It's not clear if we'd configure the report
> layout via the group leader, or try to automatically choose the most
> compact format based on the group members. I'm not sure how pmus
> currently handle the opening of enabled events on an enabled group but
> I think there would need to be limitations in our case that new
> members can't result in a reconfigure of the counters if that might
> lose the current counter values known to userspace.
>
> From a user's pov, there's no real freedom to mix and match which
> counters are configured together, and there's only some limited
> ability to ignore some of the currently selected counters by not
> including them in reports.
>
> Something to understand here is that we have to work with sets of
> pre-defined MUX + fixed-function logic configurations that have been
> validated to give useful metrics for specific use cases, such as
> benchmarking 3D rendering, GPGPU or media workloads.
>
> As it is currently the kernel doesn't need to know anything about the
> semantics of individual counters being selected, so it's currently
> convenient that we can aim to maintain all the counter metadata we
> have in userspace according to the changing needs of tools or drivers
> (e.g. names, descriptions, units, max values, normalization
> equations), de-coupled from the kernel, instead of splitting it
> between the kernel and userspace.
>
> A benefit of being able to change the report size is to reduce memory
> bandwidth usage that can skew measurements. It's possible to request
> the gpu to write out periodic snapshots at a very high frequency (we
> can program a period as low as 160 nanoseconds) and higher frequencies
> can start to expose some interesting details about how the gpu is
> utilized - though with notable observer effects too. How careful we
> are to not waste bandwidth is expected to determine what sampling
> resolutions we can achieve before significantly impacting what we are
> measuring.
>
> Splitting the counters up looked like it could increase the bandwidth
> we use quite a bit. The main difference comes from requiring 64bit
> values instead of the 32bit values in our raw reports. This can be
> offset partly since there are quite a few 'reserved'/redundant A
> counters that don't need forwarding. As an example in the most extreme
> case, instead of an 8 byte perf_event_header + 4 byte raw_size + 256
> byte reports + 4 byte padding every 160ns ~= 1.5GB/s, we might have 33
> A counters (ignoring redundant ones) + 16 configurable counters = 400
> byte struct read_format (using PERF_FORMAT_GROUP) + 8 byte
> perf_event_header every 160ns ~= 2.4GB/s. On the other hand though we
> could choose to forward only 2 or 3 counters of interest at these high
> frequencies which isn't possible currently.
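(Checking the arithmetic: 8 byte perf_event_header + 4 byte raw_size +
256 byte report + 4 byte padding = 272 bytes every 160 ns ~= 1.7e9 B/s
~= 1.5 GiB/s; for the group read, an 8 byte nr field + 49 * 8 byte
values = 400 byte read_format, plus the 8 byte perf_event_header = 408
bytes every 160 ns ~= 2.55e9 B/s ~= 2.4 GiB/s. So the figures above are
in GiB/s.)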
>
> >
> > Use the MMIO reads for the regular read() interface, and use a hrtimer
> > placing MI_REPORT_PERF_COUNT commands, with a counter select mask
> > covering all the events in the current group, for sampling.
>
> Unfortunately due to the mmio limitations and the need to relate
> counters I can't imagine many use cases for directly accessing the
> counters individually via the read() interface.
>
> MI_REPORT_PERF_COUNT commands are really only intended for collecting
> reports in sync with a command stream. We are experimenting currently
> with an extension of my PMU driver that emits MI_REPORT_PERF_COUNT
> commands automatically around the batches of commands submitted by
> userspace so we can do a better job of filtering metrics across many
> gpu contexts, but for now the expectation is that the kernel shouldn't
> be emitting MI_REPORT_PERF_COUNT commands. We emit
> MI_REPORT_PERF_COUNT commands within Mesa for example to implement the
> GL_INTEL_performance_query extension, at the start and end of a query
> around a sequence of commands that the application is interested in
> measuring.
>
> >
> > Then obtain the vector values using PERF_SAMPLE_READ and
> > PERF_FORMAT_READ -- and use the read txn support to consume the
> > ring-buffer vectors instead of the MMIO registers.
>
> It sounds potentially doable to consume periodic OABUFFER reports via
> the read_txn support.
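For reference, the read side Peter describes is the stock perf
group-read interface, nothing i915 specific. A minimal userspace sketch
(error handling elided; assumes the leader was opened with
attr.read_format = PERF_FORMAT_GROUP and no other read_format flags):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    struct group_read {
            uint64_t nr;            /* number of events in the group */
            uint64_t values[64];    /* one value per event */
    };

    /* One read() on the leader returns every counter in the group
     * atomically; siblings were opened with group_fd = leader_fd. */
    static void dump_group(int leader_fd)
    {
            struct group_read gr;
            uint64_t i;

            if (read(leader_fd, &gr, sizeof(gr)) <= 0)
                    return;
            for (i = 0; i < gr.nr; i++)
                    printf("counter %" PRIu64 " = %" PRIu64 "\n",
                           i, gr.values[i]);
    }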
>
> >
> > You can use the perf_event_attr::config to select the counter (A0-A44,
> > B0-B7) and use perf_event_attr::config1 (low and high dword) for the
> > corresponding CEC registers.
> >
>
> Hopefully covered above, but since the fixed-function state is so
> dependent on the MUX configuration I think it currently makes sense to
> treat the MUX plus logic state (including the CEC state) as a
> tightly coupled unit.
>
> The Flexible EU Counters for Broadwell+ could be more amenable to this
> kind of independent configuration, as I don't believe they are
> dependent on the MUX configuration.
>
> One idea that's come up a lot though is the possibility of
> configuring an event with a full MUX + fixed-function state
> description.
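That full-description idea might look something like this at the uapi
level -- purely a strawman, every name below is invented:

    /* Userspace supplies a complete, pre-validated counter config: the
     * MUX programming plus the boolean/CEC register values. */
    struct drm_i915_oa_user_config {
            __u32 n_mux_regs;       /* count of (addr, value) u32 pairs */
            __u64 mux_regs_ptr;     /* userspace pointer to the pairs */
            __u32 n_boolean_regs;   /* CEC and trigger registers */
            __u64 boolean_regs_ptr;
    };

The kernel would presumably still need to whitelist which register
addresses userspace may program, so this moves the counter semantics out
of the kernel without removing validation.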
>
> >
> > This does not require random per-driver ABI extensions for
> > perf_event_attr, nor your custom output format.
> >
> > Am I missing something obvious here?
>
> Definitely nothing 'obvious' since the current documentation is
> notably incomplete a.t.m, but I don't think we were on the same page
> about how the hardware works and our use cases.
>
> Hopefully some of my above comments help clarify some details.
>
> I think I'll hold off from looking at changing anything specific until
> you've had a chance to read my comments above in case you have more
> thoughts and feedback.
Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch