lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMou1-2Wgs9nvw9w8CHoacix8YSVAmuKu4MZJC=QQuCbLAx72w@mail.gmail.com>
Date:	Thu, 21 May 2015 00:17:48 +0100
From:	Robert Bragg <robert@...bynine.org>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	intel-gfx@...ts.freedesktop.org,
	Daniel Vetter <daniel.vetter@...el.com>,
	Jani Nikula <jani.nikula@...ux.intel.com>,
	David Airlie <airlied@...ux.ie>,
	Paul Mackerras <paulus@...ba.org>,
	Ingo Molnar <mingo@...hat.com>,
	Arnaldo Carvalho de Melo <acme@...nel.org>,
	linux-kernel@...r.kernel.org, dri-devel@...ts.freedesktop.org,
	linux-api@...r.kernel.org
Subject: Re: [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU

On Tue, May 19, 2015 at 3:53 PM, Peter Zijlstra <peterz@...radead.org> wrote:
> On Fri, May 15, 2015 at 02:07:29AM +0100, Robert Bragg wrote:
>> On Fri, May 8, 2015 at 5:24 PM, Peter Zijlstra <peterz@...radead.org> wrote:
>> > On Thu, May 07, 2015 at 03:15:43PM +0100, Robert Bragg wrote:
>> >
>> >> I've changed the uapi for configuring the i915_oa specific attributes
>> >> when calling perf_event_open(2) whereby instead of cramming lots of
>> >> bitfields into the perf_event_attr config members, I'm now
>> >> daisy-chaining a drm_i915_oa_event_attr_t structure off of a single
>> >> config member that's extensible and validated in the same way as the
>> >> perf_event_attr struct. I've found this much nicer to work with while
>> >> being neatly extensible too.
>> >
>> > This worries me a bit.. is there more background for this?
>>
>> Would it maybe be helpful to see the before and after? I had kept this
>> uapi change in a separate patch for a while locally but in the end
>> decided to squash it before sending out my updated series.
>>
>> Although I did find it a bit awkward with the bitfields, I was mainly
>> concerned about the extensibility of packing logically separate
>> attributes into the config members and had heard similar concerns from
>> a few others who had been experimenting with my patches too.
>>
>> A few simple attributes I can think of a.t.m that we might want to add
>> in the future are:
>> - control of the OABUFFER size
>> - a way to ask the kernel to collect reports at the beginning and end
>> of batch buffers, in addition to periodic reports
>> - alternative ways to uniquely identify a context to support tools
>> profiling a single context not necessarily owned by the current
>> process
>>
>> It could also be interesting to expose some counter configuration
>> through these attributes too. E.g. on Broadwell+ we have 14 'Flexible
>> EU' counters included in the OA unit reports, each with a 16bit
>> configuration.
>>
>> In a more extreme case it might also be useful to allow userspace to
>> specify a complete counter config, which (depending on the
>> configuration) could be over 100 32bit values to select the counter
>> signals + configure the corresponding combining logic.
>>
>> Since this pmu is in a device driver it also seemed reasonably
>> appropriate to de-couple it slightly from the core perf_event_attr
>> structure by allowing driver extensible attributes.
>>
>> I wonder if it might be less worrisome if the i915_oa_copy_attr() code
>> were instead a re-usable utility perhaps maintained in events/core.c,
>> so if other pmu drivers were to follow suite there would be less risk
>> of a mistake being made here?
>
>
> So I had a peek at:
>
>   https://01.org/sites/default/files/documentation/observability_performance_counters_haswell.pdf
>
> In an attempt to inform myself of how the hardware worked. But the
> document is rather sparse (and does not include broadwell).

Thanks. Sorry I can't easily fix this myself, but there is work
ongoing to update this documentation. In the interim I can try and
cover some of the key details here...

>
> So from what I can gather there's two ways to observe the counters,
> through MMIO or trough the ring-buffer, which in turn seems triggered by
> a MI_REPORT_PERF_COUNT command.

There are three ways; mmio, MI_REPORT_PERF_COUNT via the ring-buffer
and periodic sampling where the HW writes into a circular 'oabuffer'.

I think it's best to discount mmio as a debug feature since none of
the counters are useful in isolation and mmio subverts the latching
mechanism we get with the other two methods. We typically at least
want to relate a counter to the number of clock cycles elapsed or the
gpu time, or thread spawn counts.

This pmu  driver is primarily for exposing periodic metrics, while
it's up to applications to choose to emit MI_REPORT_PERF_COUNT
commands as part of their command streams so I think we can mostly
ignore MI_REPORT_PERF_COUNT here too.

>
> [ Now the document talks about shortcomings of this scheme, where the
> MI_REPORT_PERF_COUNT command can only be placed every other command, but
> a single command can contain so much computation that this is not fine
> grained enough -- leading to the suggestion of a timer/cycle based
> reporting, but that is not actually mentioned afaict ]

It's in there, though unfortunately the documentation isn't very clear
currently. The 'Performance Event Counting' section seems to be the
appropriate place to introduce the periodic sampling feature, but just
reading it myself it really only talks about the limitations of
reporting like you say. I'll see if I can prod to get this improved.

If you see page 18 "Performance Statistics Registers":

OACONTROL has a 'Timer Period' field and 'Timer Enable'
OABUFFER points to a circular buffer for periodic reports
OASTATUS1/2 contain the head/tail pointers


>
> Now the MI_REPORT_PERF_COUNT can select a vector (Counter Select) of
> which events it will write out.
>
> This covers the regular 'A' counters. Is this correct?

There isn't a counter select in MI_REPORT_PERF_COUNT commands. I guess
you either saw 'GTT select' which is for dealing with the different
address spaces that the destination buffer can reside in, or in the
overview you saw the mention of the 'Counter Select' field following
the description of MI_REPORT_PERF_COUNT but just missed that it was
referring to the OACONTROL register.

The counter configuration is shared by all three modes of reading the
counters. (Though beware of confusing terms here, imho the above
motioned 'Counter Select' would be better described as a 'Report
Format' which is only relevant to MI_REPORT_PERF_COUNT and periodic
sampling, not mmio reading)

The A counters have fixed semantics on Haswell so there's nothing to
configure with these. On Broadwell some of these do become
configurable as "Flexible EU Counters" which is something for us to
keep in mind for later.

The semantics of the B counters are determined through the
configuration of a MUX to select signals of interest and the
configuration of some fixed-function ('boolean') logic that can affect
how those signals are combined before feeding into one of the B
counters. This configuration process is relatively slow, involving in
the order of 100 register writes.

After configuring the counter semantics, then for periodic and
MI_REPORT_PERF_COUNT usage (which work in terms of reports), we have
to choose the layout of those reports.

OACONTROL has a report format field ('Counter Select') that gives us a
limited way to trade off how many A or B counters are included, and
the memory bandwidth consumed writing out reports. So it doesn't
affect the configuration of counters per-se, it's just a way to ignore
some of the currently configured counters by selecting smaller
reports. Just looking at the OACONTROL documentation though I see
something amiss, as this field isn't only pre-Haswell. The possible
formats/layouts for Haswell are documented on page 10.


>
> Then there are the 'B' counters, which appear to be programmable through
> the CEC MMIO registers.

The fixed-function logic affecting B counters is configured via those
CEC registers as well as some per-counter 'report trigger' and 'start
trigger' registers not currently described. Overall though programming
the B counters is a combination of the fixed-function logic and the
MUX configuration that determines the signals that feed into this
logic, before counting.

>
> These B events can also be part of the MI_REPORT_PERF_COUNT vector.
>
> Right?

Right, part of MI_REPORT_PERF_COUNT and periodic reports.

>
>
> So for me the 'natural' way to represent this in perf would be through
> event groups. Create a perf event for every single event -- yes this is
> 53 events.

So when I was first looking at this work I had considered the
possibility of separate events, and these are some of the things that
in the end made me forward the hardware's raw report data via a single
event instead...

There are 100s of possible B counters depending on the MUX
configuration + fixed-function logic in addition to the A counters. A
choice would need to be made about whether to expose events for the
configurable counters that aren't inherently associated with any
semantics, or instead defining events for counters with specific
semantics (with 100s of possible counters to define). The later would
seem more complex for userspace and the kernel if they both now have
to understand the constraints on what counters can be used together. I
guess with either approach we would also need to have some form of
dedicated group leader event accepting attributes for configuring the
state that affects the group as a whole, such as the counter
configuration (3D vs GPGPU vs media etc). I'm not sure where we would
handle the context-id + drm file descriptor attributes for initiating
single context profiling but guess we'd need to authenticate each
individual event open. It's not clear if we'd configure the report
layout via the group leader, or try to automatically choose the most
compact format based on the group members. I'm not sure how pmus
currently handle the opening of enabled events on an enabled group but
I think there would need to be limitations in our case that new
members can't result in a reconfigure of the counters if that might
loose the current counter values known to userspace.

>From a user's pov, there's no real freedom to mix and match which
counters are configured together, and there's only some limited
ability to ignore some of the currently selected counters by not
including them in reports.

Something to understand here is that we have to work with sets of
pre-defined MUX + fixed-function logic configurations that have been
validated to give useful metrics for specific use cases, such as
benchmarking 3D rendering, GPGPU or media workloads.

As it is currently the kernel doesn't need to know anything about the
semantics of individual counters being selected, so it's currently
convenient that we can aim to maintain all the counter meta data we
have in userspace according to the changing needs of tools or drivers
(e.g. names, descriptions, units, max values, normalization
equations), de-coupled from the kernel, instead of splitting it
between the kernel and userspace.

A benefit of being able to change the report size is to reduce memory
bandwidth usage that can skew measurements. It's possible to request
the gpu to write out periodic snapshots at a very high frequency (we
can program a period as low as 160 nanoseconds) and higher frequencies
can start to expose some interesting details about how the gpu is
utilized - though with notable observer effects too. How careful we
are to not waste bandwidth is expected to determine what sampling
resolutions we can achieve before significantly impacting what we are
measuring.

Splitting the counters up looked like it could increase the bandwidth
we use quite a bit. The main difference comes from requiring 64bit
values instead of the 32bit values in our raw reports. This can be
offset partly since there are quite a few 'reserved'/redundant A
counters that don't need forwarding. As an example in the most extreme
case, instead of an 8 byte perf_event_header + 4byte raw_size + 256
byte reports + 4 byte padding every 160ns ~= 1.5GB/s, we might have 33
A counters (ignoring redundant ones) + 16 configurable counters = 400
byte struct read_format (using PERF_FORMAT_GROUP) + 8 byte
perf_event_header every 160ns ~= 2.4GB/s. On the other hand though we
could choose to forward only 2 or 3 counters of interest at these high
frequencies which isn't possible currently.

>
> Use the MMIO reads for the regular read() interface, and use a hrtimer
> placing MI_REPORT_PERF_COUNT commands, with a counter select mask
> covering the all events in the current group, for sampling.

Unfortunately due to the mmio limitations and the need to relate
counters I can't imagine many use cases for directly accessing the
counters individually via the read() interface.

MI_REPORT_PERF_COUNT commands are really only intended for collecting
reports in sync with a command stream. We are experimenting currently
with an extension of my PMU driver that emits MI_REPORT_PERF_COUNT
commands automatically around the batches of commands submitted by
userspace so we can do a better job of filtering metrics across many
gpu contexts, but for now the expectation is that the kernel shouldn't
be emitting MI_REPORT_PERF_COUNT commands. We emit
MI_REPORT_PERF_COUNT commands within Mesa for example to implement the
GL_INTEL_performance_query extension, at the start and end of a query
around a sequence of commands that the application is interested in
measuring.

>
> Then obtain the vector values using PERF_SAMPLE_READ and
> PERF_FORMAT_READ -- and use the read txn support to consume the
> ring-buffer vectors instead of the MMIO registers.

It sounds potentially doable to consume periodic OABUFFER reports via
the read_txn support.

>
> You can use the perf_event_attr::config to select the counter (A0-A44,
> B0-B7) and use perf_event_attr::config1 (low and high dword) for the
> corresponding CEC registers.
>

Hopefully covered above, but since the fixed-function state is so
dependent on the MUX configuration I think it currently makes sense to
treat the MUX plus logic state (including the CEC state) a tightly
coupled unit.

The Flexible EU Counters for Broadwell+ could be more amenable to this
kind of independent configuration, as I don't believe they are
dependant on the MUX configuration.

One idea that's come up a lot though is having the possibility of
being able to configure an event with a full MUX + fixed-function
state description.

>
> This does not require random per driver ABI extentions for
> perf_event_attr, nor your custom output format.
>
> Am I missing something obvious here?

Definitely nothing 'obvious' since the current documentation is
notably incomplete a.t.m, but I don't think we were on the same page
about how the hardware works and our use cases.

Hopefully some of my above comments help clarify some details.

I think I'll hold off from looking at changing anything specific until
you've had a chance to read my comments above in case you have more
thoughts and feedback.


Thanks for looking at this,
- Robert
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ