[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20151016094306.GA3604@twins.programming.kicks-ass.net>
Date: Fri, 16 Oct 2015 11:43:06 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Robert Bragg <robert@...bynine.org>
Cc: intel-gfx@...ts.freedesktop.org,
Daniel Vetter <daniel.vetter@...el.com>,
Chris Wilson <chris@...is-wilson.co.uk>,
Sourab Gupta <sourab.gupta@...el.com>,
Zhenyu Wang <zhenyuw@...ux.intel.com>,
Jani Nikula <jani.nikula@...ux.intel.com>,
David Airlie <airlied@...ux.ie>,
Ingo Molnar <mingo@...nel.org>,
Kan Liang <kan.liang@...el.com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Zheng Yan <zheng.z.yan@...el.com>,
Mark Rutland <mark.rutland@....com>,
Matt Fleming <matt.fleming@...el.com>,
dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
linux-api@...r.kernel.org
Subject: Re: [RFC 0/6] Non perf based Gen Graphics OA unit driver
On Tue, Sep 29, 2015 at 03:39:03PM +0100, Robert Bragg wrote:
> - We're bridging two complex architectures
>
> To review this work I think it will be relevant to have a good
> general familiarity with Gen graphics (e.g. thinking about the OA
> unit's interaction with the command streamer and execlist
> scheduling) as well as our userspace architecture and how we're
> consuming OA data within Mesa to implement the
> INTEL_performance_query extension.
>
> On the flip side here, its necessary to understand the perf
> userspace interface (for most this is hidden by tools so the details
> aren't common knowledge) as well as the internal design, considering
> that the PMU we're looking at seems to break several current design
> assumptions. I can only claim a limited familiarity with perf's
> design, just as a result of this work.
Right; but a little effort and patience on both sides should get us
there I think. At worst we'll both learn something new ;-)
> - The current OA PMU driver breaks some significant design assumptions.
>
> Existing perf pmus are used for profiling work on a cpu and we're
> introducing the idea of _IS_DEVICE pmus with different security
> implications, the need to fake cpu-related data (such as user/kernel
> registers) to fit with perf's current design, and adding _DEVICE
> records as a way to forward device-specific status records.
There are more devices with counters on than GPUs, so I think it might
make sense to look at extending perf to better deal with this.
> The OA unit writes reports of counters into a circular buffer,
> without involvement from the CPU, making our PMU driver the first of
> a kind.
Agreed, this is somewhat 'odd' from where we are today.
> Perf supports groups of counters and allows those to be read via
> transactions internally but transactions currently seem designed to
> be explicitly initiated from the cpu (say in response to a userspace
> read()) and while we could pull a report out of the OA buffer we
> can't trigger a report from the cpu on demand.
>
> Related to being report based; the OA counters are configured in HW
> as a set while perf generally expects counter configurations to be
> orthogonal. Although counters can be associated with a group leader
> as they are opened, there's no clear precedent for being able to
> provide group-wide configuration attributes and no obvious solution
> as yet that's expected to be acceptable to upstream and meets our
> userspace needs.
I'm not entirely sure what you mean with group-wide configuration
attributes; could you elaborate?
> We currently avoid using perf's grouping feature
> and forward OA reports to userspace via perf's 'raw' sample field.
> This suits our userspace well considering how coupled the counters
> are when dealing with normalizing. It would be inconvenient to split
> counters up into separate events, only to require userspace to
> recombine them.
So IF you were using a group, a single read from the leader can return
you a vector of all values (PERF_FORMAT_GROUP), this avoids having to
do that recombine.
Another option would be to view the arrival of an OA vector in the
datastream as an 'event' and generate a PERF_RECORD_READ in the perf
buffer (which again can use the GROUP vector format).
> Related to counter orthogonality; we can't time share the OA unit,
> while event scheduling is a central design idea within perf for
> allowing userspace to open + enable more events than can be
> configured in HW at any one time.
So we have other PMUs that cannot do this; Gen OA would not be unique in
this. Intel PT for example only allows a single active event.
That said; earlier today I saw:
https://www.youtube.com/watch?v=9J3BQcAeHpI&list=PLe6I3NKr-I4J2oLGXhGOeBMEjh8h10jT3&index=7
where exactly this feature was mentioned as not fitting well into the
existing GPU performance interfaces (GL_AMD_performance_monitor /
GL_INTEL_performance_query).
So there is hardware (Nvidia) out there that does support this. Also
mentioned was that this hardware has global and local counters, where
the local ones are specific to a rendering context. That is not unlike
the per-cpu / per-task stuff perf does.
> The OA unit is not designed to
> allow re-configuration while in use. We can't reconfigure the OA
> unit without loosing internal OA unit state which we can't access
> explicitly to save and restore. Reconfiguring the OA unit is also
> relatively slow, involving ~100 register writes. From userspace Mesa
> also depends on a stable OA configuration when emitting
> MI_REPORT_PERF_COUNT commands and importantly the OA unit can't be
> disabled while there are outstanding MI_RPC commands lest we hang
> the command streamer.
Right; see the PERF_PMU_CAP_EXCLUSIVE stuff.
> - We may be making some technical compromises a.t.m for the sake of
> using perf.
>
> perf_event_open() requires events to either relate to a pid or a
> specific cpu core, while our device pmu relates to neither. Events
> opened with a pid will be automatically enabled/disabled according
> to the scheduling of that process - so not appropriate for us.
Right; the traditional cpu/pid mapping doesn't work well for devices;
but maybe, with some work, we can create something like that
global/local render context from it; although I've no clue what form
that would need at this time.
> When
> an event is related to a cpu id, perf ensures pmu methods will be
> invoked via an inter process interrupt on that core. To avoid
> invasive changes our userspace opens OA perf events for a specific
> cpu.
Some of that might still make sense in the sense that GPUs are subject
to the NUMA topology of machines. I would think you would want most
such things to be done on the node the device is attached to.
Granted, this might not be a concern for Intel graphics, but it might be
relevant for some of the discrete GPUs.
> - I'm not confident our use case benefits much from building on perf:
>
> We aren't using existing perf based tooling with our PMU. Existing
> tools typically assume you're profiling work running on a cpu, e.g.
> expecting samples to be associated with instruction pointers and
> user/kernel registers and aiming to represent metrics in relation
> to application source code. We're forwarding fake register values
> and userspace needs needs to know how to decode the raw OA reports
> before anything can be reported to a user.
>
> With the buffering done by the OA unit I don't think we currently
> benefit from perf's mmapped circular buffer interface. We already
> have a decoupled producer and consumer and since we have to copy out
> of the OA buffer, it would work well for us to hide that copy in
> a simpler read() based interface.
>
>
> - Logistically it might be more practical to contain this to the
> graphics stack.
>
> It seems fair to consider that if we can't see a very compelling
> benefit to building on perf, then containing this work to
> drivers/gpu/drm/i915 may simplify the review process as well as
> future maintenance and development.
> Peter; I wonder if you would tend to agree too that it could make sense
> for us to go with our own interface here?
Sorry this took so long; this wanted a well considered response and
those tend to get delayed in light of 'urgent' stuff.
While I can certainly see the pain points and why you would rather not
deal with them. I think it would make Linux a better place if we could
manage to come up with a generic interface that would work for 'all'
GPUs (and possibly more devices).
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists