linux-kernel - [RFC] perf_events: support for uncore a.k.a. nest units

Message-ID: <4B560ACD.4040206@linux.vnet.ibm.com>
Date:	Tue, 19 Jan 2010 11:41:01 -0800
From:	Corey Ashford <cjashfor@...ux.vnet.ibm.com>
To:	LKML <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...e.hu>,
	Andi Kleen <andi@...stfloor.org>,
	Paul Mackerras <paulus@...ba.org>,
	Stephane Eranian <eranian@...glemail.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Xiao Guangrong <xiaoguangrong@...fujitsu.com>
CC:	Dan Terpstra <terpstra@...s.utk.edu>,
	Philip Mucci <mucci@...s.utk.edu>,
	Maynard Johnson <mpjohn@...ibm.com>, Carl Love <cel@...ibm.com>
Subject: [RFC] perf_events: support for uncore a.k.a. nest units

-----
Intro
-----
One subject that hasn't been addressed since the introduction of perf_events in
the Linux kernel is support for "uncore" or "nest" unit events.  Uncore is the
term Intel engineers use for units that are off-core but still on the same die
as the cores, and "nest" means exactly the same thing to IBM Power processor
engineers.  I will use the term uncore for brevity and because it's in common
parlance, but the issues and design possibilities below are relevant to both.
I will also broaden the term so that uncore also refers to PMUs that are
located off the processor chip altogether.

Contents
--------
1. Why support PMUs in uncore units?  Is there anything interesting to look at?
2. How do uncore events differ from core events?
3. Why does a CPU need to be assigned to manage a particular uncore unit's events?
4. How do you encode uncore events?
5. How do you address a particular uncore PMU?
6. Event rotation issues with uncore PMUs
7. Other issues?
8. Feedback?

----
1. Why support PMUs in uncore units?  Is there anything interesting to look at?
----

Today, many x86 chips contain uncore units, and we think that it's likely that 
the trend will continue, as more devices - I/O, memory interfaces, shared 
caches, accelerators, etc. - are integrated onto multi-core chips.  As these 
devices become more sophisticated and more workload is diverted off-core, 
engineers and performance analysts are going to want to look at what's happening 
in these units so that they can find bottlenecks.

In addition, we think that even off-chip I/O and interconnect devices are likely 
to gain PMUs because engineers will want to find bottlenecks in their massively 
parallel systems.

----
2. How do uncore events differ from core events?
----

The main difference is that uncore events are most likely not going to be tied
to a particular Linux task, or even a CPU context.  Uncore units are resources
that are in some sense system-wide, though they may not really be accessible
system-wide in some architectures.  In the case of accelerators and I/O devices,
it's likely they will run asynchronously from the cores, and thus keeping track 
of events on a per-task basis doesn't make a lot of sense.  The other existing 
mode in perf_events is a per-CPU context, and it turns out that this mode does 
match up with uncore units well, though the choice of which CPU to use to manage 
that uncore unit is going to need to be arch-dependent and may involve other 
issues as well, such as minimizing access latency between the uncore unit and 
the CPU which is managing it.

----
3. Why does a CPU need to be assigned to manage a particular uncore unit's events?
----

* The control registers of the uncore unit's PMU need to be read and written, 
and that may be possible only from a subset of processors in the system.
* A processor is needed to rotate the event list on the uncore unit on every 
tick for the purposes of event scheduling.
* Because of access latency issues, we may want the CPU to be close in locality 
to the PMU.

It seems like a good idea to let the kernel decide which CPU to use to monitor a 
particular uncore event, based on the location of the uncore unit, and possibly 
current system load balance.  The user will not want to have to figure out this 
detailed information.
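
As a rough illustration of the selection policy suggested above, here is a
minimal sketch in Python (not kernel code).  The CpuInfo record, its fields,
and the ordering of the criteria (reachability first, then access latency, then
load) are all assumptions made for the example; a real implementation would be
arch-specific.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuInfo:
    cpu: int          # logical CPU number
    can_access: bool  # can this CPU read/write the uncore PMU's registers?
    latency_ns: int   # hypothetical access latency to the uncore unit
    load: float       # hypothetical current load, 0.0 (idle) to 1.0 (busy)


def pick_managing_cpu(cpus):
    """Return the CPU number chosen to manage an uncore PMU, or None."""
    # Only CPUs that can reach the PMU's control registers are candidates.
    candidates = [c for c in cpus if c.can_access]
    if not candidates:
        return None
    # Prefer the lowest access latency, then the lightest load; the CPU
    # number is a final tie-breaker so the choice is deterministic.
    best = min(candidates, key=lambda c: (c.latency_ns, c.load, c.cpu))
    return best.cpu
```

The user would never call anything like this directly; the point is only that
the inputs (which CPUs can reach the unit, how far away they are, how busy
they are) are things the kernel knows and the user generally does not.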

----
4. How do you encode uncore events?
----
Uncore events will need to be encoded in the config field of the perf_event_attr 
struct using the existing PERF_TYPE_RAW encoding.  64 bits are available in the 
config field, and that may be sufficient to support events on most systems. 
However, due to the proliferation and added complexity of PMUs we envision, we
might want to add another 64-bit config field (perhaps called config_extra or
config2) to encode any extra attributes that might be needed.  The exact
encoding used, just as for the current encoding for core events, will be on a 
per-arch and possibly per-system basis.

----
5. How do you address a particular uncore PMU?
----

This one is going to be very system- and arch-dependent, but it seems fairly 
clear that we need some sort of addressing scheme that can be 
system/arch-defined by the kernel.

From a hierarchical perspective, here's an example of possible uncore PMU
locations in a large system:

1) Per-core - units that are shared between all hardware threads in a core
2) Per-node - units that are shared between all cores in a node
3) Per-chip - units that are shared between all nodes in a chip
4) Per-blade - units that are shared between all chips on a blade
5) Per-rack - units that are shared between all blades in a rack

Addressing option 1)

Reuse the cpu argument: cpu would be interpreted differently if an uncore unit 
is specified (via the perf_event_attr struct's config field).

For the hypothetical system described above, we'd want to have an address that 
contains enough address bits for each of the above.  For example:

bits    field
------  -----
3..0    PMU number 0-15  /* which of several identical PMUs is addressed */
7..4    core id 0-15
8..8    node id 0-1
11..9   chip id 0-7
16..12  blade id 0-31
23..17  rack id 0-127

These fields would be exposed via /usr/include/linux/perf_events_uncore_addr.h 
(for example).  How you actually assign these numbers to actual hardware is, 
again, system-design dependent, and may be influenced by the use of a 
hypervisor, or other software which allocates resources available to the system 
dynamically.
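
To make the example layout concrete, here is a minimal Python sketch that packs
and unpacks such an address.  The field names and the helper functions are
invented for illustration; only the bit positions come from the table above.
Note that the 7-bit rack field can hold values 0 through 127.

```python
# (name, lowest bit, width in bits) for each field of the example layout.
FIELDS = [
    ("pmu", 0, 4),
    ("core", 4, 4),
    ("node", 8, 1),
    ("chip", 9, 3),
    ("blade", 12, 5),
    ("rack", 17, 7),
]


def pack_uncore_addr(**ids):
    """Pack field values (default 0) into a single integer address."""
    unknown = set(ids) - {name for name, _, _ in FIELDS}
    if unknown:
        raise ValueError(f"unknown field(s): {sorted(unknown)}")
    addr = 0
    for name, low, width in FIELDS:
        value = ids.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        addr |= value << low
    return addr


def unpack_uncore_addr(addr):
    """Split an address back into a dict of field values."""
    return {
        name: (addr >> low) & ((1 << width) - 1)
        for name, low, width in FIELDS
    }
```

For example, pack_uncore_addr(rack=1, blade=4, chip=2, node=1, core=3, pmu=5)
yields 0x24535, and unpack_uncore_addr(0x24535) returns the same six values.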

How does the user discover the mapping between the hardware made available to 
the system and the addresses shown above?  Again, this is system-dependent, and 
probably outside the scope of this proposal.  In other words, I don't know how 
to do this in a general way, though I could probably put something together for 
a particular system.

Addressing option 2)

Have the kernel create nodes for each uncore PMU in /sys/devices/system or
another pseudo file system, such as the existing /proc/device-tree on Power
systems.  The user tool could explore /sys/devices/system or
/proc/device-tree, and the user could then specify the path of the requested
PMU via a string which the kernel could interpret.  An overly simplistic
example would be "/sys/devices/system/pmus/blade4/cpu0/vectorcopro1".  If we
settled on a common tree root to use, we could specify only the relative path
name, "blade4/cpu0/vectorcopro1".

One way to provide this extra "PMU path" argument to sys_perf_event_open()
would be to add a bit to the flags argument that says we're adding a PMU path
string onto the end of the argument list.
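
As a sketch of how a user-space tool might explore such a tree, the Python
function below walks a directory hierarchy and returns the relative path of
each leaf directory.  It assumes that every leaf directory stands for one PMU
and that the tree root is passed in by the caller; the pmus/ tree itself is
hypothetical and does not exist today.

```python
import os


def list_pmu_paths(root):
    """Return the sorted relative paths of all leaf directories under root.

    Assumption for this sketch: each leaf directory represents one PMU.
    """
    paths = []
    for dirpath, dirnames, _filenames in os.walk(root):
        # A directory with no subdirectories is treated as a PMU node.
        if not dirnames and dirpath != root:
            rel = os.path.relpath(dirpath, root)
            paths.append(rel.replace(os.sep, "/"))
    return sorted(paths)
```

Given a root containing blade4/cpu0/vectorcopro1, the function would return
that relative path as a string, which is the form the user would then hand
back to the kernel.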

This path-string-based addressing option seems to be more flexible in the long
run, and does not have as serious an issue in mapping PMUs to user space; the
kernel essentially exposes to user space all of the available PMUs for the
current partition.  This might create more work for the kernel side, but should
make the system more transparent for user-space tools.  Another system- or at 
least arch-dependent tool would have to be written for user space to help users 
navigate the device tree to find the PMU they want to use.  I don't think it 
would make sense to build that capability into perf, because the software would 
be arch- or system-dependent.

It could be argued that we should use a common user space tree to represent PMUs 
for all architectures and systems, so that the arch-independent perf code would 
be able to display available uncore PMUs.  That may be a goal that's very hard 
to achieve because of the wide variation in architectures.  Any thoughts on that?

----
6. Event rotation issues with uncore PMUs
----

Currently, the perf_events code rotates the set of events assigned to a CPU or 
task on every system tick, so that event scheduling collisions on a PMU are 
mitigated.  This turns out to cause problems for uncore units for two reasons - 
inefficiency and CPU load.

a) Rotation of a set of events across more than one PMU causes inefficient rotation.

Consider the following event list; the letter designates the PMU and the number 
is the event number on that PMU.
A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4 C5

After one rotation, you can see that the event list will be:

C5 A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4

and then

C4 C5 A1 A2 A3 B1 B2 B3 B4 C1 C2 C3

Notice how the relative positions of the A and B PMU events haven't changed
even after two (or even five) rotations, so those PMUs will schedule their
events in the same order for some time.  This will skew the multiplexing so
that some events will be scheduled much less often than they should or could be.
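
The effect can be checked with a small Python simulation of the single shared
list.  This is only a model of the list order, not of the kernel's scheduler;
the function names are invented for the example.

```python
from collections import deque

EVENTS = "A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4 C5".split()


def rotate_shared(events, ticks):
    """Rotate one shared event list `ticks` times, last event to the front."""
    d = deque(events)
    d.rotate(ticks)  # a positive count moves items from the tail to the head
    return list(d)


def order_within_pmu(events, pmu):
    """Return the events of one PMU in the order they appear in the list."""
    return [e for e in events if e.startswith(pmu)]


if __name__ == "__main__":
    for ticks in range(6):
        rotated = rotate_shared(EVENTS, ticks)
        print(ticks, order_within_pmu(rotated, "A"),
              order_within_pmu(rotated, "B"))
```

For ticks 0 through 5 the A events always appear in the order A1 A2 A3 and the
B events in the order B1 B2 B3 B4; only the C events change their relative
order during that time.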

What we'd like to have happen is that events for each PMU be rotated in their 
own lists.  For example, before rotation:

A1 A2 A3
B1 B2 B3 B4
C1 C2 C3 C4 C5

After rotation:

A3 A1 A2
B4 B1 B2 B3
C5 C1 C2 C3 C4
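
A minimal Python sketch of per-PMU rotation follows.  It keeps one list per
PMU and rotates each independently; the helper names and the grouping by the
first letter of the event name are assumptions made for the example.  The
direction of rotation is a free choice; what matters is that every event in a
PMU's list reaches the head of that list once per full cycle of that list.

```python
from collections import deque


def make_pmu_lists(events):
    """Group event names such as "B3" into one deque per PMU letter."""
    lists = {}
    for event in events:
        lists.setdefault(event[0], deque()).append(event)
    return lists


def rotate_each(lists, step=1):
    """Rotate every PMU's list independently by `step` positions."""
    for d in lists.values():
        d.rotate(step)


def heads_over_time(lists, ticks, step=1):
    """Record which event is at the head of each list on each tick."""
    seen = {pmu: [] for pmu in lists}
    for _ in range(ticks):
        for pmu, d in lists.items():
            seen[pmu].append(d[0])
        rotate_each(lists, step)
    return seen
```

With the twelve events from the example, each A event is at the head of its
list once every three ticks, each B event once every four, and each C event
once every five, regardless of how many events the other PMUs have.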

We've got some ideas about how to make this happen, using either separate lists, 
or placing them on separate CPUs.

b) Access to some PMU uncore units may be quite slow due to the interconnect 
that is used.  This can place a burden on the CPU if it is done every system tick.

This can be addressed by keeping a counter, on a per-PMU context basis, that
reduces the rate of event rotations.  Setting the rotation period to three, for
example, would cause event rotations in that context to happen on every third
tick, instead of every tick.  We think that the kernel could measure the amount
of time it is taking to do a rotate, and then dynamically decrease the rotation
rate if it's taking too long; "rotation rate throttling" in other words.
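
A minimal Python sketch of such a counter is shown below.  The class name, the
time budget, the doubling policy, and the upper bound on the period are all
assumptions made for the example; the text above only says that the period
should grow when rotation is taking too long.

```python
class RotationThrottle:
    """Per-PMU-context counter that rotates only every `period` ticks."""

    def __init__(self, period=1, max_period=64):
        self.period = period          # rotate on every `period`-th tick
        self.max_period = max_period  # hypothetical cap on the period
        self.ticks_since_rotate = 0

    def tick(self):
        """Call once per system tick; return True if a rotation is due."""
        self.ticks_since_rotate += 1
        if self.ticks_since_rotate >= self.period:
            self.ticks_since_rotate = 0
            return True
        return False

    def adjust(self, rotate_cost_ns, budget_ns):
        """Lengthen the period if the last rotation exceeded its budget."""
        if rotate_cost_ns > budget_ns:
            self.period = min(self.period * 2, self.max_period)
```

With period=3, tick() returns True on the third, sixth, ninth tick and so on.
Calling adjust() after each rotation with the measured cost gives the
throttling behavior: a context whose PMU is slow to access ends up rotating
less often, while a fast one keeps rotating on every tick.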

----
7. Other issues?
----

This section left blank for now.

----
8. Feedback?
----

I'd appreciate any feedback you might have on this topic.  You can contact me 
directly at the email address below, or better yet, reply to LKML.

-- 
Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
cjashfor@...ibm.com
