lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250521225049.132551-1-tony.luck@intel.com>
Date: Wed, 21 May 2025 15:50:18 -0700
From: Tony Luck <tony.luck@...el.com>
To: Fenghua Yu <fenghuay@...dia.com>,
	Reinette Chatre <reinette.chatre@...el.com>,
	Maciej Wieczor-Retman <maciej.wieczor-retman@...el.com>,
	Peter Newman <peternewman@...gle.com>,
	James Morse <james.morse@....com>,
	Babu Moger <babu.moger@....com>,
	Drew Fustini <dfustini@...libre.com>,
	Dave Martin <Dave.Martin@....com>,
	Anil Keshavamurthy <anil.s.keshavamurthy@...el.com>,
	Chen Yu <yu.c.chen@...el.com>
Cc: x86@...nel.org,
	linux-kernel@...r.kernel.org,
	patches@...ts.linux.dev,
	Tony Luck <tony.luck@...el.com>
Subject: [PATCH v5 00/29] x86/resctrl telemetry monitoring

These patches are based on tip x86/cache branch. HEAD at time of
snapshot is:

54d14f25664b ("MAINTAINERS: Add reviewers for fs/resctrl")

These patches are also available in the rdt-aet-v5 branch at:

Link: git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git

Changes (based on feedback from Reinette and a bug report from Chen Yu)
since v4 was posted here:

Link: https://lore.kernel.org/all/20250429003359.375508-1-tony.luck@intel.com/

Change map indexed by patch numbers in v4. Some patches have been merged,
split, dropped, or re-ordered. The v5 patch numbers are referred to
by their 4-digit git format-patch numbers in an attempt to avoid
confusion.

=== 1 ===

v4 patch was focussed on removing rdt_mon_features bitmap
which included moving all the mon_evt structure definitions
into an array. Reinette noted that this array would mean the
rdt_resource::evt_list is no longer needed.

v5 splits up the changes into three parts:
0001:	Moves mon_evt structures into an array (now named
	mon_event_all[]) and replaces use of rdt_resource::evt_list
	with iteration of enabled events in the array.
0002:	Replace resctrl_arch_is_llc_occupancy_enabled() with
	resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID)
	(ditto for the mbm*enabled() inline functions)
0003:	Remove remaining use of rdt_mon_features.

=== 2 ===

0004:	Fix typos. Change parameter of resctrl_is_mbm_event() to enum.
	s/QOS_NUM_MBM_EVENTS/QOS_NUM_L3_MBM_EVENTS/.
	s/MBM_EVENT_IDX/MBM_STATE_IDX/. Rewrite get_arch_mbm_state()
	in same simple style as get_mbm_state()

=== 3 ===

Dropped this patch. No immediate plans for other mbm monitor events
that could be used as input to the "mba_MBps" feedback control.

=== 4 ===

Also dropped. The rdt_resource::evt_list no longer exists, so no need
to rearrange code to build it at mount time.

=== 5 ===

Dropped the Kconfig changes for now. This means that intel_aet.c is
always built with CONFIG_X86_CPU_RESCTRL=y. Will need to revisit when
the CONFIG_INTEL_PMT_DISCOVERY driver is upstream.

=== 6 ===

0005:	Added comments that fake interface is deliberately crafted
	with parameters to exercise multiple aggregators per package
	and insufficient RMIDs supported.

=== 7 ===

0006:	Rename check_domain_header() to domain_header_is_valid()

=== 8 ===

Split into two parts:
0007:	Better names for functions. Use "()" consistently in
	commit message when naming functions.
0008:	Better description that this change is just for domain add.

=== 9 ===

0009:	New commit message with background and rationale for change.

=== 10 ===

0010:	More context in commit message. Dropped an unnecessary container_of()
	Made domain_add_cpu_ctrl() match domain_add_cpu_mon() with simple
	path when adding a CPU to existing domain.

=== 11 ===

0011:	Added rational for rename of rdt_mon_domain and rdt_hw_mon_domain
	structures. Fixed alignment in structure definitions. Fixed broken
	fir tree ordering.

=== 12 ===

Split into two parts:
0012:	Make mon_data and rmid_read structures point to mon_evt instead of
	just holding the event enum.
0013:	The "read from any cpu" part.
	Fixed bug reported by Chen Yu for use of smp_processor_id()
	New shortlog description. Fixed "cpumast" typo. Separated
	problem description from solution.
	Fixed reverse fir tree.
	Avoid "usually" comment (new comments in the helper function
	that moved out of __mon_event_count().

=== 13 ===

0014:	New direction. Don't bind specific value display formats
	to specific events, limiting other architectures to follow in
	the footsteps for the first to implement an event. Instead
	allow architecture to specify how many binary fixed-point bits
	are used for each event.

=== 14 ===

0015:	Add period to end of sentence in comment for resctrl_arch_pre_mount().
	Use atomic_try_cmpxchg() instead of atomic_cmpxchg().

=== 15 ===

0016:	Updated commit comment to avoid "with code".
	Dropped initialization of rdt_hw_resource::rid.


=== 16 ===
=== 17 ===
=== 18 ===
=== 20 ===
=== 21 ===
	These were "first part", "second part" ... of enumeration and
	the sanity check for adequate MMIO space to match expectations
	from the XML file layout description.

0017:
0018:
0019:
	Now describe what actions each part is doing. Building the
	struct event_group fields as needed for each patch.
	Split the fields into sets that are initialized from XML
	files, and fields used by resctrl code to manage groups.
	Fixed Link: lines with real URL to the Intel-PMT git repo.
	Changed type of guid from int to u32.
	Changed configure_events() return value from bool to standard
	integer error code (and use -ENOMEM, -EINVAL where appropriate).
	Document mmio_info structure and add an ascii art picture to the
	commit comment showing how it is used.
	Use kzalloc() instead of kmalloc()
	Add a helper function skip_this_region() so that counting
	regions and allocation for regions will do the same thing.


=== 19 ===

0020:	Add description of layout of MMIO counters to commit comment.

=== 22 ===

0021:	Fixed mmio address range check in intel_aet_read_event()
	Changed return code from -EINVAL to -EIO to meet expectations
	of rdtgroup_mondata_show().
	Changed name of VALID_BIT define to DATA_VALID to indicate that
	it shows that the value in a counter is valid (as opposed to the
	counter itself).
	Added check in resctrl_arch_rmid_read() that remainder of the
	function after the check for RDT_RESOURCE_PERF_PKG has been
	passed a RDT_RESOURCE_L3 resource.

=== 23 ===

0022:	Fix typo s/domsins/domains/
	Kept definition of struct rdt_perf_pkg_mon_domain in architecture
	code. Reinette comment "This may thus be ok like this for now".
	Since this only contains the rdt_domain_hdr, there isn't anything
	extra that file system code could look at even if it somehow
	wanted to.
	Things may be different for a more complex resource that has
	to maintain additional per-domain state that file system code
	may need to be aware of.

=== 24 ===

	I merged old patch 24 into new 0016

=== 25 ===

0023:	Unchanged

=== 26 ===

0024:	Split the hard-to-read rdt_check_option() function into two that
	have names that convey what they do: rdt_is_option_force_enabled()
	and rdt_is_option_force_disabled().

=== 27 ===

0025:	Updated commit comment and kerneldoc comment to note that
	pmt_event:num_rmids field is initialized from data in the
	XML file, but may be overwritten.
	Added min() operation to make sure num_rmids cannot be increased
	when processing additional event groups.

=== 28 ===

0026:	In V4 this added to the mount time initialization of the per-resource
	event lists. But that code has been dropped from this series. Added
	new one-time call in rdt_get_tree() (inside code where resctrl_mutex
	is held). Same basic function to compute the number of RMIDs as the
	minimum across all enabled monitor resources.

=== 29 ===
=== 30 ===

0027:
0028:	V4 presented these as RFC to add a debug info file for a resource. But
	after some thought I changed strategy so that the per-resource function
	can choose the name. Also avoids it showing up as an empty file in
	the info directory for other resources.

=== 31 ===

0029:	No changes.



Background
----------

Telemetry features are being implemented in conjunction with the
IA32_PQR_ASSOC.RMID value on each logical CPU. This is used to send
counts for various events to a collector in a nearby OOBMSM device to be
accumulated with counts for each <RMID, event> pair received from other
CPUs. Cores send event counts when the RMID value changes, or after each
2ms elapsed time.

Each OOBMSM device may implement multiple event collectors with each
servicing a subset of the logical CPUs on a package.  In the initial
hardware implementation, there are two categories of events: energy
and perf.

1) Energy - Two counters
core_energy: This is an estimate of Joules consumed by each core. It is
calculated based on the types of instructions executed, not from a power
meter. This counter is useful to understand how much energy a workload
is consuming.

activity: This measures "accumulated dynamic capacitance". Users who
want to optimize energy consumption for a workload may use this rather
than core_energy because it provides consistent results independent of
any frequency or voltage changes that may occur during the runtime of
the application (e.g. entry/exit from turbo mode).

2) Performance - Seven counters
These are similar events to those available via the Linux "perf" tool,
but collected in a way with much lower overhead (no need to collect data
on every context switch).

stalls_llc_hit - Counts the total number of unhalted core clock cycles
when the core is stalled due to a demand load miss which hit in the LLC

c1_res - Counts the total C1 residency across all cores. The underlying
counter increments on 100MHz clock ticks

unhalted_core_cycles - Counts the total number of unhalted core clock
cycles

stalls_llc_miss - Counts the total number of unhalted core clock cycles
when the core is stalled due to a demand load miss which missed all the
local caches

c6_res - Counts the total C6 residency. The underlying counter increments
on crystal clock (25MHz) ticks

unhalted_ref_cycles - Counts the total number of unhalted reference clock
(TSC) cycles

uops_retired - Counts the total number of uops retired

The counters are arranged in groups in MMIO space of the OOBMSM device.
E.g. for the energy counters the layout is:

Offset: Counter
0x00	core energy for RMID 0
0x08	core activity for RMID 0
0x10	core energy for RMID 1
0x18	core activity for RMID 1
...

Enumeration
-----------

The only CPUID based enumeration for this feature is the legacy
CPUID(eax=7,ecx=0).ebx{12} that indicates the presence of the
IA32_PQR_ASSOC MSR and the RMID field within it.

The OOBMSM driver discovers which features are present via
PCIe VSEC capabilities. Each feature is tagged with a unique
identifier. These identifiers indicate which XML description file from
https://github.com/intel/Intel-PMT describes which event counters are
available and their layout within the MMIO BAR space of the OOBMSM device.

Resctrl User Interface
----------------------

Because there may be multiple OOBMSM collection agents per processor
package, resctrl accumulates event counts from all agents on a package
and presents a single value to users. This will provide a consistent
user interface on future platforms that vary the number of collectors,
or the mappings from logical CPUs to collectors.

Users will continue to see the legacy monitoring files in the "L3"
directories and the telemetry files in the new "PERF_PKG" directories
(with each file providing the aggregated value from all OOBMSM collectors
on that package).

$ tree /sys/fs/resctrl/mon_data/
/sys/fs/resctrl/mon_data/
├── mon_L3_00
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   └── mbm_total_bytes
├── mon_L3_01
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   └── mbm_total_bytes
├── mon_PERF_PKG_00
│   ├── activity
│   ├── c1_res
│   ├── c6_res
│   ├── core_energy
│   ├── stalls_llc_hit
│   ├── stalls_llc_miss
│   ├── unhalted_core_cycles
│   ├── unhalted_ref_cycles
│   └── uops_retired
└── mon_PERF_PKG_01
    ├── activity
    ├── c1_res
    ├── c6_res
    ├── core_energy
    ├── stalls_llc_hit
    ├── stalls_llc_miss
    ├── unhalted_core_cycles
    ├── unhalted_ref_cycles
    └── uops_retired

Resctrl Implementation
----------------------

The OOBMSM driver exposes "intel_pmt_get_regions_by_feature()"
that returns an array of structures describing the per-RMID groups it
found from the VSEC enumeration. Linux looks at the unique identifiers
for each group and enables resctrl for all groups with known unique
identifiers.

The memory map for the counters for each <RMID, event> pair is described
by the XML file. This is too unwieldy to use in the Linux kernel, so a
simplified representation is built into the resctrl code. Note that the
counters are in MMIO space instead of accessed using the IA32_QM_EVTSEL
and IA32_QM_CTR MSRs. This means there is no need for cross-processor
calls to read counters from a CPU in a specific domain. The counters
can be read from any CPU.

High level description of code changes:

1) New scope RESCTRL_PACKAGE
2) New struct rdt_resource RDT_RESOURCE_PERF_PKG
3) Refactor monitor code paths to split existing L3 paths from new ones. In some cases this ends up with:
        switch (r->rid) {
        case RDT_RESOURCE_L3:
                helper for L3
                break;
        case RDT_RESOURCE_PERF_PKG:
                helper for PKG
                break;
        }
4) New source code file "intel_aet.c" for the code to enumerate, configure, and report event counts.

With only one platform providing this feature, it's tricky to tell
exactly where it is going to go. I've made the event definitions
platform specific (based on the unique ID from the VSEC enumeration). It
seems possible/likely that the list of events may change from generation
to generation.

I've picked names for events based on the descriptions in the XML file.

Signed-off-by: Tony Luck <tony.luck@...el.com>

Tony Luck (29):
  x86,fs/resctrl: Consolidate monitor event descriptions
  x86,fs/resctrl: Replace architecture event enabled checks
  x86/resctrl: Remove 'rdt_mon_features' global variable
  x86,fs/resctrl: Prepare for more monitor events
  x86/rectrl: Fake OOBMSM interface
  x86,fs/resctrl: Improve domain type checking
  x86,fs/resctrl: Rename some L3 specific functions
  x86/resctrl: Move L3 initialization out of domain_add_cpu_mon()
  x86,fs/resctrl: Refactor domain_remove_cpu_mon() ready for new domain
    types
  x86/resctrl: Change generic domain functions to use struct
    rdt_domain_hdr
  x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain
  fs/resctrl: Make event details accessible to functions when reading
    events
  x86,fs/resctrl: Handle events that can be read from any CPU
  x86,fs/resctrl: Support binary fixed point event counters
  fs/resctrl: Add an architectural hook called for each mount
  x86/resctrl: Add and initialize rdt_resource for package scope core
    monitor
  x86/resctrl: Discover hardware telemetry events
  x86/resctrl: Count valid telemetry aggregators per package
  x86/resctrl: Complete telemetry event enumeration
  x86,fs/resctrl: Fill in details of Clearwater Forest events
  x86/resctrl: x86/resctrl: Read core telemetry events
  x86,fs/resctrl: Handle domain creation/deletion for
    RDT_RESOURCE_PERF_PKG
  x86/resctrl: Enable RDT_RESOURCE_PERF_PKG
  x86/resctrl: Add energy/perf choices to rdt boot option
  x86/resctrl: Handle number of RMIDs supported by telemetry resources
  x86,fs/resctrl: Move RMID initialization to first mount
  fs/resctrl: Add file system mechanism for architecture info file
  x86/resctrl: Add info/PERF_PKG_MON/status file
  x86/resctrl: Update Documentation for package events

 .../admin-guide/kernel-parameters.txt         |   2 +-
 Documentation/filesystems/resctrl.rst         |  53 ++-
 include/linux/resctrl.h                       |  97 ++++-
 include/linux/resctrl_types.h                 |  14 +
 arch/x86/include/asm/resctrl.h                |  16 -
 .../cpu/resctrl/fake_intel_aet_features.h     |  73 ++++
 arch/x86/kernel/cpu/resctrl/internal.h        |  30 +-
 fs/resctrl/internal.h                         |  87 ++--
 arch/x86/kernel/cpu/resctrl/core.c            | 314 ++++++++++----
 .../cpu/resctrl/fake_intel_aet_features.c     |  97 +++++
 arch/x86/kernel/cpu/resctrl/intel_aet.c       | 388 ++++++++++++++++++
 arch/x86/kernel/cpu/resctrl/monitor.c         |  67 +--
 fs/resctrl/ctrlmondata.c                      | 108 ++++-
 fs/resctrl/monitor.c                          | 262 +++++++-----
 fs/resctrl/rdtgroup.c                         | 259 ++++++++----
 arch/x86/Kconfig                              |   2 +-
 arch/x86/kernel/cpu/resctrl/Makefile          |   2 +
 17 files changed, 1439 insertions(+), 432 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.h
 create mode 100644 arch/x86/kernel/cpu/resctrl/fake_intel_aet_features.c
 create mode 100644 arch/x86/kernel/cpu/resctrl/intel_aet.c


base-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
base-branch: x86/cache
base-commit: 54d14f25664bbb75c2928dd0d64a095c0f488176
-- 
2.49.0


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ