Message-ID: <53dcb55c-f5b6-4cb8-96b6-07aa1ba1d4d4@intel.com>
Date: Fri, 18 Apr 2025 14:13:39 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: Tony Luck <tony.luck@...el.com>, Fenghua Yu <fenghuay@...dia.com>, "Maciej
Wieczor-Retman" <maciej.wieczor-retman@...el.com>, Peter Newman
<peternewman@...gle.com>, James Morse <james.morse@....com>, Babu Moger
<babu.moger@....com>, Drew Fustini <dfustini@...libre.com>, Dave Martin
<Dave.Martin@....com>, Anil Keshavamurthy <anil.s.keshavamurthy@...el.com>
CC: <linux-kernel@...r.kernel.org>, <patches@...ts.linux.dev>
Subject: Re: [PATCH v3 00/26] x86/resctrl telemetry monitoring
Hi Tony,
On 4/7/25 4:40 PM, Tony Luck wrote:
> Previous version here:
> https://lore.kernel.org/all/20250321231609.57418-1-tony.luck@intel.com/
>
> This series is based on James Morse's "fs/resctrl/" snapshot.
It would be helpful to provide a link to the snapshot used, to avoid any
uncertainty about which base to use.
>
> Background
>
> Telemetry features are being implemented in conjunction with the
> IA32_PQR_ASSOC.RMID value on each logical CPU. This is used to send
> counts for various events to a collector in a nearby OOBMSM device to be
> accumulated with counts for each <RMID, event> pair received from other
> CPUs. Cores send event counts when the RMID value changes, or after each
> 2ms elapsed time.
>
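To make the mechanism above concrete: the RMID reaches each logical CPU the
same way as for legacy L3 monitoring. A minimal sketch of that existing
context-switch path (the helper name is mine, not code from this series):

    #include <asm/msr.h>    /* wrmsr(), MSR_IA32_PQR_ASSOC */

    /*
     * Program the active RMID/CLOSID on the current CPU. The RMID lives
     * in the low half of IA32_PQR_ASSOC, the CLOSID in the high half;
     * the telemetry collectors key their per-RMID accumulation off this
     * same register.
     */
    static void set_active_rmid(u32 closid, u32 rmid)
    {
            wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
    }
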
> Each OOBMSM device may implement multiple event collectors with each
> servicing a subset of the logical CPUs on a package. In the initial
> hardware implementation, there are two categories of events:
>
(The two categories of events are missing here; they are only described
after the counter layout below.)
> The counters are arranged in groups in MMIO space of the OOBMSM device.
> E.g. for the energy counters the layout is:
>
> Offset: Counter
> 0x00 core energy for RMID 0
> 0x08 core activity for RMID 0
> 0x10 core energy for RMID 1
> 0x18 core activity for RMID 1
>
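The layout above is regular, so the MMIO offset of any <RMID, event> pair
can be computed directly. A sketch (helper name and parameters are
illustrative, not from this series):

    /* Counters are packed per RMID, 8 bytes per event, in the order the
     * group's XML description lists them (2 for the energy group). */
    static u64 pmt_counter_offset(u32 rmid, u32 evt_idx, u32 events_per_rmid)
    {
            return ((u64)rmid * events_per_rmid + evt_idx) * 8;
    }

For example, core activity (event 1) for RMID 1 in the energy group gives
(1 * 2 + 1) * 8 = 0x18, matching the table above.
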
> 1) Energy - Two counters
> core_energy: This is an estimate of Joules consumed by each core. It is
> calculated based on the types of instructions executed, not from a power
> meter. This counter is useful to understand how much energy a workload
> is consuming.
>
> activity: This measures "accumulated dynamic capacitance". Users who
> want to optimize energy consumption for a workload may use this rather
> than core_energy because it provides consistent results independent of
> any frequency or voltage changes that may occur during the runtime of
> the application (e.g. entry/exit from turbo mode).
>
> 2) Performance - Seven counters
> These are similar events to those available via the Linux "perf" tool,
> but collected in a way with mush lower overhead (no need to collect data
"mush" -> "much"
> on every context switch).
>
> stalls_llc_hit - Counts the total number of unhalted core clock cycles
> when the core is stalled due to a demand load miss which hit in the LLC
>
> c1_res - Counts the total C1 residency across all cores. The underlying
> counter increments on 100MHz clock ticks
>
> unhalted_core_cycles - Counts the total number of unhalted core clock
> cycles
>
> stalls_llc_miss - Counts the total number of unhalted core clock cycles
> when the core is stalled due to a demand load miss which missed all the
> local caches
>
> c6_res - Counts the total C6 residency. The underlying counter increments
> on crystal clock (25MHz) ticks
>
> unhalted_ref_cycles - Counts the total number of unhalted reference clock
> (TSC) cycles
>
> uops_retired - Counts the total number of uops retired
>
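Since the two residency counters tick at different rates, converting raw
counts to time needs a per-event clock. A sketch based only on the rates
stated above (names are illustrative):

    /* c1_res ticks at 100 MHz (10 ns/tick); c6_res ticks on the
     * 25 MHz crystal clock (40 ns/tick). */
    #define C1_RES_NS_PER_TICK      10
    #define C6_RES_NS_PER_TICK      40

    static inline u64 c1_res_to_ns(u64 count)
    {
            return count * C1_RES_NS_PER_TICK;
    }

    static inline u64 c6_res_to_ns(u64 count)
    {
            return count * C6_RES_NS_PER_TICK;
    }
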
> Enumeration
>
> The only CPUID based enumeration for this feature is the legacy
> CPUID(eax=7,ecx=0).ebx{12} that indicates the presence of the
> IA32_PQR_ASSOC MSR and the RMID field within it.
>
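That CPUID bit is already cached by the kernel, so the check amounts to
something like the sketch below (I assume the series leans on resctrl's
existing detection rather than open-coding this):

    #include <asm/cpufeature.h>     /* boot_cpu_has() */

    static bool rmid_tagging_available(void)
    {
            /* CPUID(eax=7,ecx=0).EBX bit 12 is cached as X86_FEATURE_CQM:
             * IA32_PQR_ASSOC and its RMID field are present. */
            return boot_cpu_has(X86_FEATURE_CQM);
    }
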
> The OOBMSM driver discovers which features are present via
> PCIe VSEC capabilities. Each feature is tagged with a unique
> identifier. These identifiers indicate which XML description file from
> https://github.com/intel/Intel-PMT describes which event counters are
> available and their layout within the MMIO BAR space of the OOBMSM device.
>
> Resctrl User Interface
>
> Because there may be multiple OOBMSM collection agents per processor
> package, resctrl accumulates event counts from all agents on a package
> and presents a single value to users. This will provide a consistent
> user interface on future platforms that vary the number of collectors,
> or the mappings from logical CPUs to collectors.
>
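A sketch of that per-package aggregation (the struct and helper names are
illustrative; pmt_counter_offset() is the hypothetical helper sketched
earlier):

    #include <asm/io.h>             /* readq() */

    /* One handle per OOBMSM collection agent on the package. */
    struct pmt_collector {
            void __iomem *base;     /* mapped counter MMIO region */
    };

    /* Sum one <RMID, event> counter across all collectors on a package. */
    static u64 pkg_event_count(struct pmt_collector *c, int num_collectors,
                               u32 rmid, u32 evt_idx, u32 events_per_rmid)
    {
            u64 sum = 0;
            int i;

            for (i = 0; i < num_collectors; i++)
                    sum += readq(c[i].base +
                                 pmt_counter_offset(rmid, evt_idx,
                                                    events_per_rmid));
            return sum;
    }
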
> Users will see the legacy monitoring files in the "L3" directories
> and the telemetry files in "PKG" directories (with each file
Should this now be PERF_PKG?
> providing the aggregated value from all OOBMSM collectors on that
> package).
>
> $ tree /sys/fs/resctrl/mon_data/
> /sys/fs/resctrl/mon_data/
> ├── mon_L3_00
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> ├── mon_L3_01
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> ├── mon_PKG_00
> │   ├── activity
> │   ├── c1_res
> │   ├── c6_res
> │   ├── core_energy
> │   ├── stalls_llc_hit
> │   ├── stalls_llc_miss
> │   ├── unhalted_core_cycles
> │   ├── unhalted_ref_cycles
> │   └── uops_retired
> └── mon_PKG_01
>     ├── activity
>     ├── c1_res
>     ├── c6_res
>     ├── core_energy
>     ├── stalls_llc_hit
>     ├── stalls_llc_miss
>     ├── unhalted_core_cycles
>     ├── unhalted_ref_cycles
>     └── uops_retired
>
> Resctrl Implementation
>
> The OOBMSM driver exposes a function "intel_pmt_get_regions_by_feature()"
(nit: no need to use "a function" if using ())
> that returns an array of structures describing the per-RMID groups it
> found from the VSEC enumeration. Linux looks at the unique identifiers
> for each group and enables resctrl for all groups with known unique
> identifiers.
>
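A sketch of how resctrl might consume that interface (the struct layout,
feature-ID constant, and the two helpers here are assumptions on my side;
the exact API is whatever the OOBMSM patches in this series define):

    static void pmt_probe_telemetry(void)
    {
            struct pmt_feature_group *p;
            int i;

            p = intel_pmt_get_regions_by_feature(FEATURE_PER_RMID_ENERGY_TELEM);
            if (IS_ERR_OR_NULL(p))
                    return;

            for (i = 0; i < p->count; i++) {
                    /* Enable resctrl only for recognized unique IDs. */
                    if (known_unique_id(p->regions[i].guid))
                            enable_pkg_events(&p->regions[i]);
            }
    }
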
> The memory map for the counters for each <RMID, event> pair is described
> by the XML file. This is too unwieldy to use in the Linux kernel, so a
> simplified representation is built into the resctrl code. Note that the
> counters are in MMIO space instead of accessed using the IA32_QM_EVTSEL
> and IA32_QM_CTR MSRs. This means there is no need for cross-processor
> calls to read counters from a CPU in a specific domain. The counters
> can be read from any CPU.
>
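That is a notable simplification relative to the L3 events. A contrast
sketch (the first line roughly mirrors resctrl's existing IPI-based read
path; the second reuses the hypothetical offset helper from earlier):

    /* Legacy L3 events: must execute on a CPU in the target domain. */
    smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);

    /* Telemetry events: plain MMIO, readable from any CPU, no IPI. */
    val = readq(base + pmt_counter_offset(rmid, evt_idx, events_per_rmid));
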
> High level description of code changes:
>
> 1) New scope RESCTRL_PACKAGE
> 2) New struct rdt_resource RDT_RESOURCE_INTEL_PMT
> 3) Refactor monitor code paths to split existing L3 paths from new ones. In some cases this ends up with:
>     switch (r->rid) {
>     case RDT_RESOURCE_L3:
>             /* helper for L3 */
>             break;
>     case RDT_RESOURCE_INTEL_PMT:
>             /* helper for PKG */
>             break;
>     }
> 4) New source code file "intel_pmt.c" for the code to enumerate, configure, and report event counts.
Needs an update to match the new version of this work.
>
> With only one platform providing this feature, it's tricky to tell
> exactly where it is going to go. I've made the event definitions
> platform specific (based on the unique ID from the VSEC enumeration). It
> seems possible/likely that the list of events may change from generation
> to generation.
>
> I've picked names for events based on the descriptions in the XML file.
One aspect that is only hinted at in the final documentation patch is
how users are expected to use this feature. As I understand it, the number
of monitor groups supported by resctrl is still guided by the number of RMIDs
supported by L3 monitoring. This work hints that the telemetry feature may
not support that many RMIDs, so a monitor group may exist yet return
"unavailable" whenever a user attempts to read any of its perf files.
The series attempts to address this by placing the number of RMIDs available
for this feature in a "num_rmids" file, but since the RMID assigned to a monitor
group is not exposed to user space (unless debugging is enabled) the user does
not know whether a monitor group will support this feature or not. This seems
awkward to me. Why not limit the number of monitor groups that can be created
to the minimum number of RMIDs across these resources, as is done for CLOSIDs?
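Something along these lines is what I have in mind (a sketch; the helper
name is made up, but the iteration macro and num_rmid field exist in
resctrl today):

    /* Cap monitor group creation at the smallest num_rmid across all
     * mon-capable resources, mirroring how CLOSIDs are limited. */
    static u32 resctrl_min_num_rmid(void)
    {
            struct rdt_resource *r;
            u32 num_rmid = U32_MAX;

            for_each_mon_capable_rdt_resource(r)
                    num_rmid = min_t(u32, num_rmid, r->num_rmid);

            return num_rmid;
    }
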
Reinette