Message-ID: <3780808a-c6b5-45ef-ab31-f8ce1153e9b6@intel.com>
Date: Fri, 3 Oct 2025 17:25:58 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: Tony Luck <tony.luck@...el.com>, Fenghua Yu <fenghuay@...dia.com>, "Maciej
 Wieczor-Retman" <maciej.wieczor-retman@...el.com>, Peter Newman
	<peternewman@...gle.com>, James Morse <james.morse@....com>, Babu Moger
	<babu.moger@....com>, Drew Fustini <dfustini@...libre.com>, Dave Martin
	<Dave.Martin@....com>, Chen Yu <yu.c.chen@...el.com>
CC: <x86@...nel.org>, <linux-kernel@...r.kernel.org>,
	<patches@...ts.linux.dev>
Subject: Re: [PATCH v11 30/31] x86,fs/resctrl: Update Documentation for
 package events

Hi Tony,

Two nits in subject:
"Documentation" -> "documentation"
"package events" -> "telemetry events"?
(this is the one and only instance of "package event" in this
series and it does not match the changelog that follows)

On 9/25/25 1:03 PM, Tony Luck wrote:
> Update resctrl filesystem documentation with the details about the
> resctrl files that support telemetry events.
> 
> Signed-off-by: Tony Luck <tony.luck@...el.com>
> ---
>  Documentation/filesystems/resctrl.rst | 100 ++++++++++++++++++++++----
>  1 file changed, 87 insertions(+), 13 deletions(-)
> 
> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
> index 006d23af66e1..cb6da9614f58 100644
> --- a/Documentation/filesystems/resctrl.rst
> +++ b/Documentation/filesystems/resctrl.rst
> @@ -168,13 +168,12 @@ with respect to allocation:
>  			bandwidth percentages are directly applied to
>  			the threads running on the core
>  
> -If RDT monitoring is available there will be an "L3_MON" directory
> +If L3 monitoring is available there will be an "L3_MON" directory
>  with the following files:
>  
>  "num_rmids":
> -		The number of RMIDs available. This is the
> -		upper bound for how many "CTRL_MON" + "MON"
> -		groups can be created.
> +		The number of RMIDs supported by hardware for
> +		L3 monitoring events.
>  
>  "mon_features":
>  		Lists the monitoring events if
> @@ -400,6 +399,19 @@ with the following files:
>  		bytes) at which a previously used LLC_occupancy
>  		counter can be considered for re-use.
>  
> +If telemetry monitoring is available there will be an "PERF_PKG_MON" directory
> +with the following files:
> +
> +"num_rmids":
> +		The number of RMIDs supported by hardware for
> +		telemetry monitoring events.

There may be some additional detail about how num_rmids is determined that could be
valuable to user space. From what I understand, user space seems to have some control
over this number in addition to it being "supported by hardware".

For example, if the PERF event group has more RMIDs than the ENERGY event group
and the user needs to do significant PERF monitoring, then it may be useful to know
that disabling ENERGY could increase the number of RMIDs available for that
monitoring.

Additionally, from patch #23 we learned that "supported by hardware" can have different meanings ...
it could be the number of RMIDs "supported" or it could mean the number of RMIDs
that can be reliably "counted". A user force-enabling an under-resourced event group will
thus encounter a num_rmids that does not match the (XML) spec.
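
Not a change request, just to illustrate the point: with the interface as proposed,
user space has to derive the effective group limit itself by reading both "num_rmids"
files and taking the smaller value, along the lines of the sketch below (assumes
resctrl is mounted at /sys/fs/resctrl and that both the L3_MON and PERF_PKG_MON
info directories from this series are present):

#include <stdio.h>

/* Sketch only: read a decimal value from a resctrl info file. */
static long read_num_rmids(const char *path)
{
	FILE *f = fopen(path, "r");
	long val = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

int main(void)
{
	long l3 = read_num_rmids("/sys/fs/resctrl/info/L3_MON/num_rmids");
	long pkg = read_num_rmids("/sys/fs/resctrl/info/PERF_PKG_MON/num_rmids");

	if (l3 < 0 || pkg < 0)
		return 1;

	/* Upper bound on CTRL_MON + MON groups per the text in this patch. */
	printf("group limit: %ld\n", pkg < l3 ? pkg : l3);
	return 0;
}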

> +
> +"mon_features":
> +		Lists the telemetry monitoring events that are enabled on this system.
> +
> +The upper bound for how many "CTRL_MON" + "MON" can be created
> +is the smaller of the L3_MON and PERF_PKG_MON "num_rmids" values.
> +
>  Finally, in the top level of the "info" directory there is a file
>  named "last_cmd_status". This is reset with every "command" issued
>  via the file system (making new directories or writing to any of the
> @@ -505,15 +517,40 @@ When control is enabled all CTRL_MON groups will also contain:
>  When monitoring is enabled all MON groups will also contain:
>  
>  "mon_data":
> -	This contains a set of files organized by L3 domain and by
> -	RDT event. E.g. on a system with two L3 domains there will
> -	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
> -	directories have one file per event (e.g. "llc_occupancy",
> -	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
> -	files provide a read out of the current value of the event for
> -	all tasks in the group. In CTRL_MON groups these files provide
> -	the sum for all tasks in the CTRL_MON group and all tasks in
> -	MON groups. Please see example section for more details on usage.
> +	This contains directories for each monitor domain. One set for
> +	each instance of an L3 cache, another set for each processor
> +	package. The L3 cache directories are named "mon_L3_00",

I still do not understand the "set" terminology. There is just one directory
per domain, no? For example, "This contains a directory for each monitoring domain of
a monitoring capable resource. One directory for each instance of an L3 cache
if L3 monitoring is available, another directory for each processor package if
telemetry monitoring is available."

> +	"mon_L3_01" etc. The package directories "mon_PERF_PKG_00",
> +	"mon_PERF_PKG_01" etc.
> +
> +	Within each directory there is one file per event. For
> +	example the L3 directories may contain "llc_occupancy", "mbm_total_bytes",
> +	and "mbm_local_bytes". The PERF_PKG directories may contain "core_energy",
> +	"activity", etc. The info/`*`/mon_features files provide the full
> +	list of event/file names.
> +
> +	"core energy" reports a floating point number for the energy (in Joules)
> +	consumed by cores (registers, arithmetic units, TLB and L1/L2 caches)
> +	during execution of instructions summed across all logical CPUs on a
> +	package for the current RMID.
> +
> +	"activity" also reports a floating point value (in Farads).
> +	This provides an estimate of work done independent of the
> +	frequency that the CPUs used for execution.
> +
> +	Note that these two counters only measure energy/activity

To help be specific:
""core energy" and "activity" only measure ..."

> +	in the "core" of the CPU (arithmetic units, TLB, L1 and L2
> +	caches, etc.). They do not include L3 cache, memory, I/O
> +	devices etc.
> +
> +	All other events report decimal integer values.
> +
> +	In a MON group these files provide a read out of the current
> +	value of the event for all tasks in the group. In CTRL_MON groups
> +	these files provide the sum for all tasks in the CTRL_MON group
> +	and all tasks in MON groups. Please see example section for more
> +	details on usage.
> +

Please keep the line length of this text consistent with the surrounding text.
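
Unrelated to the nit above, and purely as an illustration for readers: consuming one
of these per-event files is a plain text read. In the sketch below the group name "g1"
and the chosen domain/event are hypothetical; the real names come from the mounted
hierarchy and from info/*/mon_features:

#include <stdio.h>

int main(void)
{
	/*
	 * Sketch only: "g1" and "mon_PERF_PKG_00/core_energy" are examples;
	 * real names come from the resctrl hierarchy and the
	 * info/<resource>_MON/mon_features files.
	 */
	const char *path =
		"/sys/fs/resctrl/g1/mon_data/mon_PERF_PKG_00/core_energy";
	double joules;
	FILE *f = fopen(path, "r");

	if (!f)
		return 1;
	if (fscanf(f, "%lf", &joules) == 1)
		printf("core_energy: %f Joules\n", joules);
	fclose(f);
	return 0;
}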

>  	On systems with Sub-NUMA Cluster (SNC) enabled there are extra
>  	directories for each node (located within the "mon_L3_XX" directory
>  	for the L3 cache they occupy). These are named "mon_sub_L3_YY"
> @@ -1506,6 +1543,43 @@ Example with C::
>      resctrl_release_lock(fd);
>    }
>  
> +Debugfs
> +=======
> +In addition to the use of debugfs for tracing of pseudo-locking
> +performance, architecture code may create debugfs directories
> +associated with monitoring features for a specific resource.
> +
> +The full pathname for these is in the form:
> +
> +    /sys/kernel/debug/resctrl/info/{resource_name}_MON/{arch}/
> +
> +The presence, names, and format of these files may vary
> +between architectures even if the same resource is present.
> +
> +PERF_PKG_MON/x86_64
> +-------------------
> +Three files are present per telemetry aggregator instance
> +that show status.  The prefix of

Please be consistent with line length and do not trim lines so short.

> +each file name describes the type ("energy" or "perf") which
> +processor package it belongs to, and the instance number of
> +the aggregator. For example: "energy_pkg1_agg2".
> +
> +The suffix describes which data is reported in the file and
> +is one of:
> +
> +data_loss_count:
> +	This counts the number of times that this aggregator
> +	failed to accumulate a counter value supplied by a CPU.
> +
> +data_loss_timestamp:
> +	This is a "timestamp" from a free running 25MHz uncore
> +	timer indicating when the most recent data loss occurred.
> +
> +last_update_timestamp:
> +	Another 25MHz timestamp indicating when the
> +	most recent counter update was successfully applied.
> +
> +
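
Purely as an illustration of how I read this: these appear to be raw tick counts from
the 25MHz uncore timer, so converting to seconds would be something like the sketch
below (the file name assumes the "energy_pkg1_agg2" prefix and the suffix are joined
with an underscore, and that the value is exposed as a decimal tick count):

#include <stdio.h>

#define UNCORE_TIMER_HZ 25000000ULL

int main(void)
{
	/*
	 * Sketch only: the file name below assumes the prefix and suffix
	 * described above are joined with an underscore.
	 */
	const char *path =
		"/sys/kernel/debug/resctrl/info/PERF_PKG_MON/x86_64/"
		"energy_pkg1_agg2_last_update_timestamp";
	unsigned long long ticks;
	FILE *f = fopen(path, "r");

	if (!f)
		return 1;
	if (fscanf(f, "%llu", &ticks) == 1)
		printf("last update: %.6f seconds since timer start\n",
		       (double)ticks / UNCORE_TIMER_HZ);
	fclose(f);
	return 0;
}
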
>  Examples for RDT Monitoring along with allocation usage
>  =======================================================
>  Reading monitored data

Reinette
