[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20241121101845.1815660-5-Jonathan.Cameron@huawei.com>
Date: Thu, 21 Nov 2024 10:18:45 +0000
From: Jonathan Cameron <Jonathan.Cameron@...wei.com>
To: <linux-cxl@...r.kernel.org>, <linux-mm@...ck.org>,
<linux-perf-users@...r.kernel.org>, <linux-kernel@...r.kernel.org>
CC: <linuxarm@...wei.com>, <tongtiangen@...wei.com>, Yicong Yang
<yangyicong@...wei.com>, Niyas Sait <niyas.sait@...wei.com>,
<ajayjoshi@...ron.com>, Vandana Salve <vsalve@...ron.com>, Davidlohr Bueso
<dave@...olabs.net>, Dave Jiang <dave.jiang@...el.com>, Alison Schofield
<alison.schofield@...el.com>, Ira Weiny <ira.weiny@...el.com>, Dan Williams
<dan.j.williams@...el.com>, Alexander Shishkin
<alexander.shishkin@...ux.intel.com>, Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>, Arnaldo Carvalho de Melo <acme@...nel.org>,
Mark Rutland <mark.rutland@....com>, Gregory Price <gourry@...rry.net>, Huang
Ying <ying.huang@...el.com>
Subject: [RFC PATCH 4/4] hwtrace: Document CXL Hotness Monitoring Unit driver
Add basic documentation to describe the CXL HMU and the
perf AUX buffer based interfaces.
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@...wei.com>
---
Documentation/trace/cxl-hmu.rst | 197 ++++++++++++++++++++++++++++++++
Documentation/trace/index.rst | 1 +
2 files changed, 198 insertions(+)
diff --git a/Documentation/trace/cxl-hmu.rst b/Documentation/trace/cxl-hmu.rst
new file mode 100644
index 000000000000..f07a50ba608c
--- /dev/null
+++ b/Documentation/trace/cxl-hmu.rst
@@ -0,0 +1,197 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+CXL Hotness Monitoring Unit Driver
+==================================
+
+CXL r3.2 introduced the CXL Hotness Monitoring Unit (CHMU). A CHMU allows
+software running on a CXL Host to identify hot memory ranges, that is those with
+higher access frequency relative to other memory ranges.
+
+A given Logical Device (presentation of a CXL memory device seen by a particular
+host) can provide 1 or more CHMU each of which supports 1 or more separately
+programmable CHMU Instances (CHMUI). These CHMUI are mostly independent with
+the exception that there can be restrictions on them tracking the same memory
+regions. The CHMUs are always completely independent.
+The naming of the units is cxl_hmu_memX.Y.Z where memX matches the naming
+of the memory device in /sys/bus/cxl/devices/, Y is the CHMU index and
+Z is the CHMUI index with the CHMU.
+
+Each CHMUI provides a ring buffer structure known as the Hot List from which the
+host an read back entries that describe the hotness of particular region of
+memory (Hot List Units). The Hot List Unit combines a Unit Address and an access
+count for the particular address. Unit address to DPA requires multiplication
+by the unit size. Thus, for large unit sizes the device may support higher
+counts. It is these Hot List Units that the driver provides via a perf AUX
+buffer by copying them from PCI BAR space.
+
+The unit size at which hotness is measured is configurable for each CHMUI and
+all measurement is done in Device Physical Address space. To relate this to
+Host Physical Address space the HDM (Host-Managed Device Memory) decoder
+configuration must be taken into account to reflect the placement in a
+CXL Fixed Memory Window and any interleaving.
+
+The CHMUI can support interrupts on fills above a watermark, or on overflow
+of the hotlist.
+
+A CHMUI can support two different basic modes of operation. Epoch and
+Always On. These affect what is placed on the hotlist. Note that the actual
+implementation of tracking is implementation defined and likely to be
+inherently imprecise in that the hottest pages may not be discovered due to
+resource exhaustion and the hotness counts may not represent accurately how
+hot they are. The specification allows for a very high degree of flexibility
+in implementation, important as it is likely that a number of different
+hardware implementations will be chosen to suit particular silicon and accuracy
+budgets.
+
+Operation and configuration
+===========================
+
+An example command line is::
+
+ $perf record -a -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
+ hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,\
+ range_size=1024,randomized_downsampling=0,downsampling_factor=32,\
+ hotness_granual=12
+
+ $perf report --dump-raw-traces
+
+which will produce a list of hotlist entries, one per line with a short header
+to provide sufficient information to interpret the entries::
+
+ . ... CXL_HMU data: size 33512 bytes
+ Header 0: units: 29c counter_width 10
+ Header 1 : deadbeef
+ 0000000000000283
+ 0000000000010364
+ 0000000000020366
+ 000000000003033c
+ 0000000000040343
+ 00000000000502ff
+ 000000000006030d
+ 000000000007031a
+ ...
+
+The least significant counter_width bits (here 16, hex 10) are the counter
+value, all higher bits are the unit index. Multiply by the unit size
+to get a Device Physical Address.
+
+The parameters are as follows:
+
+epoch_type
+----------
+
+Two values may be supported::
+
+ 0 - Epoch based operation
+ 1 - Always on operation
+
+
+0. Epoch Based Operation
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+An Epoch is a period of time after which a counter is assessed for hotness.
+
+The device may have a global sense of an Epoch but it may also operate them on
+a per counter, or per region of device basis. This is a function of the
+implementation and is not controllable, but is discoverable. In a global Epoch
+scheme at start of each Epoch all counters are zeroed / deallocated. Counters
+are then allocated in a hardware specific manner and accesses counted. At the
+completion of the Epoch the counters are compared with a threshold and entries
+with a count above a configurable threshold are added to the hotlist. A new
+Epoch is then begun with all counters cleared.
+
+In non-global Epoch scheme, when the Epoch of a given counter begins is not
+specified. An example might be an Epoch for counter only starting on first
+touch to the relevant memory region. When a local Epoch ends the counter is
+compared to the threshold and if appropriate added to the hotlist.
+
+Note, in Epoch Based Operation, the counter in the hotlist entry provides
+information on how hot the memory is as the counter for the full Epoch is
+provided.
+
+1. Always on Operation
+~~~~~~~~~~~~~~~~~~~~~~
+
+In this mode, counters may all be reset before enabling the CHMUI. Then
+counters are allocated to particular memory units via an hardware specific
+method, perhaps on first touch. When a counter passes the configurable
+hotness threshold an entry is added to the hotlist and that counter is freed
+for reuse.
+
+In this scheme the count provided in the hotlist entry is not useful as it will
+depend only on the configured threshold.
+
+access_type
+-----------
+
+The parameter controls which access are counted::
+
+ 1 - Non-TEE read only
+ 2 - Non-TEE write only
+ 3 - Non-TEE read and write
+ 4 - TEE and Non-TEE read only
+ 5 - TEE and Non-TEE write only
+ 6 - TEE and Non-tee read and write
+
+
+TEE here refers to a trusted execution environment, specifically one that
+results in the T bit being set in the CXL transactions.
+
+
+hotness_granual
+---------------
+
+Unit size at which tracking is performed. Must be at least 256 bytes but
+hardware may only support some sizes. Expressed as a power of 2. e.g. 12 = 4kiB.
+
+hotness_threshold
+-----------------
+
+This is the minimum counter value that must be reached for the unit to count as
+hot and be added to the hotlist.
+
+The possible range may be dependent on the unit size as a larger unit size
+requires more bits on the hotlist entry leaving fewer available for the hotness
+counter.
+
+epoch_multiplier and epoch_scale
+--------------------------------
+
+The length of an epoch (in epoch mode) is controlled by these two parameters
+with the decoded epoch_scale multiplied by the epoch_multiplier to give the
+overall epoch length.
+
+epoch_scale::
+
+ 1 - 100 usecs
+ 2 - 1 msec
+ 3 - 10 msecs
+ 4 - 100 msecs
+ 5 - 1 second
+
+range_base and range_scale
+--------------------------
+
+Expressed in terms of the unit size set via hotness_granual. Each CHMUI has a
+bitmap that controls what Device Physical Address spaces is tracked. Each bit
+represents 256MiB of DPA space.
+
+This interface provides a simple base and size in units of 256MiB to configure
+this bitmap. All bits in the specified range will be set.
+
+downsampling_factor
+-------------------
+
+Hardware may be incapable of counting accesses at full speed or it may be
+desirable to count over a longer period during which the counters would
+overflow. This control allows selection of a down sampling factor expressed
+as a power of 2 between 1 and 32768. Default is minimum supported downsampling
+factor.
+
+randomized_downsampling
+-----------------------
+
+To avoid problems with downsampling when accesses are periodic this option
+allows for an implementation defined randomization of the sampling interval,
+whilst remaining close to the specified downsampling_factor.
diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 0b300901fd75..b35ed8e9dfa9 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -36,3 +36,4 @@ Linux Tracing Technologies
user_events
rv/index
hisi-ptt
+ cxl-hmu
--
2.43.0
Powered by blists - more mailing lists