Message-ID: <802b22da-199a-a724-972b-9bc0cabd43fb@huawei.com>
Date: Wed, 1 Dec 2021 17:45:12 +0800
From: Chengchang Tang <tangchengchang@...wei.com>
To: <Brice.Goglin@...ia.fr>
CC: <hwloc-devel@...ts.open-mpi.org>, <linux-kernel@...r.kernel.org>,
<song.bao.hua@...ilicon.com>, <linuxarm@...wei.com>,
"shenyang (M)" <shenyang39@...wei.com>,
Jonathan Cameron <jonathan.cameron@...wei.com>,
yangyicong <yangyicong@...wei.com>
Subject: [RFC] hwloc: Add support for exporting latency, bandwidth topology
through calibration

Currently, hwloc can export hardware and network locality so that
applications can obtain and set their affinity. In many scenarios,
however, the information provided by the topology is not enough: for
example, it cannot reflect the actual memory latency and bandwidth
between different scheduling domains. We would like to provide more
detailed and precise information about HW capabilities in hwloc by
adding several new calibration tools, so that applications can adopt a
more refined design, achieve higher performance, and fully tap the
capabilities of the HW.
We mainly focus on exposing memory/bus bandwidth, cache-coherence/bus
communication latency, etc. to users. This topology information has
neither a standard ACPI nor a devicetree interface to export it, but it
can benefit user applications. Some examples:
1. the memory bandwidth when we spread tasks across multiple clusters
vs. gather them in one cluster
2. the memory bandwidth when we spread tasks across multiple NUMA
nodes vs. gather them in one NUMA node
3. the cache-synchronization latency when we spread tasks across
multiple clusters vs. gather them in one cluster
4. the cache-synchronization latency when we spread tasks across
multiple NUMA nodes vs. gather them in one NUMA node
5. bus bandwidth and congestion in complex topologies; for example, in
the topology below
node1 - node0 - node2 - node3
the bus between node0 and node2 may become a bottleneck, since
communications between node1 and node3 also depend on it.
NUMA distance cannot describe this kind of complex bus topology at all.
6. I/O bandwidth and latency when we access I/O devices such as
accelerators, networks and storage from the NUMA node the devices belong
to vs. from a different NUMA node.
...
If possible, we can also export more, such as IPC bandwidth and latency
(for example, over a pipe), spinlock/mutex latency, etc. The calibration
tools will provide this data for different entities at certain topology
levels, so that applications can choose a spreading or gathering
strategy for their threads according to it.
The design of the calibration tool will be similar to netloc's. Three
steps are required to use it.
The first step is to obtain data about system bandwidth, latency, etc.
by running benchmark tests, since a standard operating system does not
provide this information. The raw data will be saved to files. This
step may need to be performed by a privileged user.
The second step is for the calibration tool to convert the raw files
generated in the previous step into files in a readable format. No
privileges are required for this step.
In the third step, applications can obtain the calibration information
of the system through C APIs exposed by the calibration tool, and the
hwloc commands can also be extended to show this new information. The
source of the calibration data is the readable file generated in the
second step. E.g. hwloc_get_mem_bandwidth(hwloc_topology_t topology,
unsigned idx1, unsigned idx2) could be used to get the memory bandwidth
between idx1 and idx2 at some topology level.